linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch
@ 2018-03-21 22:57 Dan Williams
  2018-03-21 22:57 ` [PATCH v7 01/14] dax: store pfns in the radix Dan Williams
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Darrick J. Wong, Jeff Moyer, Alexander Viro,
	Andreas Dilger, Jan Kara, Jan Kara, Michal Hocko,
	kbuild test robot, Christoph Hellwig, Ross Zwisler,
	Matthew Wilcox, Dave Chinner, Jérôme Glisse,
	linux-ext4, Theodore Ts'o, Andrew Morton, linux-fsdevel,
	linux-xfs, linux-kernel, jack

Changes since v6 [1]:
* Collect some more reviewed-by's from Christoph, thanks Christoph!

* Rework XFS_MMAPLOCK_EXCL handling to *permit* rather than require
  XFS_MMAPLOCK_EXCL to be held over calls to xfs_break_layouts()
  (Christoph)

* Clean up kbuild robot reports against "ext2, dax: introduce ext2_dax_aops"
  and "fs, dax: use page->mapping to warn if truncate collides with a
  busy page".

* Squash the xfs_break_leased_layouts() 'did_unlock' rework into "xfs:
  prepare xfs_break_layouts() for another layout type" (Christoph)

* Communicate the 'did_unlock' condition with an out parameter rather
  than a positive error code (Christoph). A few other small / welcome
  clean ups fell out as a result.

* Rename BREAK_TRUNCATE to BREAK_UNMAPI to make it clear the
  implementation is concerned with any and all inode extent unmap
  events, not just truncate(2). (Darrick)

* Rebase the branch on commit 6b2bb7265f0b "sched/wait: Introduce
  wait_var_event()" from tip.git/sched/wait. Thanks Peter!

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014806.html

---

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

---

Dan Williams (14):
      dax: store pfns in the radix
      fs, dax: prepare for dax-specific address_space_operations
      block, dax: remove dead code in blkdev_writepages()
      xfs, dax: introduce xfs_dax_aops
      ext4, dax: introduce ext4_dax_aops
      ext2, dax: introduce ext2_dax_aops
      fs, dax: use page->mapping to warn if truncate collides with a busy page
      mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
      mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
      memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
      mm, fs, dax: handle layout changes to pinned dax mappings
      xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
      xfs: prepare xfs_break_layouts() for another layout type
      xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/super.c      |   96 +++++++++++++++++--
 drivers/nvdimm/pmem.c    |    3 -
 fs/Kconfig               |    1 
 fs/block_dev.c           |    5 -
 fs/dax.c                 |  231 ++++++++++++++++++++++++++++++++++++----------
 fs/ext2/ext2.h           |    1 
 fs/ext2/inode.c          |   43 +++++----
 fs/ext2/namei.c          |   18 ----
 fs/ext2/super.c          |    6 +
 fs/ext4/inode.c          |   38 ++++++--
 fs/ext4/super.c          |    6 +
 fs/libfs.c               |   27 +++++
 fs/xfs/xfs_aops.c        |   21 +++-
 fs/xfs/xfs_aops.h        |    1 
 fs/xfs/xfs_file.c        |   76 ++++++++++++++-
 fs/xfs/xfs_inode.h       |   16 +++
 fs/xfs/xfs_ioctl.c       |    8 --
 fs/xfs/xfs_iops.c        |   21 +++-
 fs/xfs/xfs_pnfs.c        |   16 ++-
 fs/xfs/xfs_pnfs.h        |    6 +
 fs/xfs/xfs_super.c       |   20 ++--
 include/linux/dax.h      |   46 ++++++++-
 include/linux/fs.h       |    3 +
 include/linux/memremap.h |   28 ++----
 include/linux/mm.h       |   61 +++++++++---
 kernel/memremap.c        |   32 +++++-
 mm/Kconfig               |    5 +
 mm/gup.c                 |    5 +
 mm/hmm.c                 |   13 ---
 mm/swap.c                |    3 -
 30 files changed, 638 insertions(+), 218 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v7 01/14] dax: store pfns in the radix
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-21 22:57 ` [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations Dan Williams
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jeff Moyer, Christoph Hellwig, Matthew Wilcox, Ross Zwisler,
	Jan Kara, david, linux-fsdevel, linux-xfs, linux-kernel, jack

In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |   15 +++++++--
 fs/dax.c            |   83 +++++++++++++++++++--------------------------------
 2 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
-	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-			|| pfn_t_devmap(pfn))
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+		/*
+		 * An arch that has enabled the pmem api should also
+		 * have its drivers support pfn_t_devmap()
+		 *
+		 * This is a developer warning and should not trigger in
+		 * production. dax_flush() will crash since it depends
+		 * on being able to do (page_address(pfn_to_page())).
+		 */
+		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+	} else if (pfn_t_devmap(pfn)) {
 		/* pass */;
-	else {
+	} else {
 		pr_debug("VFS (%s): error: dax support not enabled\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto put_unlocked;
 	/*
 	 * Entry got reallocated elsewhere? No need to writeback. We have to
-	 * compare sectors as we must not bail out due to difference in lockbit
+	 * compare pfns as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -712,33 +709,15 @@ static int dax_writeback_one(struct block_device *bdev,
 	/*
 	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
 	 * in the middle of a PMD, the 'index' we are given will be aligned to
-	 * the start index of the PMD, as will the sector we pull from
-	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
-	 * worry about partial PMD writebacks.
+	 * the start index of the PMD, as will the pfn we pull from 'entry'.
+	 * This allows us to flush for PMD_SIZE and not have to worry about
+	 * partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -749,8 +728,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -808,8 +785,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -877,6 +854,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -884,14 +862,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						 dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						 0, write && !sync);
 		if (IS_ERR(entry)) {
 			error = PTR_ERR(entry);
@@ -1280,13 +1258,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(ret))
 		goto fallback;
@@ -1409,8 +1389,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						RADIX_DAX_PMD, write && !sync);
 		if (IS_ERR(entry))
 			goto finish_iomap;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-21 22:57 ` [PATCH v7 01/14] dax: store pfns in the radix Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 15:28   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages() Dan Williams
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jeff Moyer, Ross Zwisler, Matthew Wilcox, Jan Kara,
	Christoph Hellwig, Christoph Hellwig, Dave Chinner,
	linux-fsdevel, linux-xfs, linux-kernel, jack

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/libfs.c          |   27 +++++++++++++++++++++++++++
 include/linux/dax.h |   12 +++++++++---
 include/linux/fs.h  |    3 +++
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..e49d0ac6f800 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1060,6 +1060,33 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(noop_fsync);
 
+int noop_set_page_dirty(struct page *page)
+{
+	/*
+	 * Unlike __set_page_dirty_no_writeback that handles dirty page
+	 * tracking in the page object, dax does all dirty tracking in
+	 * the inode address_space in response to mkwrite faults. In the
+	 * dax case we only need to worry about potentially dirty CPU
+	 * caches, not dirty page cache pages to write back.
+	 *
+	 * This callback is defined to prevent fallback to
+	 * __set_page_dirty_buffers() in set_page_dirty().
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(noop_set_page_dirty);
+
+void noop_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length)
+{
+	/*
+	 * There is no page cache to invalidate in the dax case, however
+	 * we need this callback defined to prevent falling back to
+	 * block_invalidatepage() in do_invalidatepage().
+	 */
+}
+EXPORT_SYMBOL_GPL(noop_invalidatepage);
+
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..ae27a7efe7ab 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
 }
 #endif
 
+struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 int __bdev_dax_supported(struct super_block *sb, int blocksize);
@@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 {
 	return NULL;
 }
+
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 int dax_read_lock(void);
@@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
-struct writeback_control;
-int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..b57db31d294d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3129,6 +3129,9 @@ extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *,
 			 struct inode *, struct dentry *, unsigned int);
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
+extern int noop_set_page_dirty(struct page *page);
+extern void noop_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages()
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-21 22:57 ` [PATCH v7 01/14] dax: store pfns in the radix Dan Williams
  2018-03-21 22:57 ` [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 15:29   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops Dan Williams
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jeff Moyer, Ross Zwisler, Matthew Wilcox, Jan Kara, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-kernel, jack

Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe09ef9c21f3..846ee2d31781 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1946,11 +1946,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 			     struct writeback_control *wbc)
 {
-	if (dax_mapping(mapping)) {
-		struct block_device *bdev = I_BDEV(mapping->host);
-
-		return dax_writeback_mapping_range(mapping, bdev, wbc);
-	}
 	return generic_writepages(mapping, wbc);
 }
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (2 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages() Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 15:30   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops Dan Williams
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jeff Moyer, Matthew Wilcox, Ross Zwisler, Christoph Hellwig,
	Jan Kara, Dave Chinner, linux-fsdevel, linux-xfs, linux-kernel,
	jack

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_aops.c |   21 +++++++++++++++++----
 fs/xfs/xfs_aops.h |    1 +
 fs/xfs/xfs_iops.c |    5 ++++-
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..5f1f5948ecc2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1194,16 +1194,22 @@ xfs_vm_writepages(
 	int			ret;
 
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
-	if (dax_mapping(mapping))
-		return dax_writeback_mapping_range(mapping,
-				xfs_find_bdev_for_inode(mapping->host), wbc);
-
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)
 		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
 	return ret;
 }
 
+STATIC int
+xfs_dax_writepages(
+	struct address_space	*mapping,
+	struct writeback_control *wbc)
+{
+	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	return dax_writeback_mapping_range(mapping,
+			xfs_find_bdev_for_inode(mapping->host), wbc);
+}
+
 /*
  * Called to move a page into cleanable state - and from there
  * to be released. The page should already be clean. We always
@@ -1505,3 +1511,10 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+	.direct_IO		= xfs_vm_direct_IO,
+	.writepages		= xfs_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		inode->i_mapping->a_ops = &xfs_address_space_operations;
+		if (IS_DAX(inode))
+			inode->i_mapping->a_ops = &xfs_dax_aops;
+		else
+			inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (3 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 15:40   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops Dan Williams
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Theodore Ts'o, Andreas Dilger, linux-ext4, Jan Kara, david,
	hch, linux-fsdevel, linux-xfs, linux-kernel, jack

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c |   38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..f9884e41cb39 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2725,12 +2725,6 @@ static int ext4_writepages(struct address_space *mapping,
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	if (dax_mapping(mapping)) {
-		ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
-						  wbc);
-		goto out_writepages;
-	}
-
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
@@ -2955,6 +2949,27 @@ static int ext4_writepages(struct address_space *mapping,
 	return ret;
 }
 
+static int ext4_dax_writepages(struct address_space *mapping,
+			       struct writeback_control *wbc)
+{
+	int ret;
+	long nr_to_write = wbc->nr_to_write;
+	struct inode *inode = mapping->host;
+	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+		return -EIO;
+
+	percpu_down_read(&sbi->s_journal_flag_rwsem);
+	trace_ext4_writepages(inode, wbc);
+
+	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	trace_ext4_writepages_result(inode, wbc, ret,
+				     nr_to_write - wbc->nr_to_write);
+	percpu_up_read(&sbi->s_journal_flag_rwsem);
+	return ret;
+}
+
 static int ext4_nonda_switch(struct super_block *sb)
 {
 	s64 free_clusters, dirty_clusters;
@@ -3946,6 +3961,13 @@ static const struct address_space_operations ext4_da_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static const struct address_space_operations ext4_dax_aops = {
+	.direct_IO		= ext4_direct_IO,
+	.writepages		= ext4_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
+
 void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3980,9 @@ void ext4_set_aops(struct inode *inode)
 	default:
 		BUG();
 	}
-	if (test_opt(inode->i_sb, DELALLOC))
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext4_dax_aops;
+	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (4 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 15:44   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, kbuild test robot, david, hch, linux-fsdevel,
	linux-xfs, linux-kernel, jack

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara <jack@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/ext2.h  |    1 +
 fs/ext2/inode.c |   43 +++++++++++++++++++++++++++----------------
 fs/ext2/namei.c |   18 ++----------------
 3 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..8ea394c8ffce 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -952,17 +952,16 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-#ifdef CONFIG_FS_DAX
-	if (dax_mapping(mapping)) {
-		return dax_writeback_mapping_range(mapping,
-						   mapping->host->i_sb->s_bdev,
-						   wbc);
-	}
-#endif
-
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static int
+ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+	return dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, wbc);
+}
+
 const struct address_space_operations ext2_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -990,6 +989,13 @@ const struct address_space_operations ext2_nobh_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static const struct address_space_operations ext2_dax_aops = {
+	.direct_IO		= ext2_direct_IO,
+	.writepages		= ext2_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1394,18 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+	inode->i_op = &ext2_file_inode_operations;
+	inode->i_fop = &ext2_file_operations;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext2_dax_aops;
+	else if (test_opt(inode->i_sb, NOBH))
+		inode->i_mapping->a_ops = &ext2_nobh_aops;
+	else
+		inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
 	struct ext2_inode_info *ei;
@@ -1480,14 +1498,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 		ei->i_data[n] = raw_inode->i_block[n];
 
 	if (S_ISREG(inode->i_mode)) {
-		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, NOBH)) {
-			inode->i_mapping->a_ops = &ext2_nobh_aops;
-			inode->i_fop = &ext2_file_operations;
-		} else {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_file_operations;
-		}
+		ext2_set_file_ops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	return ext2_add_nondir(dentry, inode);
 }
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	d_tmpfile(dentry, inode);
 	unlock_new_inode(inode);

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (5 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 16:02   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jeff Moyer, Matthew Wilcox, Ross Zwisler, Jan Kara,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs, linux-kernel,
	jack

Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically this is pinned pages that are under active dma.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..f21a8e7e47f6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -298,6 +298,56 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return PMD_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->mapping);
+		page->mapping = mapping;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		page->mapping = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -404,6 +454,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -453,6 +504,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -547,6 +599,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, mapping, false);
+		dax_associate_entry(new_entry, mapping);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (6 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-29 16:36   ` Jan Kara
  2018-03-21 22:57 ` [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jérôme Glisse, Christoph Hellwig, david,
	linux-fsdevel, linux-xfs, linux-kernel, jack

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
are going to a device driver to manage bounce pages in device-memory in
the filesystem-dax case we will call back to filesystem specified
callback.

Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.

Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c    |    3 +-
 fs/ext2/super.c          |    6 ++-
 fs/ext4/super.c          |    6 ++-
 fs/xfs/xfs_super.c       |   20 ++++++------
 include/linux/dax.h      |   17 +++++-----
 include/linux/memremap.h |    8 +++++
 7 files changed, 103 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..ecefe9f7eb60 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,66 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+	/* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+			|| pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f8dcc52ca6..e6d7351f3379 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+			&pmem->pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7666c065b96f..6ae20e319bc4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39bf464c35f1..315a323729e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93588ea3d3d2..ef7dd7148c0b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -724,7 +724,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -732,10 +732,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -753,9 +753,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -765,7 +765,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -779,7 +779,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -813,14 +813,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ae27a7efe7ab..92a1d3ee1615 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,12 +52,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return dax_get_by_host(host);
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 #else
@@ -71,13 +67,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return NULL;
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
 {
+	return NULL;
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
-	return NULL;
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
@@ -91,6 +88,8 @@ int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (7 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-03-21 22:57 ` Dan Williams
  2018-03-22  7:49   ` Christoph Hellwig
  2018-03-30 10:37   ` Jan Kara
  2018-03-21 22:58 ` [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:57 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jérôme Glisse, david, hch, linux-fsdevel,
	linux-xfs, linux-kernel, jack

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    2 ++
 fs/Kconfig               |    1 +
 include/linux/memremap.h |   20 ++-------------
 include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
 kernel/memremap.c        |   30 ++++++++++++++++++++---
 mm/Kconfig               |    5 ++++
 mm/hmm.c                 |   13 ++--------
 mm/swap.c                |    3 ++
 8 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecefe9f7eb60..619b1ed6434c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
 		return NULL;
 	}
 
+	dev_pagemap_get_ops();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	dev_pagemap_put_ops();
 	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1f0832bbc32f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..9faf25d6abef 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -130,6 +129,9 @@ struct dev_pagemap {
 	enum memory_type type;
 };
 
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..088c76bce360 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
 }
-#define IS_HMM_ENABLED 0
+
 static inline bool is_device_private_page(const struct page *page)
 {
-	return false;
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool put_devmap_managed_page(struct page *page)
+{
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..72d0bb6fc47d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -476,8 +476,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -497,5 +519,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..dc32828984a3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -720,6 +723,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -730,6 +734,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..4aa554e76d06 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 0f17330dd0e5..eed846cfc8b8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (8 preceding siblings ...)
  2018-03-21 22:57 ` [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-03-21 22:58 ` Dan Williams
  2018-03-22  7:51   ` Christoph Hellwig
  2018-03-21 22:58 ` [PATCH v7 11/14] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:58 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Christoph Hellwig, Jérôme Glisse, david,
	linux-fsdevel, linux-xfs, linux-kernel, jack

The devm_memremap_pages() facility is tightly integrated with the
kernel's memory hotplug functionality. It injects an altmap argument
deep into the architecture specific vmemmap implementation to allow
allocating from specific reserved pages, and it has Linux specific
assumptions about page structure reference counting relative to
get_user_pages() and get_user_pages_fast(). It was an oversight that
this was not marked EXPORT_SYMBOL_GPL from the outset.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 72d0bb6fc47d..766cb9a487ae 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -430,7 +430,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	devres_free(pgmap);
 	return ERR_PTR(error);
 }
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 11/14] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (9 preceding siblings ...)
  2018-03-21 22:58 ` [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
@ 2018-03-21 22:58 ` Dan Williams
  2018-03-21 22:58 ` [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:58 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Jeff Moyer, Dave Chinner, Matthew Wilcox,
	Alexander Viro, Darrick J. Wong, Ross Zwisler, Dave Hansen,
	Andrew Morton, Christoph Hellwig, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-kernel, jack

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    2 +
 fs/dax.c            |   92 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   25 ++++++++++++++
 mm/gup.c            |    5 +++
 4 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 619b1ed6434c..17323bc6d441 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_var(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index f21a8e7e47f6..fc9c6bf82bcc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -348,6 +348,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -489,6 +502,85 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	pgoff_t	indices[PAGEVEC_SIZE];
+	struct page *page = NULL;
+	struct pagevec pvec;
+	pgoff_t	index, end;
+	unsigned i;
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+		return NULL;
+
+	pagevec_init(&pvec);
+	index = 0;
+	end = -1;
+	/*
+	 * Flush dax_layout_lock() sections to ensure all possible page
+	 * references have been taken, or otherwise arrange for faults
+	 * to block on the filesystem lock that is taken for
+	 * establishing new mappings.
+	 */
+	unmap_mapping_range(mapping, 0, 0, 1);
+	synchronize_rcu();
+
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dax_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+			if (page)
+				break;
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+
+		if (page)
+			break;
+	}
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 92a1d3ee1615..fc854674323f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -56,6 +56,18 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
 void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+
+static inline void dax_layout_lock(void)
+{
+	rcu_read_lock();
+}
+
+static inline void dax_layout_unlock(void)
+{
+	rcu_read_unlock();
+}
+
+struct page *dax_layout_busy_page(struct address_space *mapping);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -82,6 +94,19 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline void dax_layout_lock(void)
+{
+}
+
+static inline void dax_layout_unlock(void)
+{
+}
+
+static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return NULL;
+}
 #endif
 
 int dax_read_lock(void);
diff --git a/mm/gup.c b/mm/gup.c
index 1b46e6e74881..a81efac6983a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -693,7 +694,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		dax_layout_lock();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_layout_unlock();
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,
@@ -1809,7 +1812,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
+		dax_layout_lock();
 		gup_pgd_range(addr, end, write, pages, &nr);
+		dax_layout_unlock();
 		local_irq_enable();
 		ret = nr;
 	}

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (10 preceding siblings ...)
  2018-03-21 22:58 ` [PATCH v7 11/14] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-03-21 22:58 ` Dan Williams
  2018-03-22  7:25   ` Christoph Hellwig
  2018-03-30 15:16   ` Darrick J. Wong
  2018-03-21 22:58 ` [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
  2018-03-21 22:58 ` [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  13 siblings, 2 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:58 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Darrick J. Wong, Ross Zwisler, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-kernel, jack

In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch, for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |    5 +----
 fs/xfs/xfs_ioctl.c |    5 +----
 fs/xfs/xfs_iops.c  |   10 +++++++---
 fs/xfs/xfs_pnfs.c  |    3 ++-
 4 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..18edf04811d0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -768,7 +768,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = false;
 
@@ -782,9 +782,6 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..4151fade4bb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -614,7 +614,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -648,9 +648,6 @@ xfs_ioc_space(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 951e84df5576..d23aa08426f9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1028,13 +1028,17 @@ xfs_vn_setattr(
 
 	if (iattr->ia_valid & ATTR_SIZE) {
 		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
-		uint			iolock = XFS_IOLOCK_EXCL;
+		uint			iolock;
+
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
 		error = xfs_break_layouts(d_inode(dentry), &iolock);
-		if (error)
+		if (error) {
+			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;
+		}
 
-		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		error = xfs_vn_setattr_size(dentry, iattr);
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..6ea7b0b55d02 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -43,7 +43,8 @@ xfs_break_layouts(
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
 		error = break_layout(inode, true);
-		*iolock = XFS_IOLOCK_EXCL;
+		*iolock &= ~XFS_IOLOCK_SHARED;
+		*iolock |= XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (11 preceding siblings ...)
  2018-03-21 22:58 ` [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
@ 2018-03-21 22:58 ` Dan Williams
  2018-03-22  7:27   ` Christoph Hellwig
  2018-03-21 22:58 ` [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:58 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Darrick J. Wong, Ross Zwisler, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-kernel, jack

When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_UNMAPI are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAPI case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
be recalling FL_LAYOUT leases, BREAK_UNMAPI breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.

Note that the unlock tracking is needed in a follow on change. That will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   32 ++++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.h |   16 ++++++++++++++++
 fs/xfs/xfs_ioctl.c |    3 +--
 fs/xfs/xfs_iops.c  |    6 +++---
 fs/xfs/xfs_pnfs.c  |   13 +++++++------
 fs/xfs/xfs_pnfs.h  |    6 ++++--
 6 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 18edf04811d0..7f37fadf007e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
-	error = xfs_break_layouts(inode, iolock);
+	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
 	if (error)
 		return error;
 
@@ -752,6 +752,34 @@ xfs_file_write_iter(
 	return ret;
 }
 
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock,
+	enum layout_break_reason reason)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	bool			did_unlock = false;
+	int			error = 0;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
+				| (reason == BREAK_UNMAPI
+					? XFS_MMAPLOCK_EXCL : 0)));
+
+	switch (reason) {
+	case BREAK_UNMAPI:
+		/* fall through */
+	case BREAK_WRITE:
+		error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
+		break;
+	default:
+		error = -EINVAL;
+		break;
+	}
+
+	return error;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
@@ -778,7 +806,7 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAPI);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..6951d57aae71 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 					>> XFS_ILOCK_SHIFT)
 
 /*
+ * Layouts are broken in the BREAK_WRITE case to ensure that
+ * layout-holders do not collide with local writes. Additionally,
+ * layouts are broken in the BREAK_UNMAPI case to make sure the
+ * layout-holder has a consistent view of the file's extent map. While
+ * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
+ * BREAK_UNMAPI breaks additionally require waiting for busy dax-pages
+ * to go idle.
+ */
+enum layout_break_reason {
+        BREAK_WRITE,
+        BREAK_UNMAPI,
+};
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +461,8 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
+int	xfs_break_layouts(struct inode *inode, uint *iolock,
+		enum layout_break_reason reason);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4151fade4bb1..967be496b052 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,6 @@
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
-#include "xfs_pnfs.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
@@ -644,7 +643,7 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAPI);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d23aa08426f9..7364474347d1 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 
 #include <linux/capability.h>
@@ -1027,13 +1026,14 @@ xfs_vn_setattr(
 	int			error;
 
 	if (iattr->ia_valid & ATTR_SIZE) {
-		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
+		struct inode		*inode = d_inode(dentry);
+		struct xfs_inode	*ip = XFS_I(inode);
 		uint			iolock;
 
 		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-		error = xfs_break_layouts(d_inode(dentry), &iolock);
+		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAPI);
 		if (error) {
 			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 6ea7b0b55d02..40e69edb7e2e 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -31,17 +31,18 @@
  * rules in the page fault path we don't bother.
  */
 int
-xfs_break_layouts(
+xfs_break_leased_layouts(
 	struct inode		*inode,
-	uint			*iolock)
+	uint			*iolock,
+	bool			*did_unlock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error;
 
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
+	*did_unlock = false;
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		*did_unlock = true;
 		error = break_layout(inode, true);
 		*iolock &= ~XFS_IOLOCK_SHARED;
 		*iolock |= XFS_IOLOCK_EXCL;
@@ -121,8 +122,8 @@ xfs_fs_map_blocks(
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more
-	 * complicated.  See the comment near xfs_break_layouts for a detailed
-	 * explanation.
+	 * complicated.  See the comment near xfs_break_leased_layouts
+	 * for a detailed explanation.
 	 */
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index bf45951e28fe..0f2f51037064 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -9,11 +9,13 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
 
-int xfs_break_layouts(struct inode *inode, uint *iolock);
+int xfs_break_leased_layouts(struct inode *inode, uint *iolock,
+		bool *did_unlock);
 #else
 static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
+xfs_break_leased_layouts(struct inode *inode, uint *iolock, bool *did_unlock)
 {
+	*did_unlock = false;
 	return 0;
 }
 #endif /* CONFIG_EXPORTFS_BLOCK_OPS */

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (12 preceding siblings ...)
  2018-03-21 22:58 ` [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-21 22:58 ` Dan Williams
  2018-03-22  7:43   ` Christoph Hellwig
  13 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-21 22:58 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Dave Chinner, Darrick J. Wong, Ross Zwisler,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-kernel, jack

xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c |   59 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 49 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7f37fadf007e..d4573f93fddb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -752,6 +752,38 @@ xfs_file_write_iter(
 	return ret;
 }
 
+static void
+xfs_wait_var_event(
+	struct inode		*inode,
+	uint			iolock,
+	bool			*did_unlock)
+{
+	struct xfs_inode        *ip = XFS_I(inode);
+
+	*did_unlock = true;
+	xfs_iunlock(ip, iolock);
+	schedule();
+	xfs_ilock(ip, iolock);
+}
+
+static int
+xfs_break_dax_layouts(
+	struct inode		*inode,
+	uint			iolock,
+	bool			*did_unlock)
+{
+	struct page		*page;
+
+	*did_unlock = false;
+	page = dax_layout_busy_page(inode->i_mapping);
+	if (!page)
+		return 0;
+
+	return ___wait_var_event(&page->_refcount,
+			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+			0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
+}
+
 int
 xfs_break_layouts(
 	struct inode		*inode,
@@ -766,16 +798,23 @@ xfs_break_layouts(
 				| (reason == BREAK_UNMAPI
 					? XFS_MMAPLOCK_EXCL : 0)));
 
-	switch (reason) {
-	case BREAK_UNMAPI:
-		/* fall through */
-	case BREAK_WRITE:
-		error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
-		break;
-	default:
-		error = -EINVAL;
-		break;
-	}
+	do {
+		switch (reason) {
+		case BREAK_UNMAPI:
+			error = xfs_break_dax_layouts(inode, *iolock,
+					&did_unlock);
+			/* fall through */
+		case BREAK_WRITE:
+			if (error || did_unlock)
+				break;
+			error = xfs_break_leased_layouts(inode, iolock,
+					&did_unlock);
+			break;
+		default:
+			error = -EINVAL;
+			break;
+		}
+	} while (error == 0 && did_unlock);
 
 	return error;
 }

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  2018-03-21 22:58 ` [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
@ 2018-03-22  7:25   ` Christoph Hellwig
  2018-03-30 15:16   ` Darrick J. Wong
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22  7:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Darrick J. Wong, Ross Zwisler, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-kernel, jack

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-21 22:58 ` [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-22  7:27   ` Christoph Hellwig
  2018-03-22 15:50     ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22  7:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Darrick J. Wong, Ross Zwisler, Dave Chinner,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-kernel, jack

> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> +				| (reason == BREAK_UNMAPI
> +					? XFS_MMAPLOCK_EXCL : 0)));

please split the assert, e.g.:

	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));

	switch (reason) {
+	case BREAK_UNMAPI:
		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
> +		/* fall through */
> +	case BREAK_WRITE:
> +		error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
> +		break;
> +	default:
> +		error = -EINVAL;
> +		break;
> +	}
> +
> +	return error;

I have to say I'd prefer BREAK_UNMAP over BREAK_UNMAPI given that weird
I suffix doesn't buy us anything, but that's just a minor issue.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-21 22:58 ` [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
@ 2018-03-22  7:43   ` Christoph Hellwig
  2018-03-22 16:28     ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22  7:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Dave Chinner, Darrick J. Wong,
	Ross Zwisler, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-kernel

On Wed, Mar 21, 2018 at 03:58:31PM -0700, Dan Williams wrote:
> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
> for busy / pinned dax pages and waits for those pages to go idle before
> any potential extent unmap operation.
> 
> dax_layout_busy_page() handles synchronizing against new page-busy
> events (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the xfs inode
> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
> busy page it returns it for xfs to wait for the page-idle event that
> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
> pages are idle at count 1, see generic_dax_pagefree()).
> 
> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
> deadlock the process that might be trying to elevate the page count of
> more pages before arranging for any of them to go idle. I.e. the typical
> case of submitting I/O is that iov_iter_get_pages() elevates the
> reference count of all pages in the I/O before starting I/O on the first
> page. The process of elevating the reference count of all pages involved
> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c |   59 ++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 49 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7f37fadf007e..d4573f93fddb 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -752,6 +752,38 @@ xfs_file_write_iter(
>  	return ret;
>  }
>  
> +static void
> +xfs_wait_var_event(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct xfs_inode        *ip = XFS_I(inode);
> +
> +	*did_unlock = true;
> +	xfs_iunlock(ip, iolock);
> +	schedule();
> +	xfs_ilock(ip, iolock);
> +}
> +
> +static int
> +xfs_break_dax_layouts(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct page		*page;
> +
> +	*did_unlock = false;
> +	page = dax_layout_busy_page(inode->i_mapping);
> +	if (!page)
> +		return 0;
> +
> +	return ___wait_var_event(&page->_refcount,
> +			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> +			0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
> +}
> +
>  int
>  xfs_break_layouts(
>  	struct inode		*inode,
> @@ -766,16 +798,23 @@ xfs_break_layouts(
>  				| (reason == BREAK_UNMAPI
>  					? XFS_MMAPLOCK_EXCL : 0)));
>  
> -	switch (reason) {
> -	case BREAK_UNMAPI:
> -		/* fall through */
> -	case BREAK_WRITE:
> -		error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
> -		break;
> -	default:
> -		error = -EINVAL;
> -		break;
> -	}
> +	do {
> +		switch (reason) {
> +		case BREAK_UNMAPI:
> +			error = xfs_break_dax_layouts(inode, *iolock,
> +					&did_unlock);
> +			/* fall through */
> +		case BREAK_WRITE:
> +			if (error || did_unlock)
> +				break;
> +			error = xfs_break_leased_layouts(inode, iolock,
> +					&did_unlock);
> +			break;
> +		default:
> +			error = -EINVAL;
> +			break;
> +		}
> +	} while (error == 0 && did_unlock);

Nitpick:  I think did_unlock should probably be called done in this context,
and how about something like this:

	for (;;) {
		switch (reason) {
		case BREAK_UNMAPI:
			error = xfs_break_dax_layouts(inode, *iolock, &done);
			if (error || done)
				return error;
			/*fallthrough*/
		case BREAK_WRITE:
			error = xfs_break_leased_layouts(inode, iolock, &done);
			if (error || done)
				return error;
			break;
		default:
			WARN_ON_ONCE(1);
			return -EINVAL;
		}
	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-21 22:57 ` [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-03-22  7:49   ` Christoph Hellwig
  2018-03-30 10:37   ` Jan Kara
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22  7:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, Jérôme Glisse, david, hch,
	linux-fsdevel, linux-xfs, linux-kernel, jack

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-03-21 22:58 ` [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
@ 2018-03-22  7:51   ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22  7:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, Christoph Hellwig,
	Jérôme Glisse, david, linux-fsdevel, linux-xfs,
	linux-kernel, jack

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-22  7:27   ` Christoph Hellwig
@ 2018-03-22 15:50     ` Dan Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-22 15:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, Darrick J. Wong, Ross Zwisler, Dave Chinner,
	linux-fsdevel, linux-xfs, Linux Kernel Mailing List, Jan Kara

On Thu, Mar 22, 2018 at 12:27 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +     ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
>> +                             | (reason == BREAK_UNMAPI
>> +                                     ? XFS_MMAPLOCK_EXCL : 0)));
>
> please split the assert, e.g.:
>
>         ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
>
>         switch (reason) {
> +       case BREAK_UNMAPI:
>                 ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));

Sure, I was thinking to keep it out of the loop, but the common case
is that this loop only iterates once.

>> +             /* fall through */
>> +     case BREAK_WRITE:
>> +             error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
>> +             break;
>> +     default:
>> +             error = -EINVAL;
>> +             break;
>> +     }
>> +
>> +     return error;
>
> I have to say I'd prefer BREAK_UNMAP over BREAK_UNMAPI given that weird
> I suffix doesn't buy us anything, but that's just a minor issue.

Ok will do.

> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-22  7:43   ` Christoph Hellwig
@ 2018-03-22 16:28     ` Dan Williams
  2018-03-22 17:02       ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-22 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, Jan Kara, Dave Chinner, Darrick J. Wong,
	Ross Zwisler, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu, Mar 22, 2018 at 12:43 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 21, 2018 at 03:58:31PM -0700, Dan Williams wrote:
>> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
>> for busy / pinned dax pages and waits for those pages to go idle before
>> any potential extent unmap operation.
>>
>> dax_layout_busy_page() handles synchronizing against new page-busy
>> events (get_user_pages). It invalidates all mappings to trigger the
>> get_user_pages slow path which will eventually block on the xfs inode
>> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
>> busy page it returns it for xfs to wait for the page-idle event that
>> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
>> pages are idle at count 1, see generic_dax_pagefree()).
>>
>> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
>> deadlock the process that might be trying to elevate the page count of
>> more pages before arranging for any of them to go idle. I.e. the typical
>> case of submitting I/O is that iov_iter_get_pages() elevates the
>> reference count of all pages in the I/O before starting I/O on the first
>> page. The process of elevating the reference count of all pages involved
>> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/xfs_file.c |   59 ++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 7f37fadf007e..d4573f93fddb 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -752,6 +752,38 @@ xfs_file_write_iter(
>>       return ret;
>>  }
>>
>> +static void
>> +xfs_wait_var_event(
>> +     struct inode            *inode,
>> +     uint                    iolock,
>> +     bool                    *did_unlock)
>> +{
>> +     struct xfs_inode        *ip = XFS_I(inode);
>> +
>> +     *did_unlock = true;
>> +     xfs_iunlock(ip, iolock);
>> +     schedule();
>> +     xfs_ilock(ip, iolock);
>> +}
>> +
>> +static int
>> +xfs_break_dax_layouts(
>> +     struct inode            *inode,
>> +     uint                    iolock,
>> +     bool                    *did_unlock)
>> +{
>> +     struct page             *page;
>> +
>> +     *did_unlock = false;
>> +     page = dax_layout_busy_page(inode->i_mapping);
>> +     if (!page)
>> +             return 0;
>> +
>> +     return ___wait_var_event(&page->_refcount,
>> +                     atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
>> +                     0, 0, xfs_wait_var_event(inode, iolock, did_unlock));
>> +}
>> +
>>  int
>>  xfs_break_layouts(
>>       struct inode            *inode,
>> @@ -766,16 +798,23 @@ xfs_break_layouts(
>>                               | (reason == BREAK_UNMAPI
>>                                       ? XFS_MMAPLOCK_EXCL : 0)));
>>
>> -     switch (reason) {
>> -     case BREAK_UNMAPI:
>> -             /* fall through */
>> -     case BREAK_WRITE:
>> -             error = xfs_break_leased_layouts(inode, iolock, &did_unlock);
>> -             break;
>> -     default:
>> -             error = -EINVAL;
>> -             break;
>> -     }
>> +     do {
>> +             switch (reason) {
>> +             case BREAK_UNMAPI:
>> +                     error = xfs_break_dax_layouts(inode, *iolock,
>> +                                     &did_unlock);
>> +                     /* fall through */
>> +             case BREAK_WRITE:
>> +                     if (error || did_unlock)
>> +                             break;
>> +                     error = xfs_break_leased_layouts(inode, iolock,
>> +                                     &did_unlock);
>> +                     break;
>> +             default:
>> +                     error = -EINVAL;
>> +                     break;
>> +             }
>> +     } while (error == 0 && did_unlock);
>
> Nitpick:  I think did_unlock should probably be called done in this context,
> and how about something like this:
>
>         for (;;) {
>                 switch (reason) {
>                 case BREAK_UNMAPI:
>                         error = xfs_break_dax_layouts(inode, *iolock, &done);
>                         if (error || done)
>                                 return error;
>                         /*fallthrough*/
>                 case BREAK_WRITE:
>                         error = xfs_break_leased_layouts(inode, iolock, &done);
>                         if (error || done)
>                                 return error;
>                         break;
>                 default:
>                         WARN_ON_ONCE(1);
>                         return -EINVAL;
>                 }
>         }

I can switch to the for() loop model, but it isn't 'done' flag, it's a
'retry' flag. I.e. if xfs_break_leased_layouts() dropped
XFS_MMAPLOCK_EXCL we need to retry xfs_break_dax_layouts(), and if
xfs_break_dax_layouts() slept we need to immediately retry it. So, how
about this?

for (;;) {
        switch (reason) {
        case BREAK_UNMAP:
                ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));

                error = xfs_break_dax_layouts(inode, *iolock, &retry);
                if (error)
                        return error;
                if (retry)
                        continue;
                /* fall through */
        case BREAK_WRITE:
                error = xfs_break_leased_layouts(inode, iolock, &retry);
                if (error || !retry)
                        return error;
                continue;
        default:
                WARN_ON_ONCE(1);
                return -EINVAL;
        }
}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-22 16:28     ` Dan Williams
@ 2018-03-22 17:02       ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-22 17:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, Jan Kara, Dave Chinner,
	Darrick J. Wong, Ross Zwisler, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu, Mar 22, 2018 at 09:28:11AM -0700, Dan Williams wrote:
> I can switch to the for() loop model, but it isn't 'done' flag, it's a
> 'retry' flag. I.e. if xfs_break_leased_layouts() dropped
> XFS_MMAPLOCK_EXCL we need to retry xfs_break_dax_layouts(), and if

Oh, indeed.

> xfs_break_dax_layouts() slept we need to immediately retry it.
> So, how
> about this?

Ah, with that that the loop doesn't work too well, especially the
continue inside a switch statement is probably not to good.  I guess
we should just stick to your original, modulo better variable naming.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations
  2018-03-21 22:57 ` [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations Dan Williams
@ 2018-03-29 15:28   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 15:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jeff Moyer, Ross Zwisler, Matthew Wilcox, Jan Kara,
	Christoph Hellwig, Dave Chinner, linux-fsdevel, linux-xfs,
	linux-kernel

On Wed 21-03-18 15:57:21, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Define some generic VFS aops
> helpers for dax. These noop implementations are there in the dax case to
> prevent the VFS from falling back to operations with page-cache
> assumptions, dax_writeback_mapping_range() may not be referenced in the
> FS_DAX=n case.
> 
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. I'm just wondering whether we should not add also some
stubs calling WARN_ON() and bailing out for stuff like ->migratepage(),
->releasepage(). But probably it is very unlikely we'll introduce bugs that
would trigger those paths so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c          |   27 +++++++++++++++++++++++++++
>  include/linux/dax.h |   12 +++++++++---
>  include/linux/fs.h  |    3 +++
>  3 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 7ff3cb904acd..e49d0ac6f800 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1060,6 +1060,33 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  }
>  EXPORT_SYMBOL(noop_fsync);
>  
> +int noop_set_page_dirty(struct page *page)
> +{
> +	/*
> +	 * Unlike __set_page_dirty_no_writeback that handles dirty page
> +	 * tracking in the page object, dax does all dirty tracking in
> +	 * the inode address_space in response to mkwrite faults. In the
> +	 * dax case we only need to worry about potentially dirty CPU
> +	 * caches, not dirty page cache pages to write back.
> +	 *
> +	 * This callback is defined to prevent fallback to
> +	 * __set_page_dirty_buffers() in set_page_dirty().
> +	 */
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(noop_set_page_dirty);
> +
> +void noop_invalidatepage(struct page *page, unsigned int offset,
> +		unsigned int length)
> +{
> +	/*
> +	 * There is no page cache to invalidate in the dax case, however
> +	 * we need this callback defined to prevent falling back to
> +	 * block_invalidatepage() in do_invalidatepage().
> +	 */
> +}
> +EXPORT_SYMBOL_GPL(noop_invalidatepage);
> +
>  /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
>  void kfree_link(void *p)
>  {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0185ecdae135..ae27a7efe7ab 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
>  }
>  #endif
>  
> +struct writeback_control;
>  int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
>  #if IS_ENABLED(CONFIG_FS_DAX)
>  int __bdev_dax_supported(struct super_block *sb, int blocksize);
> @@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
>  }
>  
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +int dax_writeback_mapping_range(struct address_space *mapping,
> +		struct block_device *bdev, struct writeback_control *wbc);
>  #else
>  static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
>  {
> @@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  {
>  	return NULL;
>  }
> +
> +static inline int dax_writeback_mapping_range(struct address_space *mapping,
> +		struct block_device *bdev, struct writeback_control *wbc)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif
>  
>  int dax_read_lock(void);
> @@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
>  	return mapping->host && IS_DAX(mapping->host);
>  }
>  
> -struct writeback_control;
> -int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct block_device *bdev, struct writeback_control *wbc);
>  #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 79c413985305..b57db31d294d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3129,6 +3129,9 @@ extern int simple_rmdir(struct inode *, struct dentry *);
>  extern int simple_rename(struct inode *, struct dentry *,
>  			 struct inode *, struct dentry *, unsigned int);
>  extern int noop_fsync(struct file *, loff_t, loff_t, int);
> +extern int noop_set_page_dirty(struct page *page);
> +extern void noop_invalidatepage(struct page *page, unsigned int offset,
> +		unsigned int length);
>  extern int simple_empty(struct dentry *);
>  extern int simple_readpage(struct file *file, struct page *page);
>  extern int simple_write_begin(struct file *file, struct address_space *mapping,
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages()
  2018-03-21 22:57 ` [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages() Dan Williams
@ 2018-03-29 15:29   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 15:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jeff Moyer, Ross Zwisler, Matthew Wilcox, Jan Kara,
	Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-kernel

On Wed 21-03-18 15:57:27, Dan Williams wrote:
> Block device inodes never have S_DAX set, so kill the check for DAX and
> diversion to dax_writeback_mapping_range().
> 
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/block_dev.c |    5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index fe09ef9c21f3..846ee2d31781 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1946,11 +1946,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>  static int blkdev_writepages(struct address_space *mapping,
>  			     struct writeback_control *wbc)
>  {
> -	if (dax_mapping(mapping)) {
> -		struct block_device *bdev = I_BDEV(mapping->host);
> -
> -		return dax_writeback_mapping_range(mapping, bdev, wbc);
> -	}
>  	return generic_writepages(mapping, wbc);
>  }
>  
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops
  2018-03-21 22:57 ` [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-29 15:30   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 15:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jeff Moyer, Matthew Wilcox, Ross Zwisler,
	Christoph Hellwig, Jan Kara, Dave Chinner, linux-fsdevel,
	linux-xfs, linux-kernel

On Wed 21-03-18 15:57:32, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings like the
> following:
> 
>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>  xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>  [..]
>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
>  [..]
>  Call Trace:
>   set_page_dirty_lock+0x40/0x60
>   bio_set_pages_dirty+0x37/0x50
>   iomap_dio_actor+0x2b7/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   iomap_apply+0xa4/0x110
>   iomap_dio_rw+0x29e/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>   __vfs_read+0xf9/0x170
>   vfs_read+0xa6/0x150
>   SyS_pread64+0x93/0xb0
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> ...where the default set_page_dirty() handler assumes that dirty state
> is being tracked in 'struct page' flags.
> 
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/xfs/xfs_aops.c |   21 +++++++++++++++++----
>  fs/xfs/xfs_aops.h |    1 +
>  fs/xfs/xfs_iops.c |    5 ++++-
>  3 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 9c6a830da0ee..5f1f5948ecc2 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1194,16 +1194,22 @@ xfs_vm_writepages(
>  	int			ret;
>  
>  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> -	if (dax_mapping(mapping))
> -		return dax_writeback_mapping_range(mapping,
> -				xfs_find_bdev_for_inode(mapping->host), wbc);
> -
>  	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
>  	if (wpc.ioend)
>  		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
>  	return ret;
>  }
>  
> +STATIC int
> +xfs_dax_writepages(
> +	struct address_space	*mapping,
> +	struct writeback_control *wbc)
> +{
> +	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> +	return dax_writeback_mapping_range(mapping,
> +			xfs_find_bdev_for_inode(mapping->host), wbc);
> +}
> +
>  /*
>   * Called to move a page into cleanable state - and from there
>   * to be released. The page should already be clean. We always
> @@ -1505,3 +1511,10 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  	.error_remove_page	= generic_error_remove_page,
>  };
> +
> +const struct address_space_operations xfs_dax_aops = {
> +	.direct_IO		= xfs_vm_direct_IO,
> +	.writepages		= xfs_dax_writepages,
> +	.set_page_dirty		= noop_set_page_dirty,
> +	.invalidatepage		= noop_invalidatepage,
> +};
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 88c85ea63da0..69346d460dfa 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -54,6 +54,7 @@ struct xfs_ioend {
>  };
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> +extern const struct address_space_operations xfs_dax_aops;
>  
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 56475fcd76f2..951e84df5576 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1272,7 +1272,10 @@ xfs_setup_iops(
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
>  		inode->i_fop = &xfs_file_operations;
> -		inode->i_mapping->a_ops = &xfs_address_space_operations;
> +		if (IS_DAX(inode))
> +			inode->i_mapping->a_ops = &xfs_dax_aops;
> +		else
> +			inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
>  		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops
  2018-03-21 22:57 ` [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops Dan Williams
@ 2018-03-29 15:40   ` Jan Kara
  2018-03-29 18:09     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2018-03-29 15:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Theodore Ts'o, Andreas Dilger, linux-ext4,
	Jan Kara, david, hch, linux-fsdevel, linux-xfs, linux-kernel

On Wed 21-03-18 15:57:37, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.
> 
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: linux-ext4@vger.kernel.org
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good, just one nit below.

> @@ -3946,6 +3961,13 @@ static const struct address_space_operations ext4_da_aops = {
>  	.error_remove_page	= generic_error_remove_page,
>  };
>  
> +static const struct address_space_operations ext4_dax_aops = {
> +	.direct_IO		= ext4_direct_IO,

So ext4_direct_IO() for IS_DAX() files will just bail out. So could you
just provide ext4_dax_direct_IO() which will bail out and use it here? With
a similar comment as in xfs_vm_direct_IO() that open still needs this
method set... Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops
  2018-03-21 22:57 ` [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops Dan Williams
@ 2018-03-29 15:44   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 15:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, kbuild test robot, david, hch,
	linux-fsdevel, linux-xfs, linux-kernel, jack

On Wed 21-03-18 15:57:43, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.
> 
> Cc: Jan Kara <jack@suse.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Thanks for the cleanup in the patch when setting inode ops. The patch looks
good except the same comment applies as for ext4 - please provide
ext2_dax_direct_IO() which will just bail and use it in ext2_dax_aops.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-21 22:57 ` [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
@ 2018-03-29 16:02   ` Jan Kara
  2018-03-29 19:02     ` Dan Williams
  2018-03-29 23:02     ` Dan Williams
  0 siblings, 2 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 16:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jeff Moyer, Matthew Wilcox, Ross Zwisler, Jan Kara,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs, linux-kernel

On Wed 21-03-18 15:57:48, Dan Williams wrote:
> Catch cases where extent unmap operations encounter pages that are
> pinned / busy. Typically this is pinned pages that are under active dma.
> This warning is a canary for potential data corruption as truncated
> blocks could be allocated to a new file while the device is still
> performing i/o.
> 
> Here is an example of a collision that this implementation catches:
> 
>  WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
>  [..]
>  Call Trace:
>   __dax_invalidate_mapping_entry+0x6c/0xf0
>   dax_delete_mapping_entry+0xf/0x20
>   truncate_exceptional_pvec_entries.part.12+0x1af/0x200
>   truncate_inode_pages_range+0x268/0x970
>   ? tlb_gather_mmu+0x10/0x20
>   ? up_write+0x1c/0x40
>   ? unmap_mapping_range+0x73/0x140
>   xfs_free_file_space+0x1b6/0x5b0 [xfs]
>   ? xfs_file_fallocate+0x7f/0x320 [xfs]
>   ? down_write_nested+0x40/0x70
>   ? xfs_ilock+0x21d/0x2f0 [xfs]
>   xfs_file_fallocate+0x162/0x320 [xfs]
>   ? rcu_read_lock_sched_held+0x3f/0x70
>   ? rcu_sync_lockdep_assert+0x2a/0x50
>   ? __sb_start_write+0xd0/0x1b0
>   ? vfs_fallocate+0x20c/0x270
>   vfs_fallocate+0x154/0x270
>   SyS_fallocate+0x43/0x80
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Two comments when looking at this now:

> +#define for_each_entry_pfn(entry, pfn, end_pfn) \
> +	for (pfn = dax_radix_pfn(entry), \
> +			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
> +			pfn < end_pfn; \
> +			pfn++)

Why don't you declare 'end_pfn' inside the for() block? That way you don't
have to pass the variable as an argument to for_each_entry_pfn(). It's not
like you need end_pfn anywhere in the loop body, you just use it to cache
loop termination index.

> @@ -547,6 +599,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  
>  	spin_lock_irq(&mapping->tree_lock);
>  	new_entry = dax_radix_locked_entry(pfn, flags);
> +	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> +		dax_disassociate_entry(entry, mapping, false);
> +		dax_associate_entry(new_entry, mapping);
> +	}

I find it quite tricky that in case we pass zero page / empty entry into
dax_[dis]associate_entry(), it will not do anything because
dax_entry_size() will return 0. Can we add an explicit check into
dax_[dis]associate_entry() or at least a comment there?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-03-21 22:57 ` [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-03-29 16:36   ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-29 16:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, Jérôme Glisse,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs, linux-kernel,
	jack

On Wed 21-03-18 15:57:53, Dan Williams wrote:
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
> 
> Recall that the 'struct page' entries for DAX memory are created with
> devm_memremap_pages(). That routine arranges for the pages to be
> allocated, but never onlined, so a DAX page is DMA-idle when its
> reference count reaches one.
> 
> Also recall that the HMM sub-system added infrastructure to trap the
> page-idle (2-to-1 reference count) transition of the pages allocated by
> devm_memremap_pages() and trigger a callback via the 'struct
> dev_pagemap' associated with the page range. Whereas the HMM callbacks
> are going to a device driver to manage bounce pages in device-memory in
> the filesystem-dax case we will call back to filesystem specified
> callback.
> 
> Since the callback is not known at devm_memremap_pages() time we arrange
> for the filesystem to install it at mount time. No functional changes
> are expected as this only registers a nop handler for the ->page_free()
> event for device-mapped pages.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: "J�r�me Glisse" <jglisse@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
>  drivers/nvdimm/pmem.c    |    3 +-
>  fs/ext2/super.c          |    6 ++-
>  fs/ext4/super.c          |    6 ++-
>  fs/xfs/xfs_super.c       |   20 ++++++------
>  include/linux/dax.h      |   17 +++++-----
>  include/linux/memremap.h |    8 +++++
>  7 files changed, 103 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 2b2332b605e4..ecefe9f7eb60 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
>  static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
>  static struct super_block *dax_superblock __read_mostly;
> +static DEFINE_MUTEX(devmap_lock);
>  
>  #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
>  static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> @@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>  }
>  EXPORT_SYMBOL(bdev_dax_pgoff);
>  
> -#if IS_ENABLED(CONFIG_FS_DAX)
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -	if (!blk_queue_dax(bdev->bd_queue))
> -		return NULL;
> -	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> -}
> -EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -#endif
> -
>  /**
>   * __bdev_dax_supported() - Check if the device supports dax for filesystem
>   * @sb: The superblock of the device
> @@ -169,9 +160,66 @@ struct dax_device {
>  	const char *host;
>  	void *private;
>  	unsigned long flags;
> +	struct dev_pagemap *pgmap;
>  	const struct dax_operations *ops;
>  };
>  
> +#if IS_ENABLED(CONFIG_FS_DAX)
> +static void generic_dax_pagefree(struct page *page, void *data)
> +{
> +	/* TODO: wakeup page-idle waiters */
> +}
> +
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
> +{
> +	struct dax_device *dax_dev;
> +	struct dev_pagemap *pgmap;
> +
> +	if (!blk_queue_dax(bdev->bd_queue))
> +		return NULL;
> +	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
> +	if (!dax_dev->pgmap)
> +		return dax_dev;
> +	pgmap = dax_dev->pgmap;
> +
> +	mutex_lock(&devmap_lock);
> +	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
> +			|| pgmap->page_fault
> +			|| pgmap->type != MEMORY_DEVICE_HOST) {
> +		put_dax(dax_dev);
> +		mutex_unlock(&devmap_lock);
> +		return NULL;
> +	}
> +
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = generic_dax_pagefree;
> +	pgmap->data = owner;
> +	mutex_unlock(&devmap_lock);
> +
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
> +
> +void fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
> +
> +	put_dax(dax_dev);
> +	if (!pgmap)
> +		return;
> +	if (!pgmap->data)
> +		return;
> +
> +	mutex_lock(&devmap_lock);
> +	WARN_ON(pgmap->data != owner);
> +	pgmap->type = MEMORY_DEVICE_HOST;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
> +	mutex_unlock(&devmap_lock);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_release);
> +#endif
> +
>  static ssize_t write_cache_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> @@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
>  }
>  EXPORT_SYMBOL_GPL(alloc_dax);
>  
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +		const struct dax_operations *ops, struct dev_pagemap *pgmap)
> +{
> +	struct dax_device *dax_dev = alloc_dax(private, host, ops);
> +
> +	if (dax_dev)
> +		dax_dev->pgmap = pgmap;
> +	return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(alloc_dax_devmap);
> +
>  void put_dax(struct dax_device *dax_dev)
>  {
>  	if (!dax_dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 06f8dcc52ca6..e6d7351f3379 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
>  	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
>  	disk->bb = &pmem->bb;
>  
> -	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
> +	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
> +			&pmem->pgmap);
>  	if (!dax_dev) {
>  		put_disk(disk);
>  		return -ENOMEM;
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 7666c065b96f..6ae20e319bc4 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
>  	brelse (sbi->s_sbh);
>  	sb->s_fs_info = NULL;
>  	kfree(sbi->s_blockgroup_lock);
> -	fs_put_dax(sbi->s_daxdev);
> +	fs_dax_release(sbi->s_daxdev, sb);
>  	kfree(sbi);
>  }
>  
> @@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
>  
>  static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>  	struct buffer_head * bh;
>  	struct ext2_sb_info * sbi;
>  	struct ext2_super_block * es;
> @@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  	kfree(sbi->s_blockgroup_lock);
>  	kfree(sbi);
>  failed:
> -	fs_put_dax(dax_dev);
> +	fs_dax_release(dax_dev, sb);
>  	return ret;
>  }
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 39bf464c35f1..315a323729e3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
>  	if (sbi->s_chksum_driver)
>  		crypto_free_shash(sbi->s_chksum_driver);
>  	kfree(sbi->s_blockgroup_lock);
> -	fs_put_dax(sbi->s_daxdev);
> +	fs_dax_release(sbi->s_daxdev, sb);
>  	kfree(sbi);
>  }
>  
> @@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
>  
>  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>  	char *orig_data = kstrdup(data, GFP_KERNEL);
>  	struct buffer_head *bh;
>  	struct ext4_super_block *es = NULL;
> @@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  out_free_base:
>  	kfree(sbi);
>  	kfree(orig_data);
> -	fs_put_dax(dax_dev);
> +	fs_dax_release(dax_dev, sb);
>  	return err ? err : ret;
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 93588ea3d3d2..ef7dd7148c0b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -724,7 +724,7 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp, mp->m_logdev_targp);
>  		xfs_blkdev_put(logdev);
> -		fs_put_dax(dax_logdev);
> +		fs_dax_release(dax_logdev, mp);
>  	}
>  	if (mp->m_rtdev_targp) {
>  		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
> @@ -732,10 +732,10 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp, mp->m_rtdev_targp);
>  		xfs_blkdev_put(rtdev);
> -		fs_put_dax(dax_rtdev);
> +		fs_dax_release(dax_rtdev, mp);
>  	}
>  	xfs_free_buftarg(mp, mp->m_ddev_targp);
> -	fs_put_dax(dax_ddev);
> +	fs_dax_release(dax_ddev, mp);
>  }
>  
>  /*
> @@ -753,9 +753,9 @@ xfs_open_devices(
>  	struct xfs_mount	*mp)
>  {
>  	struct block_device	*ddev = mp->m_super->s_bdev;
> -	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
> -	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
> +	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
>  	struct block_device	*logdev = NULL, *rtdev = NULL;
> +	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
>  	int			error;
>  
>  	/*
> @@ -765,7 +765,7 @@ xfs_open_devices(
>  		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
>  		if (error)
>  			goto out;
> -		dax_logdev = fs_dax_get_by_bdev(logdev);
> +		dax_logdev = fs_dax_claim_bdev(logdev, mp);
>  	}
>  
>  	if (mp->m_rtname) {
> @@ -779,7 +779,7 @@ xfs_open_devices(
>  			error = -EINVAL;
>  			goto out_close_rtdev;
>  		}
> -		dax_rtdev = fs_dax_get_by_bdev(rtdev);
> +		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
>  	}
>  
>  	/*
> @@ -813,14 +813,14 @@ xfs_open_devices(
>  	xfs_free_buftarg(mp, mp->m_ddev_targp);
>   out_close_rtdev:
>  	xfs_blkdev_put(rtdev);
> -	fs_put_dax(dax_rtdev);
> +	fs_dax_release(dax_rtdev, mp);
>   out_close_logdev:
>  	if (logdev && logdev != ddev) {
>  		xfs_blkdev_put(logdev);
> -		fs_put_dax(dax_logdev);
> +		fs_dax_release(dax_logdev, mp);
>  	}
>   out:
> -	fs_put_dax(dax_ddev);
> +	fs_dax_release(dax_ddev, mp);
>  	return error;
>  }
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index ae27a7efe7ab..92a1d3ee1615 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -52,12 +52,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
>  	return dax_get_by_host(host);
>  }
>  
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> -{
> -	put_dax(dax_dev);
> -}
> -
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
> +void fs_dax_release(struct dax_device *dax_dev, void *owner);
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct block_device *bdev, struct writeback_control *wbc);
>  #else
> @@ -71,13 +67,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
>  	return NULL;
>  }
>  
> -static inline void fs_put_dax(struct dax_device *dax_dev)
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
>  {
> +	return NULL;
>  }
>  
> -static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> +static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
>  {
> -	return NULL;
>  }
>  
>  static inline int dax_writeback_mapping_range(struct address_space *mapping,
> @@ -91,6 +88,8 @@ int dax_read_lock(void);
>  void dax_read_unlock(int id);
>  struct dax_device *alloc_dax(void *private, const char *host,
>  		const struct dax_operations *ops);
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +		const struct dax_operations *ops, struct dev_pagemap *pgmap);
>  bool dax_alive(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 7b4899c06f49..02d6d042ee7f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -53,11 +53,19 @@ struct vmem_altmap {
>   * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>   * type. Any page of a process can be migrated to such memory. However no one
>   * should be allow to pin such memory so that it can always be evicted.
> + *
> + * MEMORY_DEVICE_FS_DAX:
> + * When MEMORY_DEVICE_HOST memory is represented by a device that can
> + * host a filesystem, for example /dev/pmem0, that filesystem can
> + * register for a callback when a page is idled. For the filesystem-dax
> + * case page idle callbacks are used to coordinate DMA vs
> + * hole-punch/truncate.
>   */
>  enum memory_type {
>  	MEMORY_DEVICE_HOST = 0,
>  	MEMORY_DEVICE_PRIVATE,
>  	MEMORY_DEVICE_PUBLIC,
> +	MEMORY_DEVICE_FS_DAX,
>  };
>  
>  /*
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops
  2018-03-29 15:40   ` Jan Kara
@ 2018-03-29 18:09     ` Christoph Hellwig
  2018-03-29 22:47       ` Dan Williams
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2018-03-29 18:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, linux-nvdimm, Theodore Ts'o, Andreas Dilger,
	linux-ext4, david, hch, linux-fsdevel, linux-xfs, linux-kernel

On Thu, Mar 29, 2018 at 05:40:35PM +0200, Jan Kara wrote:
> So ext4_direct_IO() for IS_DAX() files will just bail out. So could you
> just provide ext4_dax_direct_IO() which will bail out and use it here? With
> a similar comment as in xfs_vm_direct_IO() that open still needs this
> method set... Thanks!

In fact a common noop_direct_IO might make sense.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-29 16:02   ` Jan Kara
@ 2018-03-29 19:02     ` Dan Williams
  2018-03-29 23:02     ` Dan Williams
  1 sibling, 0 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-29 19:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Jeff Moyer, Matthew Wilcox, Ross Zwisler,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu, Mar 29, 2018 at 9:02 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 21-03-18 15:57:48, Dan Williams wrote:
>> Catch cases where extent unmap operations encounter pages that are
>> pinned / busy. Typically this is pinned pages that are under active dma.
>> This warning is a canary for potential data corruption as truncated
>> blocks could be allocated to a new file while the device is still
>> performing i/o.
>>
>> Here is an example of a collision that this implementation catches:
>>
>>  WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
>>  [..]
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0x6c/0xf0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.12+0x1af/0x200
>>   truncate_inode_pages_range+0x268/0x970
>>   ? tlb_gather_mmu+0x10/0x20
>>   ? up_write+0x1c/0x40
>>   ? unmap_mapping_range+0x73/0x140
>>   xfs_free_file_space+0x1b6/0x5b0 [xfs]
>>   ? xfs_file_fallocate+0x7f/0x320 [xfs]
>>   ? down_write_nested+0x40/0x70
>>   ? xfs_ilock+0x21d/0x2f0 [xfs]
>>   xfs_file_fallocate+0x162/0x320 [xfs]
>>   ? rcu_read_lock_sched_held+0x3f/0x70
>>   ? rcu_sync_lockdep_assert+0x2a/0x50
>>   ? __sb_start_write+0xd0/0x1b0
>>   ? vfs_fallocate+0x20c/0x270
>>   vfs_fallocate+0x154/0x270
>>   SyS_fallocate+0x43/0x80
>>   entry_SYSCALL_64_fastpath+0x1f/0x96
>>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Two comments when looking at this now:
>
>> +#define for_each_entry_pfn(entry, pfn, end_pfn) \
>> +     for (pfn = dax_radix_pfn(entry), \
>> +                     end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
>> +                     pfn < end_pfn; \
>> +                     pfn++)
>
> Why don't you declare 'end_pfn' inside the for() block? That way you don't
> have to pass the variable as an argument to for_each_entry_pfn(). It's not
> like you need end_pfn anywhere in the loop body, you just use it to cache
> loop termination index.

Agreed, good catch.

>
>> @@ -547,6 +599,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>>
>>       spin_lock_irq(&mapping->tree_lock);
>>       new_entry = dax_radix_locked_entry(pfn, flags);
>> +     if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
>> +             dax_disassociate_entry(entry, mapping, false);
>> +             dax_associate_entry(new_entry, mapping);
>> +     }
>
> I find it quite tricky that in case we pass zero page / empty entry into
> dax_[dis]associate_entry(), it will not do anything because
> dax_entry_size() will return 0. Can we add an explicit check into
> dax_[dis]associate_entry() or at least a comment there?

Ok, will do.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops
  2018-03-29 18:09     ` Christoph Hellwig
@ 2018-03-29 22:47       ` Dan Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2018-03-29 22:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Theodore Ts'o, Andreas Dilger,
	linux-ext4, david, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu, Mar 29, 2018 at 11:09 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 29, 2018 at 05:40:35PM +0200, Jan Kara wrote:
>> So ext4_direct_IO() for IS_DAX() files will just bail out. So could you
>> just provide ext4_dax_direct_IO() which will bail out and use it here? With
>> a similar comment as in xfs_vm_direct_IO() that open still needs this
>> method set... Thanks!
>
> In fact a common noop_direct_IO might make sense.

Ok, I introduced noop_direct_IO() in "fs, dax: prepare for
dax-specific address_space_operations", and cleaned up xfs, ext4, and
ext2 accordingly. Let me know if you want to see a resend of the
series with those changes. Otherwise this will appear in -next
shortly.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-29 16:02   ` Jan Kara
  2018-03-29 19:02     ` Dan Williams
@ 2018-03-29 23:02     ` Dan Williams
  2018-03-30  8:22       ` Jan Kara
  1 sibling, 1 reply; 37+ messages in thread
From: Dan Williams @ 2018-03-29 23:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Jeff Moyer, Matthew Wilcox, Ross Zwisler,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu, Mar 29, 2018 at 9:02 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 21-03-18 15:57:48, Dan Williams wrote:
[..]
> I find it quite tricky that in case we pass zero page / empty entry into
> dax_[dis]associate_entry(), it will not do anything because
> dax_entry_size() will return 0. Can we add an explicit check into
> dax_[dis]associate_entry() or at least a comment there?

How about the following, i.e. rename the loop helper to
for_each_dax_pfn() to make it clearer that we're only operating on
mapped pfns, and also add a comment to indicate the same:

/*
 * Iterate through all mapped pfns represented by an entry, i.e. skip
 * 'empty' and 'zero' entries.
 */
#define for_each_dax_pfn(entry, pfn) \
        for (pfn = dax_radix_pfn(entry); \
                        pfn < dax_radix_end_pfn(entry); pfn++)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-29 23:02     ` Dan Williams
@ 2018-03-30  8:22       ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-30  8:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Jeff Moyer, Matthew Wilcox, Ross Zwisler,
	Christoph Hellwig, david, linux-fsdevel, linux-xfs,
	Linux Kernel Mailing List

On Thu 29-03-18 16:02:45, Dan Williams wrote:
> On Thu, Mar 29, 2018 at 9:02 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 21-03-18 15:57:48, Dan Williams wrote:
> [..]
> > I find it quite tricky that in case we pass zero page / empty entry into
> > dax_[dis]associate_entry(), it will not do anything because
> > dax_entry_size() will return 0. Can we add an explicit check into
> > dax_[dis]associate_entry() or at least a comment there?
> 
> How about the following, i.e. rename the loop helper to
> for_each_dax_pfn() to make it clearer that we're only operating on
> mapped pfns, and also add a comment to indicate the same:
> 
> /*
>  * Iterate through all mapped pfns represented by an entry, i.e. skip
>  * 'empty' and 'zero' entries.
>  */
> #define for_each_dax_pfn(entry, pfn) \
>         for (pfn = dax_radix_pfn(entry); \
>                         pfn < dax_radix_end_pfn(entry); pfn++)

Maybe call it for_each_mapped_pfn()? Other than that it looks fine to me.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-21 22:57 ` [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
  2018-03-22  7:49   ` Christoph Hellwig
@ 2018-03-30 10:37   ` Jan Kara
  1 sibling, 0 replies; 37+ messages in thread
From: Jan Kara @ 2018-03-30 10:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, Jérôme Glisse, david, hch,
	linux-fsdevel, linux-xfs, linux-kernel, jack

On Wed 21-03-18 15:57:59, Dan Williams wrote:
> The HMM sub-system extended dev_pagemap to arrange a callback when a
> dev_pagemap managed page is freed. Since a dev_pagemap page is free /
> idle when its reference count is 1 it requires an additional branch to
> check the page-type at put_page() time. Given put_page() is a hot-path
> we do not want to incur that check if HMM is not in use, so a static
> branch is used to avoid that overhead when not necessary.
> 
> Now, the FS_DAX implementation wants to reuse this mechanism for
> receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
> static-key into a generic mechanism that either HMM or FS_DAX code paths
> can enable.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: "J�r�me Glisse" <jglisse@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c      |    2 ++
>  fs/Kconfig               |    1 +
>  include/linux/memremap.h |   20 ++-------------
>  include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
>  kernel/memremap.c        |   30 ++++++++++++++++++++---
>  mm/Kconfig               |    5 ++++
>  mm/hmm.c                 |   13 ++--------
>  mm/swap.c                |    3 ++
>  8 files changed, 84 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index ecefe9f7eb60..619b1ed6434c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
>  		return NULL;
>  	}
>  
> +	dev_pagemap_get_ops();
>  	pgmap->type = MEMORY_DEVICE_FS_DAX;
>  	pgmap->page_free = generic_dax_pagefree;
>  	pgmap->data = owner;
> @@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
>  	pgmap->type = MEMORY_DEVICE_HOST;
>  	pgmap->page_free = NULL;
>  	pgmap->data = NULL;
> +	dev_pagemap_put_ops();
>  	mutex_unlock(&devmap_lock);
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_release);
> diff --git a/fs/Kconfig b/fs/Kconfig
> index bc821a86d965..1f0832bbc32f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -38,6 +38,7 @@ config FS_DAX
>  	bool "Direct Access (DAX) support"
>  	depends on MMU
>  	depends on !(ARM || MIPS || SPARC)
> +	select DEV_PAGEMAP_OPS
>  	select FS_IOMAP
>  	select DAX
>  	help
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 02d6d042ee7f..9faf25d6abef 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -1,7 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_MEMREMAP_H_
>  #define _LINUX_MEMREMAP_H_
> -#include <linux/mm.h>
>  #include <linux/ioport.h>
>  #include <linux/percpu-refcount.h>
>  
> @@ -130,6 +129,9 @@ struct dev_pagemap {
>  	enum memory_type type;
>  };
>  
> +void dev_pagemap_get_ops(void);
> +void dev_pagemap_put_ops(void);
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> @@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> -
> -static inline bool is_zone_device_page(const struct page *page);
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
>  		struct dev_pagemap *pgmap)
> @@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
>  }
>  #endif /* CONFIG_ZONE_DEVICE */
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -static inline bool is_device_private_page(const struct page *page)
> -{
> -	return is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> -}
> -
> -static inline bool is_device_public_page(const struct page *page)
> -{
> -	return is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> -}
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -
>  static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
>  {
>  	if (pgmap)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad06d42adb1a..088c76bce360 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
>  }
>  #endif
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -void put_zone_device_private_or_public_page(struct page *page);
> -DECLARE_STATIC_KEY_FALSE(device_private_key);
> -#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
> -static inline bool is_device_private_page(const struct page *page);
> -static inline bool is_device_public_page(const struct page *page);
> -#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -static inline void put_zone_device_private_or_public_page(struct page *page)
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +void __put_devmap_managed_page(struct page *page);
> +DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> +static inline bool put_devmap_managed_page(struct page *page)
>  {
> +	if (!static_branch_unlikely(&devmap_managed_key))
> +		return false;
> +	if (!is_zone_device_page(page))
> +		return false;
> +	switch (page->pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +	case MEMORY_DEVICE_FS_DAX:
> +		__put_devmap_managed_page(page);
> +		return true;
> +	default:
> +		break;
> +	}
> +	return false;
>  }
> -#define IS_HMM_ENABLED 0
> +
>  static inline bool is_device_private_page(const struct page *page)
>  {
> -	return false;
> +	return is_zone_device_page(page) &&
> +		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
>  }
> +
>  static inline bool is_device_public_page(const struct page *page)
>  {
> +	return is_zone_device_page(page) &&
> +		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> +}
> +
> +#else /* CONFIG_DEV_PAGEMAP_OPS */
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
>  	return false;
>  }
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
>  
> +static inline bool is_device_private_page(const struct page *page)
> +{
> +	return false;
> +}
> +
> +static inline bool is_device_public_page(const struct page *page)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
>  
>  static inline void get_page(struct page *page)
>  {
> @@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
>  	page = compound_head(page);
>  
>  	/*
> -	 * For private device pages we need to catch refcount transition from
> -	 * 2 to 1, when refcount reach one it means the private device page is
> -	 * free and we need to inform the device driver through callback. See
> +	 * For devmap managed pages we need to catch refcount transition from
> +	 * 2 to 1, when refcount reach one it means the page is free and we
> +	 * need to inform the device driver through callback. See
>  	 * include/linux/memremap.h and HMM for details.
>  	 */
> -	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
> -	    unlikely(is_device_public_page(page)))) {
> -		put_zone_device_private_or_public_page(page);
> +	if (put_devmap_managed_page(page))
>  		return;
> -	}
>  
>  	if (put_page_testzero(page))
>  		__put_page(page);
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 4dd4274cabe2..72d0bb6fc47d 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -476,8 +476,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  }
>  #endif /* CONFIG_ZONE_DEVICE */
>  
> -#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
> -void put_zone_device_private_or_public_page(struct page *page)
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
> +EXPORT_SYMBOL_GPL(devmap_managed_key);
> +static atomic_t devmap_enable;
> +
> +/*
> + * Toggle the static key for ->page_free() callbacks when dev_pagemap
> + * pages go idle.
> + */
> +void dev_pagemap_get_ops(void)
> +{
> +	if (atomic_inc_return(&devmap_enable) == 1)
> +		static_branch_enable(&devmap_managed_key);
> +}
> +EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
> +
> +void dev_pagemap_put_ops(void)
> +{
> +	if (atomic_dec_and_test(&devmap_enable))
> +		static_branch_disable(&devmap_managed_key);
> +}
> +EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
> +
> +void __put_devmap_managed_page(struct page *page)
>  {
>  	int count = page_ref_dec_return(page);
>  
> @@ -497,5 +519,5 @@ void put_zone_device_private_or_public_page(struct page *page)
>  	} else if (!count)
>  		__put_page(page);
>  }
> -EXPORT_SYMBOL(put_zone_device_private_or_public_page);
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> +EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c782e8fb7235..dc32828984a3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -700,6 +700,9 @@ config ARCH_HAS_HMM
>  config MIGRATE_VMA_HELPER
>  	bool
>  
> +config DEV_PAGEMAP_OPS
> +	bool
> +
>  config HMM
>  	bool
>  	select MIGRATE_VMA_HELPER
> @@ -720,6 +723,7 @@ config DEVICE_PRIVATE
>  	bool "Unaddressable device memory (GPU memory, ...)"
>  	depends on ARCH_HAS_HMM
>  	select HMM
> +	select DEV_PAGEMAP_OPS
>  
>  	help
>  	  Allows creation of struct pages to represent unaddressable device
> @@ -730,6 +734,7 @@ config DEVICE_PUBLIC
>  	bool "Addressable device memory (like GPU memory)"
>  	depends on ARCH_HAS_HMM
>  	select HMM
> +	select DEV_PAGEMAP_OPS
>  
>  	help
>  	  Allows creation of struct pages to represent addressable device
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..4aa554e76d06 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -35,15 +35,6 @@
>  
>  #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
>  
> -#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
> -/*
> - * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
> - */
> -DEFINE_STATIC_KEY_FALSE(device_private_key);
> -EXPORT_SYMBOL(device_private_key);
> -#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -
> -
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> @@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>  	resource_size_t addr;
>  	int ret;
>  
> -	static_branch_enable(&device_private_key);
> +	dev_pagemap_get_ops();
>  
>  	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
>  				   GFP_KERNEL, dev_to_node(device));
> @@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
>  	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
>  		return ERR_PTR(-EINVAL);
>  
> -	static_branch_enable(&device_private_key);
> +	dev_pagemap_get_ops();
>  
>  	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
>  				   GFP_KERNEL, dev_to_node(device));
> diff --git a/mm/swap.c b/mm/swap.c
> index 0f17330dd0e5..eed846cfc8b8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -29,6 +29,7 @@
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
>  #include <linux/backing-dev.h>
> +#include <linux/memremap.h>
>  #include <linux/memcontrol.h>
>  #include <linux/gfp.h>
>  #include <linux/uio.h>
> @@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
>  						       flags);
>  				locked_pgdat = NULL;
>  			}
> -			put_zone_device_private_or_public_page(page);
> +			put_devmap_managed_page(page);
>  			continue;
>  		}
>  
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  2018-03-21 22:58 ` [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
  2018-03-22  7:25   ` Christoph Hellwig
@ 2018-03-30 15:16   ` Darrick J. Wong
  1 sibling, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2018-03-30 15:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Ross Zwisler, Dave Chinner, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-kernel, jack

On Wed, Mar 21, 2018 at 03:58:20PM -0700, Dan Williams wrote:
> In preparation for adding coordination between extent unmap operations
> and busy dax-pages, update xfs_break_layouts() to permit it to be called
> with the mmap lock held. This lock scheme will be required for
> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
> pages mapped into the file's address space). Breaking dax layouts will
> be added to xfs_break_layouts() in a future patch, for now this preps
> the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
> xfs_break_layouts().
> 
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_file.c  |    5 +----
>  fs/xfs/xfs_ioctl.c |    5 +----
>  fs/xfs/xfs_iops.c  |   10 +++++++---
>  fs/xfs/xfs_pnfs.c  |    3 ++-
>  4 files changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9ea08326f876..18edf04811d0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -768,7 +768,7 @@ xfs_file_fallocate(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	long			error;
>  	enum xfs_prealloc_flags	flags = 0;
> -	uint			iolock = XFS_IOLOCK_EXCL;
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	loff_t			new_size = 0;
>  	bool			do_file_insert = false;
>  
> @@ -782,9 +782,6 @@ xfs_file_fallocate(
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> -
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 89fb1eb80aae..4151fade4bb1 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -614,7 +614,7 @@ xfs_ioc_space(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct iattr		iattr;
>  	enum xfs_prealloc_flags	flags = 0;
> -	uint			iolock = XFS_IOLOCK_EXCL;
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	int			error;
>  
>  	/*
> @@ -648,9 +648,6 @@ xfs_ioc_space(
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> -
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
>  		break;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 951e84df5576..d23aa08426f9 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1028,13 +1028,17 @@ xfs_vn_setattr(
>  
>  	if (iattr->ia_valid & ATTR_SIZE) {
>  		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
> -		uint			iolock = XFS_IOLOCK_EXCL;
> +		uint			iolock;
> +
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  
>  		error = xfs_break_layouts(d_inode(dentry), &iolock);
> -		if (error)
> +		if (error) {
> +			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  			return error;
> +		}
>  
> -		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  		error = xfs_vn_setattr_size(dentry, iattr);
>  		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  	} else {
> diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
> index aa6c5c193f45..6ea7b0b55d02 100644
> --- a/fs/xfs/xfs_pnfs.c
> +++ b/fs/xfs/xfs_pnfs.c
> @@ -43,7 +43,8 @@ xfs_break_layouts(
>  	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>  		xfs_iunlock(ip, *iolock);
>  		error = break_layout(inode, true);
> -		*iolock = XFS_IOLOCK_EXCL;
> +		*iolock &= ~XFS_IOLOCK_SHARED;
> +		*iolock |= XFS_IOLOCK_EXCL;
>  		xfs_ilock(ip, *iolock);
>  	}
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2018-03-30 15:17 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21 22:57 [PATCH v7 00/14] dax: fix dma vs truncate/hole-punch Dan Williams
2018-03-21 22:57 ` [PATCH v7 01/14] dax: store pfns in the radix Dan Williams
2018-03-21 22:57 ` [PATCH v7 02/14] fs, dax: prepare for dax-specific address_space_operations Dan Williams
2018-03-29 15:28   ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 03/14] block, dax: remove dead code in blkdev_writepages() Dan Williams
2018-03-29 15:29   ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 04/14] xfs, dax: introduce xfs_dax_aops Dan Williams
2018-03-29 15:30   ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 05/14] ext4, dax: introduce ext4_dax_aops Dan Williams
2018-03-29 15:40   ` Jan Kara
2018-03-29 18:09     ` Christoph Hellwig
2018-03-29 22:47       ` Dan Williams
2018-03-21 22:57 ` [PATCH v7 06/14] ext2, dax: introduce ext2_dax_aops Dan Williams
2018-03-29 15:44   ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 07/14] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
2018-03-29 16:02   ` Jan Kara
2018-03-29 19:02     ` Dan Williams
2018-03-29 23:02     ` Dan Williams
2018-03-30  8:22       ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 08/14] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
2018-03-29 16:36   ` Jan Kara
2018-03-21 22:57 ` [PATCH v7 09/14] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-03-22  7:49   ` Christoph Hellwig
2018-03-30 10:37   ` Jan Kara
2018-03-21 22:58 ` [PATCH v7 10/14] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
2018-03-22  7:51   ` Christoph Hellwig
2018-03-21 22:58 ` [PATCH v7 11/14] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
2018-03-21 22:58 ` [PATCH v7 12/14] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
2018-03-22  7:25   ` Christoph Hellwig
2018-03-30 15:16   ` Darrick J. Wong
2018-03-21 22:58 ` [PATCH v7 13/14] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
2018-03-22  7:27   ` Christoph Hellwig
2018-03-22 15:50     ` Dan Williams
2018-03-21 22:58 ` [PATCH v7 14/14] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
2018-03-22  7:43   ` Christoph Hellwig
2018-03-22 16:28     ` Dan Williams
2018-03-22 17:02       ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).