nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch
@ 2018-03-15 15:51 Dan Williams
  2018-03-15 15:51 ` [PATCH v6 01/15] dax: store pfns in the radix Dan Williams
                   ` (14 more replies)
  0 siblings, 15 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, Matthew Wilcox, Darrick J. Wong, Dave Hansen,
	Dave Chinner, Jérôme Glisse, Andrew Morton,
	linux-kernel, linux-xfs, Andreas Dilger, Alexander Viro,
	Jan Kara, linux-fsdevel, Theodore Ts'o, linux-ext4, hch

Changes since v5 [1]:
* Split the introduction of dax-specific address_space_operations into
  its own patch, and place them in fs/libfs.c (Christoph)

* Kill some more straggling dead code implementing dax support for block
  devices.

* Mark devm_memremap_pages EXPORT_SYMBOL_GPL (Christoph)

* Introduce {xfs,ext4,ext2}_dax_writepages() and kill the dynamic check
  for IS_DAX() in the typical _writepages() implementations in these
  filesystems. (Christoph)

* Rework xfs_break_layouts() to assume the XFS_MMAPLOCK_EXCL is held at
  entry. (Christoph)

* Replace the XFS_BREAK_WRITE and XFS_BREAK_MAPS flags with the
  BREAK_WRITE and BREAK_TRUNCATE enum values since BREAK_WRITE is a
  subset of BREAK_TRUNCATE. (Christoph)

* Replace wait_for_atomic_one() with Peter's new wait_var_event()
  facility [2]. (Peter)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014585.html
[2]: https://patchwork.kernel.org/patch/10284383/

---

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

---

Dan Williams (15):
      dax: store pfns in the radix
      fs, dax: prepare for dax-specific address_space_operations
      block, dax: remove dead code in blkdev_writepages()
      xfs, dax: introduce xfs_dax_aops
      ext4, dax: introduce ext4_dax_aops
      ext2, dax: introduce ext2_dax_aops
      fs, dax: use page->mapping to warn if truncate collides with a busy page
      mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
      mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
      memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
      mm, fs, dax: handle layout changes to pinned dax mappings
      xfs: require mmap lock for xfs_break_layouts()
      xfs: communicate lock drop events from xfs_break_layouts()
      xfs: prepare xfs_break_layouts() for another layout type
      xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/super.c      |   96 ++++++++++++++++---
 drivers/nvdimm/pmem.c    |    3 -
 fs/Kconfig               |    1 
 fs/block_dev.c           |    5 -
 fs/dax.c                 |  232 ++++++++++++++++++++++++++++++++++++----------
 fs/ext2/ext2.h           |    1 
 fs/ext2/inode.c          |   43 +++++----
 fs/ext2/namei.c          |   18 ----
 fs/ext2/super.c          |    6 +
 fs/ext4/inode.c          |   38 ++++++--
 fs/ext4/super.c          |    6 +
 fs/libfs.c               |   27 +++++
 fs/xfs/xfs_aops.c        |   21 +++-
 fs/xfs/xfs_aops.h        |    1 
 fs/xfs/xfs_file.c        |   87 ++++++++++++++++-
 fs/xfs/xfs_inode.h       |   16 +++
 fs/xfs/xfs_ioctl.c       |    8 --
 fs/xfs/xfs_iops.c        |   21 +++-
 fs/xfs/xfs_pnfs.c        |   17 ++-
 fs/xfs/xfs_pnfs.h        |    4 -
 fs/xfs/xfs_super.c       |   20 ++--
 include/linux/dax.h      |   51 +++++++++-
 include/linux/fs.h       |    3 +
 include/linux/memremap.h |   28 ++----
 include/linux/mm.h       |   61 +++++++++---
 kernel/memremap.c        |   32 +++++-
 mm/Kconfig               |    5 +
 mm/gup.c                 |    5 +
 mm/hmm.c                 |   13 ---
 mm/swap.c                |    3 -
 30 files changed, 653 insertions(+), 219 deletions(-)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v6 01/15] dax: store pfns in the radix
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-15 15:51 ` [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations Dan Williams
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, david, linux-kernel, linux-xfs, linux-fsdevel, hch

In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |   15 +++++++--
 fs/dax.c            |   83 +++++++++++++++++++--------------------------------
 2 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
-	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-			|| pfn_t_devmap(pfn))
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+		/*
+		 * An arch that has enabled the pmem api should also
+		 * have its drivers support pfn_t_devmap()
+		 *
+		 * This is a developer warning and should not trigger in
+		 * production. dax_flush() will crash since it depends
+		 * on being able to do (page_address(pfn_to_page())).
+		 */
+		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+	} else if (pfn_t_devmap(pfn)) {
 		/* pass */;
-	else {
+	} else {
 		pr_debug("VFS (%s): error: dax support not enabled\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
 		goto put_unlocked;
 	/*
 	 * Entry got reallocated elsewhere? No need to writeback. We have to
-	 * compare sectors as we must not bail out due to difference in lockbit
+	 * compare pfns as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -712,33 +709,15 @@ static int dax_writeback_one(struct block_device *bdev,
 	/*
 	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
 	 * in the middle of a PMD, the 'index' we are given will be aligned to
-	 * the start index of the PMD, as will the sector we pull from
-	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
-	 * worry about partial PMD writebacks.
+	 * the start index of the PMD, as will the pfn we pull from 'entry'.
+	 * This allows us to flush for PMD_SIZE and not have to worry about
+	 * partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -749,8 +728,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -808,8 +785,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -877,6 +854,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -884,14 +862,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						 dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						 0, write && !sync);
 		if (IS_ERR(entry)) {
 			error = PTR_ERR(entry);
@@ -1280,13 +1258,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(ret))
 		goto fallback;
@@ -1409,8 +1389,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						RADIX_DAX_PMD, write && !sync);
 		if (IS_ERR(entry))
 			goto finish_iomap;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-15 15:51 ` [PATCH v6 01/15] dax: store pfns in the radix Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-16 18:59   ` Christoph Hellwig
  2018-03-15 15:51 ` [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages() Dan Williams
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/libfs.c          |   27 +++++++++++++++++++++++++++
 include/linux/dax.h |   12 +++++++++---
 include/linux/fs.h  |    3 +++
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 7ff3cb904acd..e49d0ac6f800 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1060,6 +1060,33 @@ int noop_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(noop_fsync);
 
+int noop_set_page_dirty(struct page *page)
+{
+	/*
+	 * Unlike __set_page_dirty_no_writeback that handles dirty page
+	 * tracking in the page object, dax does all dirty tracking in
+	 * the inode address_space in response to mkwrite faults. In the
+	 * dax case we only need to worry about potentially dirty CPU
+	 * caches, not dirty page cache pages to write back.
+	 *
+	 * This callback is defined to prevent fallback to
+	 * __set_page_dirty_buffers() in set_page_dirty().
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(noop_set_page_dirty);
+
+void noop_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length)
+{
+	/*
+	 * There is no page cache to invalidate in the dax case, however
+	 * we need this callback defined to prevent falling back to
+	 * block_invalidatepage() in do_invalidatepage().
+	 */
+}
+EXPORT_SYMBOL_GPL(noop_invalidatepage);
+
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..ae27a7efe7ab 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,7 @@ static inline void put_dax(struct dax_device *dax_dev)
 }
 #endif
 
+struct writeback_control;
 int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
 #if IS_ENABLED(CONFIG_FS_DAX)
 int __bdev_dax_supported(struct super_block *sb, int blocksize);
@@ -57,6 +58,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,12 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 {
 	return NULL;
 }
+
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 int dax_read_lock(void);
@@ -121,7 +130,4 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
-struct writeback_control;
-int dax_writeback_mapping_range(struct address_space *mapping,
-		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..b57db31d294d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3129,6 +3129,9 @@ extern int simple_rmdir(struct inode *, struct dentry *);
 extern int simple_rename(struct inode *, struct dentry *,
 			 struct inode *, struct dentry *, unsigned int);
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
+extern int noop_set_page_dirty(struct page *page);
+extern void noop_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length);
 extern int simple_empty(struct dentry *);
 extern int simple_readpage(struct file *file, struct page *page);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages()
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-03-15 15:51 ` [PATCH v6 01/15] dax: store pfns in the radix Dan Williams
  2018-03-15 15:51 ` [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-16 18:59   ` Christoph Hellwig
  2018-03-15 15:51 ` [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops Dan Williams
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe09ef9c21f3..846ee2d31781 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1946,11 +1946,6 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 static int blkdev_writepages(struct address_space *mapping,
 			     struct writeback_control *wbc)
 {
-	if (dax_mapping(mapping)) {
-		struct block_device *bdev = I_BDEV(mapping->host);
-
-		return dax_writeback_mapping_range(mapping, bdev, wbc);
-	}
 	return generic_writepages(mapping, wbc);
 }
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (2 preceding siblings ...)
  2018-03-15 15:51 ` [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages() Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-16 19:00   ` Christoph Hellwig
  2018-03-15 15:51 ` [PATCH v6 05/15] ext4, dax: introduce ext4_dax_aops Dan Williams
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_aops.c |   21 +++++++++++++++++----
 fs/xfs/xfs_aops.h |    1 +
 fs/xfs/xfs_iops.c |    5 ++++-
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..5f1f5948ecc2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1194,16 +1194,22 @@ xfs_vm_writepages(
 	int			ret;
 
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
-	if (dax_mapping(mapping))
-		return dax_writeback_mapping_range(mapping,
-				xfs_find_bdev_for_inode(mapping->host), wbc);
-
 	ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
 	if (wpc.ioend)
 		ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
 	return ret;
 }
 
+STATIC int
+xfs_dax_writepages(
+	struct address_space	*mapping,
+	struct writeback_control *wbc)
+{
+	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	return dax_writeback_mapping_range(mapping,
+			xfs_find_bdev_for_inode(mapping->host), wbc);
+}
+
 /*
  * Called to move a page into cleanable state - and from there
  * to be released. The page should already be clean. We always
@@ -1505,3 +1511,10 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+	.direct_IO		= xfs_vm_direct_IO,
+	.writepages		= xfs_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		inode->i_mapping->a_ops = &xfs_address_space_operations;
+		if (IS_DAX(inode))
+			inode->i_mapping->a_ops = &xfs_dax_aops;
+		else
+			inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 05/15] ext4, dax: introduce ext4_dax_aops
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (3 preceding siblings ...)
  2018-03-15 15:51 ` [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-15 15:51 ` [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops Dan Williams
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, david, linux-kernel, linux-xfs, Andreas Dilger,
	linux-fsdevel, Theodore Ts'o, linux-ext4, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c |   38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..f9884e41cb39 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2725,12 +2725,6 @@ static int ext4_writepages(struct address_space *mapping,
 	percpu_down_read(&sbi->s_journal_flag_rwsem);
 	trace_ext4_writepages(inode, wbc);
 
-	if (dax_mapping(mapping)) {
-		ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
-						  wbc);
-		goto out_writepages;
-	}
-
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
@@ -2955,6 +2949,27 @@ static int ext4_writepages(struct address_space *mapping,
 	return ret;
 }
 
+static int ext4_dax_writepages(struct address_space *mapping,
+			       struct writeback_control *wbc)
+{
+	int ret;
+	long nr_to_write = wbc->nr_to_write;
+	struct inode *inode = mapping->host;
+	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+		return -EIO;
+
+	percpu_down_read(&sbi->s_journal_flag_rwsem);
+	trace_ext4_writepages(inode, wbc);
+
+	ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+	trace_ext4_writepages_result(inode, wbc, ret,
+				     nr_to_write - wbc->nr_to_write);
+	percpu_up_read(&sbi->s_journal_flag_rwsem);
+	return ret;
+}
+
 static int ext4_nonda_switch(struct super_block *sb)
 {
 	s64 free_clusters, dirty_clusters;
@@ -3946,6 +3961,13 @@ static const struct address_space_operations ext4_da_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static const struct address_space_operations ext4_dax_aops = {
+	.direct_IO		= ext4_direct_IO,
+	.writepages		= ext4_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
+
 void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3980,9 @@ void ext4_set_aops(struct inode *inode)
 	default:
 		BUG();
 	}
-	if (test_opt(inode->i_sb, DELALLOC))
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext4_dax_aops;
+	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (4 preceding siblings ...)
  2018-03-15 15:51 ` [PATCH v6 05/15] ext4, dax: introduce ext4_dax_aops Dan Williams
@ 2018-03-15 15:51 ` Dan Williams
  2018-03-18  4:02   ` kbuild test robot
  2018-03-18  4:02   ` [RFC PATCH] ext2, dax: ext2_dax_aops can be static kbuild test robot
  2018-03-15 15:52 ` [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:51 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, david, hch, linux-xfs, linux-fsdevel, jack, ross.zwisler

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/ext2.h  |    1 +
 fs/ext2/inode.c |   43 +++++++++++++++++++++++++++----------------
 fs/ext2/namei.c |   18 ++----------------
 3 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..9590712f2f88 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -952,17 +952,16 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-#ifdef CONFIG_FS_DAX
-	if (dax_mapping(mapping)) {
-		return dax_writeback_mapping_range(mapping,
-						   mapping->host->i_sb->s_bdev,
-						   wbc);
-	}
-#endif
-
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static int
+ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+	return dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, wbc);
+}
+
 const struct address_space_operations ext2_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
@@ -990,6 +989,13 @@ const struct address_space_operations ext2_nobh_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+const struct address_space_operations ext2_dax_aops = {
+	.direct_IO		= ext2_direct_IO,
+	.writepages		= ext2_dax_writepages,
+	.set_page_dirty		= noop_set_page_dirty,
+	.invalidatepage		= noop_invalidatepage,
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1394,18 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+	inode->i_op = &ext2_file_inode_operations;
+	inode->i_fop = &ext2_file_operations;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext2_dax_aops;
+	else if (test_opt(inode->i_sb, NOBH))
+		inode->i_mapping->a_ops = &ext2_nobh_aops;
+	else
+		inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
 	struct ext2_inode_info *ei;
@@ -1480,14 +1498,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 		ei->i_data[n] = raw_inode->i_block[n];
 
 	if (S_ISREG(inode->i_mode)) {
-		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, NOBH)) {
-			inode->i_mapping->a_ops = &ext2_nobh_aops;
-			inode->i_fop = &ext2_file_operations;
-		} else {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_file_operations;
-		}
+		ext2_set_file_ops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	return ext2_add_nondir(dentry, inode);
 }
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	d_tmpfile(dentry, inode);
 	unlock_new_inode(inode);

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (5 preceding siblings ...)
  2018-03-15 15:51 ` [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-18  6:26   ` kbuild test robot
  2018-03-15 15:52 ` [PATCH v6 08/15] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Matthew Wilcox, david, linux-kernel, linux-xfs, linux-fsdevel, hch

Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically this is pinned pages that are under active dma.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..9ba043eb6294 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -298,6 +298,56 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->mapping);
+		page->mapping = mapping;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		page->mapping = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -404,6 +454,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -453,6 +504,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -547,6 +599,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, mapping, false);
+		dax_associate_entry(new_entry, mapping);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 08/15] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (6 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-15 15:52 ` [PATCH v6 09/15] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, david, linux-kernel, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
are going to a device driver to manage bounce pages in device-memory in
the filesystem-dax case we will call back to filesystem specified
callback.

Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.

Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c    |    3 +-
 fs/ext2/super.c          |    6 ++-
 fs/ext4/super.c          |    6 ++-
 fs/xfs/xfs_super.c       |   20 ++++++------
 include/linux/dax.h      |   17 +++++-----
 include/linux/memremap.h |    8 +++++
 7 files changed, 103 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..ecefe9f7eb60 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,66 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+	/* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+			|| pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f8dcc52ca6..e6d7351f3379 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+			&pmem->pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7666c065b96f..6ae20e319bc4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 39bf464c35f1..315a323729e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -952,7 +952,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3398,7 +3398,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4408,7 +4408,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93588ea3d3d2..ef7dd7148c0b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -724,7 +724,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -732,10 +732,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -753,9 +753,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -765,7 +765,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -779,7 +779,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -813,14 +813,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ae27a7efe7ab..92a1d3ee1615 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,12 +52,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return dax_get_by_host(host);
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 #else
@@ -71,13 +67,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return NULL;
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
 {
+	return NULL;
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
-	return NULL;
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
@@ -91,6 +88,8 @@ int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 09/15] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (7 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 08/15] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-15 15:52 ` [PATCH v6 10/15] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, david, linux-kernel, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    2 ++
 fs/Kconfig               |    1 +
 include/linux/memremap.h |   20 ++-------------
 include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
 kernel/memremap.c        |   30 ++++++++++++++++++++---
 mm/Kconfig               |    5 ++++
 mm/hmm.c                 |   13 ++--------
 mm/swap.c                |    3 ++
 8 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecefe9f7eb60..619b1ed6434c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
 		return NULL;
 	}
 
+	dev_pagemap_get_ops();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	dev_pagemap_put_ops();
 	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1f0832bbc32f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..9faf25d6abef 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -130,6 +129,9 @@ struct dev_pagemap {
 	enum memory_type type;
 };
 
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..088c76bce360 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
 }
-#define IS_HMM_ENABLED 0
+
 static inline bool is_device_private_page(const struct page *page)
 {
-	return false;
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool put_devmap_managed_page(struct page *page)
+{
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..72d0bb6fc47d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -476,8 +476,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -497,5 +519,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..dc32828984a3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -720,6 +723,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -730,6 +734,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..4aa554e76d06 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -996,7 +987,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1090,7 +1081,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 0f17330dd0e5..eed846cfc8b8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -744,7 +745,7 @@ void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 10/15] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (8 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 09/15] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-15 15:52 ` [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, jack, david, linux-kernel, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

The devm_memremap_pages() facility is tightly integrated with the
kernel's memory hotplug functionality. It injects an altmap argument
deep into the architecture specific vmemmap implementation to allow
allocating from specific reserved pages, and it has Linux specific
assumptions about page structure reference counting relative to
get_user_pages() and get_user_pages_fast(). It was an oversight that
this was not marked EXPORT_SYMBOL_GPL from the outset.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 72d0bb6fc47d..766cb9a487ae 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -430,7 +430,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	devres_free(pgmap);
 	return ERR_PTR(error);
 }
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (9 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 10/15] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-16 19:01   ` Christoph Hellwig
  2018-03-17 22:14   ` kbuild test robot
  2018-03-15 15:52 ` [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts() Dan Williams
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Jeff Moyer, Dave Chinner, Matthew Wilcox,
	Alexander Viro, Darrick J. Wong, Ross Zwisler, Dave Hansen,
	Andrew Morton, Christoph Hellwig, linux-xfs, linux-fsdevel

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    2 +
 fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   30 ++++++++++++++++
 mm/gup.c            |    5 +++
 4 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 619b1ed6434c..17323bc6d441 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_var(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index 9ba043eb6294..29425c197154 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -348,6 +348,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -489,6 +502,85 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	pgoff_t	indices[PAGEVEC_SIZE];
+	struct page *page = NULL;
+	struct pagevec pvec;
+	pgoff_t	index, end;
+	unsigned i;
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+		return NULL;
+
+	pagevec_init(&pvec);
+	index = 0;
+	end = -1;
+	/*
+	 * Flush dax_layout_lock() sections to ensure all possible page
+	 * references have been taken, or otherwise arrange for faults
+	 * to block on the filesystem lock that is taken for
+	 * establishing new mappings.
+	 */
+	unmap_mapping_range(mapping, 0, 0, 1);
+	synchronize_rcu();
+
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dax_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+			if (page)
+				break;
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+
+		if (page)
+			break;
+	}
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -513,6 +605,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 92a1d3ee1615..bf1f30c1c8cc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -56,6 +56,18 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
 void fs_dax_release(struct dax_device *dax_dev, void *owner);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+
+static inline void dax_layout_lock(void)
+{
+	rcu_read_lock();
+}
+
+static inline void dax_layout_unlock(void)
+{
+	rcu_read_unlock();
+}
+
+struct page *dax_layout_busy_page(struct address_space *mapping);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -82,6 +94,19 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline void dax_layout_lock(void)
+{
+}
+
+static inline void dax_layout_unlock(void)
+{
+}
+
+static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return NULL;
+}
 #endif
 
 int dax_read_lock(void);
@@ -111,6 +136,11 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
+static inline struct page *refcount_to_page(atomic_t *c)
+{
+	return container_of(c, struct page, _refcount);
+}
+
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
diff --git a/mm/gup.c b/mm/gup.c
index 1b46e6e74881..a81efac6983a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -693,7 +694,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		dax_layout_lock();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_layout_unlock();
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,
@@ -1809,7 +1812,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
+		dax_layout_lock();
 		gup_pgd_range(addr, end, write, pages, &nr);
+		dax_layout_unlock();
 		local_irq_enable();
 		ret = nr;
 	}

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (10 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-16 19:04   ` Christoph Hellwig
  2018-03-19 17:33   ` Darrick J. Wong
  2018-03-15 15:52 ` [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts() Dan Williams
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

In preparation for adding coordination between truncate operations and
busy dax-pages, extend xfs_break_layouts() to assume it must be called
with the mmap lock held. This locking scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space).

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   14 +++++++++-----
 fs/xfs/xfs_ioctl.c |    5 +----
 fs/xfs/xfs_iops.c  |   10 +++++++---
 fs/xfs/xfs_pnfs.c  |    6 ++++--
 4 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..ba969019bf26 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,9 +350,16 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	*iolock |= XFS_MMAPLOCK_EXCL;
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error) {
+		*iolock &= ~XFS_MMAPLOCK_EXCL;
+		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 		return error;
+	}
+	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+	*iolock &= ~XFS_MMAPLOCK_EXCL;
 
 	/*
 	 * For changing security info in file_remove_privs() we need i_rwsem
@@ -768,7 +775,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = false;
 
@@ -782,9 +789,6 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..4151fade4bb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -614,7 +614,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -648,9 +648,6 @@ xfs_ioc_space(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 951e84df5576..d23aa08426f9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1028,13 +1028,17 @@ xfs_vn_setattr(
 
 	if (iattr->ia_valid & ATTR_SIZE) {
 		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
-		uint			iolock = XFS_IOLOCK_EXCL;
+		uint			iolock;
+
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
 		error = xfs_break_layouts(d_inode(dentry), &iolock);
-		if (error)
+		if (error) {
+			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;
+		}
 
-		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		error = xfs_vn_setattr_size(dentry, iattr);
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..9fe661c2d59c 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -38,12 +38,14 @@ xfs_break_layouts(
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error;
 
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
+				| XFS_MMAPLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
 		error = break_layout(inode, true);
-		*iolock = XFS_IOLOCK_EXCL;
+		*iolock &= ~XFS_IOLOCK_SHARED;
+		*iolock |= XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts()
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (11 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts() Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-16 19:08   ` Christoph Hellwig
  2018-03-15 15:52 ` [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
  2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  14 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

In preparation for adding a new layout type, teach xfs_break_layouts()
to return a positive number if it needed to drop locks while trying to
break leases. For all layouts to be successfully broken each layout type
needs to be able to assert that the layouts were broken with the locks
held.

The existing a xfs_break_layouts() is pushed down a level to
xfs_break_leased_layouts() and the new xfs_break_layouts() will
coordinate interpreting the return code from the low level 'break'
helpers.

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   17 +++++++++++++++++
 fs/xfs/xfs_inode.h |    2 ++
 fs/xfs/xfs_ioctl.c |    1 -
 fs/xfs/xfs_iops.c  |    1 -
 fs/xfs/xfs_pnfs.c  |   15 ++++++++-------
 fs/xfs/xfs_pnfs.h  |    4 ++--
 6 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ba969019bf26..5742d395a4e4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -759,6 +759,23 @@ xfs_file_write_iter(
 	return ret;
 }
 
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			ret;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
+				| XFS_MMAPLOCK_EXCL));
+
+	ret = xfs_break_leased_layouts(inode, iolock);
+	if (ret > 0)
+		ret = 0;
+	return ret;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..74c63f3a720f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -447,6 +447,8 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
+int	xfs_break_layouts(struct inode *inode, uint *iolock);
+
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4151fade4bb1..d70a1919e787 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,6 @@
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
-#include "xfs_pnfs.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d23aa08426f9..78eb56d447df 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 
 #include <linux/capability.h>
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 9fe661c2d59c..49784828929d 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -31,25 +31,26 @@
  * rules in the page fault path we don't bother.
  */
 int
-xfs_break_layouts(
+xfs_break_leased_layouts(
 	struct inode		*inode,
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error;
-
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
-				| XFS_MMAPLOCK_EXCL));
+	int			did_unlock = 0;
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock &= ~XFS_IOLOCK_SHARED;
 		*iolock |= XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	if (error < 0)
+		return error;
+	return did_unlock;
 }
 
 /*
@@ -122,8 +123,8 @@ xfs_fs_map_blocks(
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more
-	 * complicated.  See the comment near xfs_break_layouts for a detailed
-	 * explanation.
+	 * complicated.  See the comment near xfs_break_leased_layouts
+	 * for a detailed explanation.
 	 */
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index bf45951e28fe..12f46fe6d902 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -9,10 +9,10 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
 
-int xfs_break_layouts(struct inode *inode, uint *iolock);
+int xfs_break_leased_layouts(struct inode *inode, uint *iolock);
 #else
 static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
+xfs_break_leased_layouts(struct inode *inode, uint *iolock)
 {
 	return 0;
 }

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (12 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts() Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-16 19:08   ` Christoph Hellwig
  2018-03-19 17:45   ` Darrick J. Wong
  2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  14 siblings, 2 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: jack, Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, hch

When xfs is operating as the back-end of a pNFS block server, it prevents
collisions between local and remote operations by requiring a lease to
be held for remotely accessed blocks. Local filesystem operations break
those leases before writing or mutating the extent map of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_TRUNCATE are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_TRUNCATE case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
be recalling FL_LAYOUT leases, BREAK_TRUNCATE breaks additionally
require waiting for busy dax-pages to go idle.

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   23 +++++++++++++++++------
 fs/xfs/xfs_inode.h |   18 ++++++++++++++++--
 fs/xfs/xfs_ioctl.c |    2 +-
 fs/xfs/xfs_iops.c  |    5 +++--
 4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5742d395a4e4..399c5221f101 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -352,7 +352,7 @@ xfs_file_aio_write_checks(
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	*iolock |= XFS_MMAPLOCK_EXCL;
-	error = xfs_break_layouts(inode, iolock);
+	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
 	if (error) {
 		*iolock &= ~XFS_MMAPLOCK_EXCL;
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
@@ -762,7 +762,8 @@ xfs_file_write_iter(
 int
 xfs_break_layouts(
 	struct inode		*inode,
-	uint			*iolock)
+	uint			*iolock,
+	enum layout_break_reason reason)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
@@ -770,9 +771,19 @@ xfs_break_layouts(
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
 				| XFS_MMAPLOCK_EXCL));
 
-	ret = xfs_break_leased_layouts(inode, iolock);
-	if (ret > 0)
-		ret = 0;
+	switch (reason) {
+	case BREAK_TRUNCATE:
+		/* fall through */
+	case BREAK_WRITE:
+		ret = xfs_break_leased_layouts(inode, iolock);
+		if (ret > 0)
+			ret = 0;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
 	return ret;
 }
 
@@ -802,7 +813,7 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 74c63f3a720f..1a66c7afcf45 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 					>> XFS_ILOCK_SHIFT)
 
 /*
+ * Layouts are broken in the BREAK_WRITE case to ensure that
+ * layout-holders do not collide with local writes. Additionally,
+ * layouts are broken in the BREAK_TRUNCATE case to make sure the
+ * layout-holder has a consistent view of the file's extent map. While
+ * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
+ * BREAK_TRUNCATE breaks additionally require waiting for busy dax-pages
+ * to go idle.
+ */
+enum layout_break_reason {
+        BREAK_WRITE,
+        BREAK_TRUNCATE,
+};
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,8 +461,8 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
-int	xfs_break_layouts(struct inode *inode, uint *iolock);
-
+int	xfs_break_layouts(struct inode *inode, uint *iolock,
+		enum layout_break_reason reason);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d70a1919e787..847a67186d95 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -643,7 +643,7 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 78eb56d447df..f9fcadb5b555 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1026,13 +1026,14 @@ xfs_vn_setattr(
 	int			error;
 
 	if (iattr->ia_valid & ATTR_SIZE) {
-		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
+		struct inode		*inode = d_inode(dentry);
+		struct xfs_inode	*ip = XFS_I(inode);
 		uint			iolock;
 
 		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-		error = xfs_break_layouts(d_inode(dentry), &iolock);
+		error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
 		if (error) {
 			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (13 preceding siblings ...)
  2018-03-15 15:52 ` [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-15 15:52 ` Dan Williams
  2018-03-16 19:09   ` Christoph Hellwig
                     ` (2 more replies)
  14 siblings, 3 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-15 15:52 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Dave Chinner, Darrick J. Wong, Ross Zwisler,
	Christoph Hellwig, linux-xfs, linux-fsdevel

xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
log held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page.

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c |   67 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 54 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 399c5221f101..2ccdbb19e31a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -759,6 +759,38 @@ xfs_file_write_iter(
 	return ret;
 }
 
+static void
+xfs_wait_var_event(
+	struct inode		*inode,
+	uint			iolock)
+{
+	struct xfs_inode        *ip = XFS_I(inode);
+
+	xfs_iunlock(ip, iolock);
+	schedule();
+	xfs_ilock(ip, iolock);
+}
+
+static int
+xfs_break_dax_layouts(
+	struct inode		*inode,
+	uint			iolock)
+{
+	struct page		*page;
+	int			ret;
+
+	page = dax_layout_busy_page(inode->i_mapping);
+	if (!page)
+		return 0;
+
+	ret = ___wait_var_event(&page->_refcount,
+			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+			0, 0, xfs_wait_var_event(inode, iolock));
+	if (ret < 0)
+		return ret;
+	return 1;
+}
+
 int
 xfs_break_layouts(
 	struct inode		*inode,
@@ -766,23 +798,32 @@ xfs_break_layouts(
 	enum layout_break_reason reason)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			ret;
+	int			ret = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
 				| XFS_MMAPLOCK_EXCL));
 
-	switch (reason) {
-	case BREAK_TRUNCATE:
-		/* fall through */
-	case BREAK_WRITE:
-		ret = xfs_break_leased_layouts(inode, iolock);
-		if (ret > 0)
-			ret = 0;
-		break;
-	default:
-		ret = -EINVAL;
-		break;
-	}
+	do {
+		switch (reason) {
+		case BREAK_TRUNCATE:
+			ret = xfs_break_dax_layouts(inode, *iolock);
+			/* fall through */
+		case BREAK_WRITE:
+			if (ret != 0)
+				break;
+			ret = xfs_break_leased_layouts(inode, iolock);
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+		/*
+		 * This loop terminates when either layout break attempt
+		 * returns an error, or both layout break attempts
+		 * return 0, i.e. layouts are verified broken while
+		 * holding all required locks.
+		 */
+	} while (ret > 0);
 
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations
  2018-03-15 15:51 ` [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations Dan Williams
@ 2018-03-16 18:59   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 18:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages()
  2018-03-15 15:51 ` [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages() Dan Williams
@ 2018-03-16 18:59   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 18:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops
  2018-03-15 15:51 ` [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops Dan Williams
@ 2018-03-16 19:00   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-03-15 15:52 ` [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-03-16 19:01   ` Christoph Hellwig
  2018-03-17 22:14   ` kbuild test robot
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Jan Kara, Andrew Morton, Matthew Wilcox,
	linux-nvdimm, Dave Chinner, linux-kernel, Christoph Hellwig,
	linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong

Looks good (at least up to my comprehension of the mm code :))

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-15 15:52 ` [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts() Dan Williams
@ 2018-03-16 19:04   ` Christoph Hellwig
  2018-03-16 19:10     ` Dan Williams
  2018-03-19 17:33   ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Darrick J. Wong, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
> In preparation for adding coordination between truncate operations and
> busy dax-pages, extend xfs_break_layouts() to assume it must be called
> with the mmap lock held. This locking scheme will be required for
> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
> pages mapped into the file's address space).

This requirement wasn't really there in the last series, why do we
require it now?

As far as I can tell all we'd need is to just drop this assert:

> -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> +				| XFS_MMAPLOCK_EXCL));

entirely.

>  	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>  		xfs_iunlock(ip, *iolock);
>  		error = break_layout(inode, true);
> -		*iolock = XFS_IOLOCK_EXCL;
> +		*iolock &= ~XFS_IOLOCK_SHARED;
> +		*iolock |= XFS_IOLOCK_EXCL;
>  		xfs_ilock(ip, *iolock);

And take this one hunk from your patch.

To enable the DAX use case.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts()
  2018-03-15 15:52 ` [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts() Dan Williams
@ 2018-03-16 19:08   ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Darrick J. Wong, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:39AM -0700, Dan Williams wrote:
> In preparation for adding a new layout type, teach xfs_break_layouts()
> to return a positive number if it needed to drop locks while trying to
> break leases. For all layouts to be successfully broken each layout type
> needs to be able to assert that the layouts were broken with the locks
> held.
> 
> The existing a xfs_break_layouts() is pushed down a level to
> xfs_break_leased_layouts() and the new xfs_break_layouts() will
> coordinate interpreting the return code from the low level 'break'
> helpers.

With that the subject line is rather confusing, given that the
externally visible xfs_break_layouts does not communicate the lock
drop events.  So maybe this should just be titled something about
refactoring.  Or just merged into the next patch which reshuffles
everything again anyway.

>  int
> -xfs_break_layouts(
> +xfs_break_leased_layouts(
>  	struct inode		*inode,
>  	uint			*iolock)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	int			error;
> -
> -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> -				| XFS_MMAPLOCK_EXCL));
> +	int			did_unlock = 0;
>  
>  	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>  		xfs_iunlock(ip, *iolock);
> +		did_unlock = 1;
>  		error = break_layout(inode, true);
>  		*iolock &= ~XFS_IOLOCK_SHARED;
>  		*iolock |= XFS_IOLOCK_EXCL;
>  		xfs_ilock(ip, *iolock);
>  	}
>  
> -	return error;
> +	if (error < 0)
> +		return error;
> +	return did_unlock;

And I suspect the cleaner interface would be to just pass a 
bool *did_unlock argument.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-15 15:52 ` [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-03-16 19:08   ` Christoph Hellwig
  2018-03-19 17:45   ` Darrick J. Wong
  1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Darrick J. Wong, linux-nvdimm, Dave Chinner, linux-kernel,
	linux-xfs, linux-fsdevel, Christoph Hellwig

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
@ 2018-03-16 19:09   ` Christoph Hellwig
  2018-03-17 22:11   ` kbuild test robot
  2018-03-17 23:47   ` kbuild test robot
  2 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-16 19:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, linux-fsdevel, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-16 19:04   ` Christoph Hellwig
@ 2018-03-16 19:10     ` Dan Williams
  0 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-16 19:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel

On Fri, Mar 16, 2018 at 12:04 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
>> In preparation for adding coordination between truncate operations and
>> busy dax-pages, extend xfs_break_layouts() to assume it must be called
>> with the mmap lock held. This locking scheme will be required for
>> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
>> pages mapped into the file's address space).
>
> This requirement wasn't really there in the last series, why do we
> require it now?

It seems I misinterpreted your feedback.

>
> As far as I can tell all we'd need is to just drop this assert:
>
>> -     ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>> +     ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
>> +                             | XFS_MMAPLOCK_EXCL));
>
> entirely.
>
>>       while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>>               xfs_iunlock(ip, *iolock);
>>               error = break_layout(inode, true);
>> -             *iolock = XFS_IOLOCK_EXCL;
>> +             *iolock &= ~XFS_IOLOCK_SHARED;
>> +             *iolock |= XFS_IOLOCK_EXCL;
>>               xfs_ilock(ip, *iolock);
>
> And take this one hunk from your patch.
>
> To enable the DAX use case.

Yeah, that looks good to me.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  2018-03-16 19:09   ` Christoph Hellwig
@ 2018-03-17 22:11   ` kbuild test robot
  2018-03-17 23:47   ` kbuild test robot
  2 siblings, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-17 22:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, kbuild-all, linux-fsdevel,
	Christoph Hellwig

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc5]
[cannot apply to next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/dax-fix-dma-vs-truncate-hole-punch/20180318-050250
config: x86_64-randconfig-x014-201811 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/xfs/xfs_file.c: In function 'xfs_break_dax_layouts':
>> fs/xfs/xfs_file.c:786:8: error: implicit declaration of function '___wait_var_event'; did you mean 'xfs_wait_var_event'? [-Werror=implicit-function-declaration]
     ret = ___wait_var_event(&page->_refcount,
           ^~~~~~~~~~~~~~~~~
           xfs_wait_var_event
>> fs/xfs/xfs_file.c:788:10: error: invalid use of void expression
       0, 0, xfs_wait_var_event(inode, iolock));
             ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +786 fs/xfs/xfs_file.c

   773	
   774	static int
   775	xfs_break_dax_layouts(
   776		struct inode		*inode,
   777		uint			iolock)
   778	{
   779		struct page		*page;
   780		int			ret;
   781	
   782		page = dax_layout_busy_page(inode->i_mapping);
   783		if (!page)
   784			return 0;
   785	
 > 786		ret = ___wait_var_event(&page->_refcount,
   787				atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
 > 788				0, 0, xfs_wait_var_event(inode, iolock));
   789		if (ret < 0)
   790			return ret;
   791		return 1;
   792	}
   793	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-03-15 15:52 ` [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
  2018-03-16 19:01   ` Christoph Hellwig
@ 2018-03-17 22:14   ` kbuild test robot
  1 sibling, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-17 22:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, Jan Kara, Andrew Morton, Matthew Wilcox,
	linux-nvdimm, Dave Chinner, linux-kernel, Christoph Hellwig,
	linux-xfs, kbuild-all, linux-fsdevel, Darrick J. Wong,
	Alexander Viro

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc5 next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/dax-fix-dma-vs-truncate-hole-punch/20180318-050250
config: x86_64-randconfig-x010-201811 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/dax/super.c: In function 'generic_dax_pagefree':
>> drivers/dax/super.c:170:2: error: implicit declaration of function 'wake_up_var'; did you mean 'wake_up_nr'? [-Werror=implicit-function-declaration]
     wake_up_var(&page->_refcount);
     ^~~~~~~~~~~
     wake_up_nr
   cc1: some warnings being treated as errors

vim +170 drivers/dax/super.c

   166	
   167	#if IS_ENABLED(CONFIG_FS_DAX)
   168	static void generic_dax_pagefree(struct page *page, void *data)
   169	{
 > 170		wake_up_var(&page->_refcount);
   171	}
   172	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts()
  2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  2018-03-16 19:09   ` Christoph Hellwig
  2018-03-17 22:11   ` kbuild test robot
@ 2018-03-17 23:47   ` kbuild test robot
  2 siblings, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-17 23:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	linux-kernel, linux-xfs, kbuild-all, linux-fsdevel,
	Christoph Hellwig

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc5]
[cannot apply to next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/dax-fix-dma-vs-truncate-hole-punch/20180318-050250
config: x86_64-randconfig-s2-03180641 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs//xfs/xfs_file.c: In function 'xfs_break_dax_layouts':
>> fs//xfs/xfs_file.c:786:8: error: implicit declaration of function '___wait_var_event' [-Werror=implicit-function-declaration]
     ret = ___wait_var_event(&page->_refcount,
           ^~~~~~~~~~~~~~~~~
   fs//xfs/xfs_file.c:788:4: error: invalid use of void expression
       0, 0, xfs_wait_var_event(inode, iolock));
       ^
   cc1: some warnings being treated as errors
--
   drivers//dax/super.c: In function 'generic_dax_pagefree':
>> drivers//dax/super.c:170:2: error: implicit declaration of function 'wake_up_var' [-Werror=implicit-function-declaration]
     wake_up_var(&page->_refcount);
     ^~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/___wait_var_event +786 fs//xfs/xfs_file.c

   773	
   774	static int
   775	xfs_break_dax_layouts(
   776		struct inode		*inode,
   777		uint			iolock)
   778	{
   779		struct page		*page;
   780		int			ret;
   781	
   782		page = dax_layout_busy_page(inode->i_mapping);
   783		if (!page)
   784			return 0;
   785	
 > 786		ret = ___wait_var_event(&page->_refcount,
   787				atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
   788				0, 0, xfs_wait_var_event(inode, iolock));
   789		if (ret < 0)
   790			return ret;
   791		return 1;
   792	}
   793	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops
  2018-03-15 15:51 ` [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops Dan Williams
@ 2018-03-18  4:02   ` kbuild test robot
  2018-03-18  4:02   ` [RFC PATCH] ext2, dax: ext2_dax_aops can be static kbuild test robot
  1 sibling, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-18  4:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, david, linux-kernel, linux-xfs, kbuild-all,
	Jan Kara, linux-fsdevel, hch

Hi Dan,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.16-rc5 next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/dax-fix-dma-vs-truncate-hole-punch/20180318-050250
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> fs/ext2/inode.c:992:39: sparse: symbol 'ext2_dax_aops' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC PATCH] ext2, dax: ext2_dax_aops can be static
  2018-03-15 15:51 ` [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops Dan Williams
  2018-03-18  4:02   ` kbuild test robot
@ 2018-03-18  4:02   ` kbuild test robot
  1 sibling, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-18  4:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, david, linux-kernel, linux-xfs, kbuild-all,
	Jan Kara, linux-fsdevel, hch


Fixes: 97affa6a9d1c ("ext2, dax: introduce ext2_dax_aops")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 inode.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9590712..8ea394c8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -989,7 +989,7 @@ const struct address_space_operations ext2_nobh_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
-const struct address_space_operations ext2_dax_aops = {
+static const struct address_space_operations ext2_dax_aops = {
 	.direct_IO		= ext2_direct_IO,
 	.writepages		= ext2_dax_writepages,
 	.set_page_dirty		= noop_set_page_dirty,
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page
  2018-03-15 15:52 ` [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
@ 2018-03-18  6:26   ` kbuild test robot
  0 siblings, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2018-03-18  6:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, linux-nvdimm, david, linux-kernel,
	linux-xfs, kbuild-all, linux-fsdevel, Christoph Hellwig

Hi Dan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc5 next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/dax-fix-dma-vs-truncate-hole-punch/20180318-050250
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   fs/dax.c: In function 'dax_entry_size':
>> fs/dax.c:308:10: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
      return HPAGE_SIZE;
             ^~~~~~~~~~
             PAGE_SIZE
   fs/dax.c:308:10: note: each undeclared identifier is reported only once for each function it appears in
>> fs/dax.c:311:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

vim +308 fs/dax.c

   300	
   301	static unsigned long dax_entry_size(void *entry)
   302	{
   303		if (dax_is_zero_entry(entry))
   304			return 0;
   305		else if (dax_is_empty_entry(entry))
   306			return 0;
   307		else if (dax_is_pmd_entry(entry))
 > 308			return HPAGE_SIZE;
   309		else
   310			return PAGE_SIZE;
 > 311	}
   312	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-15 15:52 ` [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts() Dan Williams
  2018-03-16 19:04   ` Christoph Hellwig
@ 2018-03-19 17:33   ` Darrick J. Wong
  2018-03-19 17:57     ` Dan Williams
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-03-19 17:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
> In preparation for adding coordination between truncate operations and
> busy dax-pages, extend xfs_break_layouts() to assume it must be called
> with the mmap lock held. This locking scheme will be required for
> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
> pages mapped into the file's address space).

If I'm reading this right, you've added a requirement (for xfs anyway)
that we have to have grabbed MMAPLOCK_EXCL before calling break_layout()
so that the layout breaking process will block until active dmas have
finished.

In 4.16 we added xfs_iolock_two_inodes_and_break_layout (in
xfs_reflink.c) to break pnfs leases for files that are about to be
reflinked (since pnfs and reflink aren't compatible either).  I think
that function will have to be adapted to take the appropriate mmap locks
too -- definitely the exclusive mmap lock for the destination file
because we anticipate punching out blocks.  I'm not sure about the
source file; I think taking the shared mmap lock is fine for that?

--D

> 
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c  |   14 +++++++++-----
>  fs/xfs/xfs_ioctl.c |    5 +----
>  fs/xfs/xfs_iops.c  |   10 +++++++---
>  fs/xfs/xfs_pnfs.c  |    6 ++++--
>  4 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9ea08326f876..ba969019bf26 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -350,9 +350,16 @@ xfs_file_aio_write_checks(
>  	if (error <= 0)
>  		return error;
>  
> +	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +	*iolock |= XFS_MMAPLOCK_EXCL;
>  	error = xfs_break_layouts(inode, iolock);
> -	if (error)
> +	if (error) {
> +		*iolock &= ~XFS_MMAPLOCK_EXCL;
> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  		return error;
> +	}
> +	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +	*iolock &= ~XFS_MMAPLOCK_EXCL;
>  
>  	/*
>  	 * For changing security info in file_remove_privs() we need i_rwsem
> @@ -768,7 +775,7 @@ xfs_file_fallocate(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	long			error;
>  	enum xfs_prealloc_flags	flags = 0;
> -	uint			iolock = XFS_IOLOCK_EXCL;
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	loff_t			new_size = 0;
>  	bool			do_file_insert = false;
>  
> @@ -782,9 +789,6 @@ xfs_file_fallocate(
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> -
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 89fb1eb80aae..4151fade4bb1 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -614,7 +614,7 @@ xfs_ioc_space(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct iattr		iattr;
>  	enum xfs_prealloc_flags	flags = 0;
> -	uint			iolock = XFS_IOLOCK_EXCL;
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	int			error;
>  
>  	/*
> @@ -648,9 +648,6 @@ xfs_ioc_space(
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> -
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
>  		break;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 951e84df5576..d23aa08426f9 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1028,13 +1028,17 @@ xfs_vn_setattr(
>  
>  	if (iattr->ia_valid & ATTR_SIZE) {
>  		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
> -		uint			iolock = XFS_IOLOCK_EXCL;
> +		uint			iolock;
> +
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  
>  		error = xfs_break_layouts(d_inode(dentry), &iolock);
> -		if (error)
> +		if (error) {
> +			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  			return error;
> +		}
>  
> -		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  		error = xfs_vn_setattr_size(dentry, iattr);
>  		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  	} else {
> diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
> index aa6c5c193f45..9fe661c2d59c 100644
> --- a/fs/xfs/xfs_pnfs.c
> +++ b/fs/xfs/xfs_pnfs.c
> @@ -38,12 +38,14 @@ xfs_break_layouts(
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	int			error;
>  
> -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> +				| XFS_MMAPLOCK_EXCL));
>  
>  	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>  		xfs_iunlock(ip, *iolock);
>  		error = break_layout(inode, true);
> -		*iolock = XFS_IOLOCK_EXCL;
> +		*iolock &= ~XFS_IOLOCK_SHARED;
> +		*iolock |= XFS_IOLOCK_EXCL;
>  		xfs_ilock(ip, *iolock);
>  	}
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-15 15:52 ` [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
  2018-03-16 19:08   ` Christoph Hellwig
@ 2018-03-19 17:45   ` Darrick J. Wong
  2018-03-19 18:09     ` Dan Williams
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-03-19 17:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	linux-fsdevel, Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:45AM -0700, Dan Williams wrote:
> When xfs is operating as the back-end of a pNFS block server, it prevents
> collisions between local and remote operations by requiring a lease to
> be held for remotely accessed blocks. Local filesystem operations break
> those leases before writing or mutating the extent map of the file.
> 
> A similar mechanism is needed to prevent operations on pinned dax
> mappings, like device-DMA, from colliding with extent unmap operations.
> 
> BREAK_WRITE and BREAK_TRUNCATE are introduced as two distinct levels of
> layout breaking.
> 
> Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
> do not collide with local writes. Additionally, layouts are broken in
> the BREAK_TRUNCATE case to make sure the layout-holder has a consistent
> view of the file's extent map. While BREAK_WRITE breaks can be satisfied
> be recalling FL_LAYOUT leases, BREAK_TRUNCATE breaks additionally
> require waiting for busy dax-pages to go idle.
> 
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Dave Chinner <david@fromorbit.com>
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c  |   23 +++++++++++++++++------
>  fs/xfs/xfs_inode.h |   18 ++++++++++++++++--
>  fs/xfs/xfs_ioctl.c |    2 +-
>  fs/xfs/xfs_iops.c  |    5 +++--
>  4 files changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5742d395a4e4..399c5221f101 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -352,7 +352,7 @@ xfs_file_aio_write_checks(
>  
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  	*iolock |= XFS_MMAPLOCK_EXCL;
> -	error = xfs_break_layouts(inode, iolock);
> +	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
>  	if (error) {
>  		*iolock &= ~XFS_MMAPLOCK_EXCL;
>  		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> @@ -762,7 +762,8 @@ xfs_file_write_iter(
>  int
>  xfs_break_layouts(
>  	struct inode		*inode,
> -	uint			*iolock)
> +	uint			*iolock,
> +	enum layout_break_reason reason)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	int			ret;
> @@ -770,9 +771,19 @@ xfs_break_layouts(
>  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
>  				| XFS_MMAPLOCK_EXCL));
>  
> -	ret = xfs_break_leased_layouts(inode, iolock);
> -	if (ret > 0)
> -		ret = 0;
> +	switch (reason) {
> +	case BREAK_TRUNCATE:
> +		/* fall through */
> +	case BREAK_WRITE:
> +		ret = xfs_break_leased_layouts(inode, iolock);
> +		if (ret > 0)
> +			ret = 0;
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
>  	return ret;
>  }
>  
> @@ -802,7 +813,7 @@ xfs_file_fallocate(
>  		return -EOPNOTSUPP;
>  
>  	xfs_ilock(ip, iolock);
> -	error = xfs_break_layouts(inode, &iolock);
> +	error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
>  	if (error)
>  		goto out_unlock;
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 74c63f3a720f..1a66c7afcf45 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>  					>> XFS_ILOCK_SHIFT)
>  
>  /*
> + * Layouts are broken in the BREAK_WRITE case to ensure that
> + * layout-holders do not collide with local writes. Additionally,
> + * layouts are broken in the BREAK_TRUNCATE case to make sure the
> + * layout-holder has a consistent view of the file's extent map. While
> + * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
> + * BREAK_TRUNCATE breaks additionally require waiting for busy dax-pages
> + * to go idle.
> + */
> +enum layout_break_reason {
> +        BREAK_WRITE,

I wonder if this belongs in a system header file since the same general
principle is going to apply to ext4 and others?  If this really is a
special xfs thing, then please use the "XFS_" prefix.

> +        BREAK_TRUNCATE,

The "truncate" part of the name misled me for a bit.  That operation is
really about "make sure there are no busy dax pages because we're about
to change some file pmem^Wblock mappings", right?  Truncation, hole
punching, reflinking, and zero_range (because it punches the range and
reallocates unwritten extents) all have to wait for busy dax pages to
become free before they can begin unmapping blocks.  So this isn't just
about truncation specifically; can I suggest "BREAK_UNMAPI" ?

(I also haven't figured out whether the same requirements apply to
things that *add* extent maps to the file, but I have to run to a
meeting in a few so wanted to get this email sent.)

--D

> +};
> +
> +/*
>   * For multiple groups support: if S_ISGID bit is set in the parent
>   * directory, group of new file is set to that of the parent, and
>   * new subdirectory gets S_ISGID bit from parent.
> @@ -447,8 +461,8 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
>  		     xfs_fsize_t isize, bool *did_zeroing);
>  int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
>  		bool *did_zero);
> -int	xfs_break_layouts(struct inode *inode, uint *iolock);
> -
> +int	xfs_break_layouts(struct inode *inode, uint *iolock,
> +		enum layout_break_reason reason);
>  
>  /* from xfs_iops.c */
>  extern void xfs_setup_inode(struct xfs_inode *ip);
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d70a1919e787..847a67186d95 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -643,7 +643,7 @@ xfs_ioc_space(
>  		return error;
>  
>  	xfs_ilock(ip, iolock);
> -	error = xfs_break_layouts(inode, &iolock);
> +	error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
>  	if (error)
>  		goto out_unlock;
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 78eb56d447df..f9fcadb5b555 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1026,13 +1026,14 @@ xfs_vn_setattr(
>  	int			error;
>  
>  	if (iattr->ia_valid & ATTR_SIZE) {
> -		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
> +		struct inode		*inode = d_inode(dentry);
> +		struct xfs_inode	*ip = XFS_I(inode);
>  		uint			iolock;
>  
>  		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  
> -		error = xfs_break_layouts(d_inode(dentry), &iolock);
> +		error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
>  		if (error) {
>  			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  			return error;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 17:33   ` Darrick J. Wong
@ 2018-03-19 17:57     ` Dan Williams
  2018-03-19 18:19       ` Darrick J. Wong
  2018-03-19 19:45       ` Christoph Hellwig
  0 siblings, 2 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-19 17:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Mon, Mar 19, 2018 at 10:33 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
>> In preparation for adding coordination between truncate operations and
>> busy dax-pages, extend xfs_break_layouts() to assume it must be called
>> with the mmap lock held. This locking scheme will be required for
>> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
>> pages mapped into the file's address space).
>
> If I'm reading this right, you've added a requirement (for xfs anyway)
> that we have to have grabbed MMAPLOCK_EXCL before calling break_layout()
> so that the layout breaking process will block until active dmas have
> finished.
>
> In 4.16 we added xfs_iolock_two_inodes_and_break_layout (in
> xfs_reflink.c) to break pnfs leases for files that are about to be
> reflinked (since pnfs and reflink aren't compatible either).  I think
> that function will have to be adapted to take the appropriate mmap locks
> too -- definitely the exclusive mmap lock for the destination file
> because we anticipate punching out blocks.  I'm not sure about the
> source file; I think taking the shared mmap lock is fine for that?

I don't see anything to adapt with respect to mmap locks since reflink
and dax are mutually exclusive. The new lock coordination and layout
breaking in xfs_break_dax_layouts() only applies to dax vs truncate,
so xfs_iolock_two_inodes_and_break_layout() looks ok to me as is.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type
  2018-03-19 17:45   ` Darrick J. Wong
@ 2018-03-19 18:09     ` Dan Williams
  0 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-19 18:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Mon, Mar 19, 2018 at 10:45 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Mar 15, 2018 at 08:52:45AM -0700, Dan Williams wrote:
>> When xfs is operating as the back-end of a pNFS block server, it prevents
>> collisions between local and remote operations by requiring a lease to
>> be held for remotely accessed blocks. Local filesystem operations break
>> those leases before writing or mutating the extent map of the file.
>>
>> A similar mechanism is needed to prevent operations on pinned dax
>> mappings, like device-DMA, from colliding with extent unmap operations.
>>
>> BREAK_WRITE and BREAK_TRUNCATE are introduced as two distinct levels of
>> layout breaking.
>>
>> Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
>> do not collide with local writes. Additionally, layouts are broken in
>> the BREAK_TRUNCATE case to make sure the layout-holder has a consistent
>> view of the file's extent map. While BREAK_WRITE breaks can be satisfied
>> be recalling FL_LAYOUT leases, BREAK_TRUNCATE breaks additionally
>> require waiting for busy dax-pages to go idle.
>>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reported-by: Dave Chinner <david@fromorbit.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/xfs_file.c  |   23 +++++++++++++++++------
>>  fs/xfs/xfs_inode.h |   18 ++++++++++++++++--
>>  fs/xfs/xfs_ioctl.c |    2 +-
>>  fs/xfs/xfs_iops.c  |    5 +++--
>>  4 files changed, 37 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 5742d395a4e4..399c5221f101 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -352,7 +352,7 @@ xfs_file_aio_write_checks(
>>
>>       xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>>       *iolock |= XFS_MMAPLOCK_EXCL;
>> -     error = xfs_break_layouts(inode, iolock);
>> +     error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
>>       if (error) {
>>               *iolock &= ~XFS_MMAPLOCK_EXCL;
>>               xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>> @@ -762,7 +762,8 @@ xfs_file_write_iter(
>>  int
>>  xfs_break_layouts(
>>       struct inode            *inode,
>> -     uint                    *iolock)
>> +     uint                    *iolock,
>> +     enum layout_break_reason reason)
>>  {
>>       struct xfs_inode        *ip = XFS_I(inode);
>>       int                     ret;
>> @@ -770,9 +771,19 @@ xfs_break_layouts(
>>       ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
>>                               | XFS_MMAPLOCK_EXCL));
>>
>> -     ret = xfs_break_leased_layouts(inode, iolock);
>> -     if (ret > 0)
>> -             ret = 0;
>> +     switch (reason) {
>> +     case BREAK_TRUNCATE:
>> +             /* fall through */
>> +     case BREAK_WRITE:
>> +             ret = xfs_break_leased_layouts(inode, iolock);
>> +             if (ret > 0)
>> +                     ret = 0;
>> +             break;
>> +     default:
>> +             ret = -EINVAL;
>> +             break;
>> +     }
>> +
>>       return ret;
>>  }
>>
>> @@ -802,7 +813,7 @@ xfs_file_fallocate(
>>               return -EOPNOTSUPP;
>>
>>       xfs_ilock(ip, iolock);
>> -     error = xfs_break_layouts(inode, &iolock);
>> +     error = xfs_break_layouts(inode, &iolock, BREAK_TRUNCATE);
>>       if (error)
>>               goto out_unlock;
>>
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 74c63f3a720f..1a66c7afcf45 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>>                                       >> XFS_ILOCK_SHIFT)
>>
>>  /*
>> + * Layouts are broken in the BREAK_WRITE case to ensure that
>> + * layout-holders do not collide with local writes. Additionally,
>> + * layouts are broken in the BREAK_TRUNCATE case to make sure the
>> + * layout-holder has a consistent view of the file's extent map. While
>> + * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
>> + * BREAK_TRUNCATE breaks additionally require waiting for busy dax-pages
>> + * to go idle.
>> + */
>> +enum layout_break_reason {
>> +        BREAK_WRITE,
>
> I wonder if this belongs in a system header file since the same general
> principle is going to apply to ext4 and others?  If this really is a
> special xfs thing, then please use the "XFS_" prefix.
>
>> +        BREAK_TRUNCATE,
>
> The "truncate" part of the name misled me for a bit.  That operation is
> really about "make sure there are no busy dax pages because we're about
> to change some file pmem^Wblock mappings", right?  Truncation, hole
> punching, reflinking, and zero_range (because it punches the range and
> reallocates unwritten extents) all have to wait for busy dax pages to
> become free before they can begin unmapping blocks.  So this isn't just
> about truncation specifically; can I suggest "BREAK_UNMAPI" ?

I like that name better.

It isn't clear that this needs to go to a system header because I
expect ext4 will only support BREAK_UNMAPI for dax and not care about
BREAK_WRITE since that is an XFS-only / pNFS-only distinction in
current code. However, I'll let Jan chime in if this guess is
incorrect and ext4 plans to add pNFS block-server support.

> (I also haven't figured out whether the same requirements apply to
> things that *add* extent maps to the file, but I have to run to a
> meeting in a few so wanted to get this email sent.)

Add should not be a problem because DMA only ever starts after extents
are added, i.e. there's no risk of DMA starting to invalid address
because fs/dax.c arranges for extents to be allocated at fault time.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 17:57     ` Dan Williams
@ 2018-03-19 18:19       ` Darrick J. Wong
  2018-03-19 18:34         ` Dan Williams
  2018-03-19 19:45       ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2018-03-19 18:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Mon, Mar 19, 2018 at 10:57:55AM -0700, Dan Williams wrote:
> On Mon, Mar 19, 2018 at 10:33 AM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
> >> In preparation for adding coordination between truncate operations and
> >> busy dax-pages, extend xfs_break_layouts() to assume it must be called
> >> with the mmap lock held. This locking scheme will be required for
> >> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
> >> pages mapped into the file's address space).
> >
> > If I'm reading this right, you've added a requirement (for xfs anyway)
> > that we have to have grabbed MMAPLOCK_EXCL before calling break_layout()
> > so that the layout breaking process will block until active dmas have
> > finished.
> >
> > In 4.16 we added xfs_iolock_two_inodes_and_break_layout (in
> > xfs_reflink.c) to break pnfs leases for files that are about to be
> > reflinked (since pnfs and reflink aren't compatible either).  I think
> > that function will have to be adapted to take the appropriate mmap locks
> > too -- definitely the exclusive mmap lock for the destination file
> > because we anticipate punching out blocks.  I'm not sure about the
> > source file; I think taking the shared mmap lock is fine for that?
> 
> I don't see anything to adapt with respect to mmap locks since reflink
> and dax are mutually exclusive. The new lock coordination and layout
> breaking in xfs_break_dax_layouts() only applies to dax vs truncate,
> so xfs_iolock_two_inodes_and_break_layout() looks ok to me as is.

I was aiming for DAX + reflink someday working together.  :)

As far as that goes, I think the only thing we really have to do is
figure out how to make iomap_begin do the cow operation before letting
userspace write to (or MAP_SYNC a writable region) pmem, right?

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 18:19       ` Darrick J. Wong
@ 2018-03-19 18:34         ` Dan Williams
  0 siblings, 0 replies; 40+ messages in thread
From: Dan Williams @ 2018-03-19 18:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Christoph Hellwig

On Mon, Mar 19, 2018 at 11:19 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Mar 19, 2018 at 10:57:55AM -0700, Dan Williams wrote:
>> On Mon, Mar 19, 2018 at 10:33 AM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
>> >> In preparation for adding coordination between truncate operations and
>> >> busy dax-pages, extend xfs_break_layouts() to assume it must be called
>> >> with the mmap lock held. This locking scheme will be required for
>> >> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
>> >> pages mapped into the file's address space).
>> >
>> > If I'm reading this right, you've added a requirement (for xfs anyway)
>> > that we have to have grabbed MMAPLOCK_EXCL before calling break_layout()
>> > so that the layout breaking process will block until active dmas have
>> > finished.
>> >
>> > In 4.16 we added xfs_iolock_two_inodes_and_break_layout (in
>> > xfs_reflink.c) to break pnfs leases for files that are about to be
>> > reflinked (since pnfs and reflink aren't compatible either).  I think
>> > that function will have to be adapted to take the appropriate mmap locks
>> > too -- definitely the exclusive mmap lock for the destination file
>> > because we anticipate punching out blocks.  I'm not sure about the
>> > source file; I think taking the shared mmap lock is fine for that?
>>
>> I don't see anything to adapt with respect to mmap locks since reflink
>> and dax are mutually exclusive. The new lock coordination and layout
>> breaking in xfs_break_dax_layouts() only applies to dax vs truncate,
>> so xfs_iolock_two_inodes_and_break_layout() looks ok to me as is.
>
> I was aiming for DAX + reflink someday working together.  :)

Of course, but then fixing up xfs_iolock_two_inodes_and_break_layout()
is a follow on consideration from this patch series.

> As far as that goes, I think the only thing we really have to do is
> figure out how to make iomap_begin do the cow operation before letting
> userspace write to (or MAP_SYNC a writable region) pmem, right?

Yes, I think so, trigger xfs_break_dax_layouts() on the destination
inode on any write fault.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 17:57     ` Dan Williams
  2018-03-19 18:19       ` Darrick J. Wong
@ 2018-03-19 19:45       ` Christoph Hellwig
  2018-03-19 20:10         ` Dan Williams
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-19 19:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Christoph Hellwig

On Mon, Mar 19, 2018 at 10:57:55AM -0700, Dan Williams wrote:
> I don't see anything to adapt with respect to mmap locks since reflink
> and dax are mutually exclusive.

For now.  I'll change that pretty soon.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 19:45       ` Christoph Hellwig
@ 2018-03-19 20:10         ` Dan Williams
  2018-03-19 21:14           ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Dan Williams @ 2018-03-19 20:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel

On Mon, Mar 19, 2018 at 12:45 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Mar 19, 2018 at 10:57:55AM -0700, Dan Williams wrote:
>> I don't see anything to adapt with respect to mmap locks since reflink
>> and dax are mutually exclusive.
>
> For now.  I'll change that pretty soon.

Right, so which patch set will be staged first? This one or the one
that causes us to consider reflink vs dax locking?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()
  2018-03-19 20:10         ` Dan Williams
@ 2018-03-19 21:14           ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2018-03-19 21:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Dave Chinner,
	Linux Kernel Mailing List, linux-xfs, linux-fsdevel,
	Christoph Hellwig

On Mon, Mar 19, 2018 at 01:10:52PM -0700, Dan Williams wrote:
> On Mon, Mar 19, 2018 at 12:45 PM, Christoph Hellwig <hch@lst.de> wrote:
> > On Mon, Mar 19, 2018 at 10:57:55AM -0700, Dan Williams wrote:
> >> I don't see anything to adapt with respect to mmap locks since reflink
> >> and dax are mutually exclusive.
> >
> > For now.  I'll change that pretty soon.
> 
> Right, so which patch set will be staged first? This one or the one
> that causes us to consider reflink vs dax locking?

Given that I haven't even posted the reflink+DAX support (not fixed
all issues with it) I'm pretty sure you'll be faster :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-03-19 21:07 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-15 15:51 [PATCH v6 00/15] dax: fix dma vs truncate/hole-punch Dan Williams
2018-03-15 15:51 ` [PATCH v6 01/15] dax: store pfns in the radix Dan Williams
2018-03-15 15:51 ` [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations Dan Williams
2018-03-16 18:59   ` Christoph Hellwig
2018-03-15 15:51 ` [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages() Dan Williams
2018-03-16 18:59   ` Christoph Hellwig
2018-03-15 15:51 ` [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops Dan Williams
2018-03-16 19:00   ` Christoph Hellwig
2018-03-15 15:51 ` [PATCH v6 05/15] ext4, dax: introduce ext4_dax_aops Dan Williams
2018-03-15 15:51 ` [PATCH v6 06/15] ext2, dax: introduce ext2_dax_aops Dan Williams
2018-03-18  4:02   ` kbuild test robot
2018-03-18  4:02   ` [RFC PATCH] ext2, dax: ext2_dax_aops can be static kbuild test robot
2018-03-15 15:52 ` [PATCH v6 07/15] fs, dax: use page->mapping to warn if truncate collides with a busy page Dan Williams
2018-03-18  6:26   ` kbuild test robot
2018-03-15 15:52 ` [PATCH v6 08/15] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
2018-03-15 15:52 ` [PATCH v6 09/15] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-03-15 15:52 ` [PATCH v6 10/15] memremap: mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
2018-03-15 15:52 ` [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
2018-03-16 19:01   ` Christoph Hellwig
2018-03-17 22:14   ` kbuild test robot
2018-03-15 15:52 ` [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts() Dan Williams
2018-03-16 19:04   ` Christoph Hellwig
2018-03-16 19:10     ` Dan Williams
2018-03-19 17:33   ` Darrick J. Wong
2018-03-19 17:57     ` Dan Williams
2018-03-19 18:19       ` Darrick J. Wong
2018-03-19 18:34         ` Dan Williams
2018-03-19 19:45       ` Christoph Hellwig
2018-03-19 20:10         ` Dan Williams
2018-03-19 21:14           ` Christoph Hellwig
2018-03-15 15:52 ` [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts() Dan Williams
2018-03-16 19:08   ` Christoph Hellwig
2018-03-15 15:52 ` [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
2018-03-16 19:08   ` Christoph Hellwig
2018-03-19 17:45   ` Darrick J. Wong
2018-03-19 18:09     ` Dan Williams
2018-03-15 15:52 ` [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
2018-03-16 19:09   ` Christoph Hellwig
2018-03-17 22:11   ` kbuild test robot
2018-03-17 23:47   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).