linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
@ 2018-04-24 23:33 Dan Williams
  2018-04-24 23:33 ` [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure Dan Williams
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Dave Chinner, Darrick J. Wong, Jeff Moyer,
	Alexander Viro, Heiko Carstens, Jan Kara, Mike Snitzer,
	Alasdair Kergon, kbuild test robot, Christoph Hellwig,
	Jérôme Glisse, Andrew Morton, Matthew Wilcox,
	Michal Hocko, Ross Zwisler, stable, Thomas Meyer,
	Martin Schwidefsky, linux-fsdevel, linux-xfs, linux-mm, jack,
	hch

Changes since v8 [1]:
* Rebase on v4.17-rc2

* Fix get_user_pages_fast() for ZONE_DEVICE pages to revalidate the pte,
  pmd, pud after taking references (Jan)

* Kill dax_layout_lock(). With get_user_pages_fast() for ZONE_DEVICE
  fixed we can then rely on the {pte,pmd}_lock to synchronize
  dax_layout_busy_page() vs new page references (Jan)

* Hold the iolock over repeated invocations of dax_layout_busy_page() to
  enable truncate/hole-punch to make forward progress in the presence of
  a constant stream of new direct-I/O requests (Jan).

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/015058.html

---

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

---

Dan Williams (9):
      dax, dm: introduce ->fs_{claim,release}() dax_device infrastructure
      mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
      memremap: split devm_memremap_pages() and memremap() infrastructure
      mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
      mm: fix __gup_device_huge vs unmap
      mm, fs, dax: handle layout changes to pinned dax mappings
      xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
      xfs: prepare xfs_break_layouts() for another layout type
      xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/super.c      |   99 ++++++++++++++++++++--
 drivers/md/dm.c          |   57 +++++++++++++
 drivers/nvdimm/pmem.c    |    3 -
 fs/Kconfig               |    2 
 fs/dax.c                 |   97 +++++++++++++++++++++
 fs/ext2/super.c          |    6 +
 fs/ext4/super.c          |    6 +
 fs/xfs/xfs_file.c        |   72 +++++++++++++++-
 fs/xfs/xfs_inode.h       |   16 ++++
 fs/xfs/xfs_ioctl.c       |    8 --
 fs/xfs/xfs_iops.c        |   16 ++--
 fs/xfs/xfs_pnfs.c        |   16 ++--
 fs/xfs/xfs_pnfs.h        |    6 +
 fs/xfs/xfs_super.c       |   20 ++--
 include/linux/dax.h      |   71 +++++++++++++++-
 include/linux/memremap.h |   25 ++----
 include/linux/mm.h       |   71 ++++++++++++----
 kernel/Makefile          |    3 -
 kernel/iomem.c           |  167 +++++++++++++++++++++++++++++++++++++
 kernel/memremap.c        |  208 ++++++----------------------------------------
 mm/Kconfig               |    5 +
 mm/gup.c                 |   37 ++++++--
 mm/hmm.c                 |   13 ---
 mm/swap.c                |    3 -
 24 files changed, 730 insertions(+), 297 deletions(-)
 create mode 100644 kernel/iomem.c

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-09 10:37   ` Jan Kara
  2018-04-24 23:33 ` [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Alasdair Kergon, Matthew Wilcox, Ross Zwisler,
	Jérôme Glisse, Christoph Hellwig, Jan Kara,
	Mike Snitzer, david, linux-fsdevel, linux-xfs, linux-mm, jack,
	hch

In preparation for allowing filesystems to augment the dev_pagemap
associated with a dax_device, add an ->fs_claim() callback. The
->fs_claim() callback is leveraged by the device-mapper dax
implementation to iterate all member devices in the map and repeat the
claim operation across the array.

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.
The ->fs_claim() operation arranges for this callback to be registered,
although that implementation is saved for a later patch.

Cc: Alasdair Kergon <agk@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |   76 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm.c          |   57 +++++++++++++++++++++++++++++++++++
 include/linux/dax.h      |   48 +++++++++++++++++++++++++++++
 include/linux/memremap.h |    8 +++++
 4 files changed, 189 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..e62a64b9c9fb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -169,9 +170,84 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+	/* TODO: wakeup page-idle waiters */
+}
+
+static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,
+		void *owner)
+{
+	struct dev_pagemap *pgmap;
+
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if (pgmap->data && pgmap->data == owner) {
+		/* dm might try to claim the same device more than once... */
+		mutex_unlock(&devmap_lock);
+		return dax_dev;
+	} else if (pgmap->page_free || pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
+{
+	if (dax_dev->ops->fs_claim)
+		return dax_dev->ops->fs_claim(dax_dev, owner);
+	else
+		return __fs_dax_claim(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim);
+
+static void __fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev->pgmap;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	if (!dax_dev)
+		return;
+	if (dax_dev->ops->fs_release)
+		dax_dev->ops->fs_release(dax_dev, owner);
+	else
+		__fs_dax_release(dax_dev, owner);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4ea404dbcf0b..468aff8d9694 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1088,6 +1088,61 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 	return ret;
 }
 
+static int dm_dax_dev_claim(struct dm_target *ti, struct dm_dev *dev,
+		sector_t start, sector_t len, void *owner)
+{
+	if (fs_dax_claim(dev->dax_dev, owner))
+		return 0;
+	/*
+	 * Outside of a kernel bug there is no reason a dax_dev should
+	 * fail a claim attempt. Device-mapper should have exclusive
+	 * ownership of the dm_dev and the filesystem should have
+	 * exclusive ownership of the dm_target.
+	 */
+	WARN_ON_ONCE(1);
+	return -ENXIO;
+}
+
+static int dm_dax_dev_release(struct dm_target *ti, struct dm_dev *dev,
+		sector_t start, sector_t len, void *owner)
+{
+	fs_dax_release(dev->dax_dev, owner);
+	return 0;
+}
+
+static void dm_dax_iterate_devices(struct dax_device *dax_dev,
+		iterate_devices_callout_fn fn, void *arg)
+{
+	struct mapped_device *md = dax_get_private(dax_dev);
+	struct dm_table *map;
+	struct dm_target *ti;
+	int i, srcu_idx;
+
+	map = dm_get_live_table(md, &srcu_idx);
+
+	for (i = 0; i < dm_table_get_num_targets(map); i++) {
+		ti = dm_table_get_target(map, i);
+
+		if (ti->type->iterate_devices)
+			ti->type->iterate_devices(ti, fn, arg);
+	}
+
+	dm_put_live_table(md, srcu_idx);
+}
+
+static struct dax_device *dm_dax_fs_claim(struct dax_device *dax_dev,
+		void *owner)
+{
+	dm_dax_iterate_devices(dax_dev, dm_dax_dev_claim, owner);
+	/* see comment in dm_dax_dev_claim about this unconditional return */
+	return dax_dev;
+}
+
+static void dm_dax_fs_release(struct dax_device *dax_dev, void *owner)
+{
+	dm_dax_iterate_devices(dax_dev, dm_dax_dev_release, owner);
+}
+
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
@@ -3133,6 +3188,8 @@ static const struct block_device_operations dm_blk_dops = {
 static const struct dax_operations dm_dax_ops = {
 	.direct_access = dm_dax_direct_access,
 	.copy_from_iter = dm_dax_copy_from_iter,
+	.fs_claim = dm_dax_fs_claim,
+	.fs_release = dm_dax_fs_release,
 };
 
 /*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..af02f93c943a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,6 +4,7 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/blkdev.h>
 #include <linux/radix-tree.h>
 #include <asm/pgtable.h>
 
@@ -20,6 +21,10 @@ struct dax_operations {
 	/* copy_from_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
+	/* fs_claim: setup filesytem parameters for the device's dev_pagemap */
+	struct dax_device *(*fs_claim)(struct dax_device *, void *);
+	/* fs_release: restore device's dev_pagemap to its default state */
+	void (*fs_release)(struct dax_device *, void *);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -83,6 +88,26 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
+#ifdef CONFIG_BLOCK
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
+{
+	struct dax_device *dax_dev;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	return fs_dax_claim(dax_dev, owner);
+}
+#else
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
+{
+	return NULL;
+}
+#endif
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -108,6 +133,29 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
+		void *owner)
+{
+	return NULL;
+}
+
+#ifdef CONFIG_BLOCK
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
+{
+	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
+}
+#else
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
+{
+	return NULL;
+}
+#endif
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+}
 #endif
 
 int dax_read_lock(void);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-04-24 23:33 ` [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-16  7:20   ` Dan Williams
  2018-04-24 23:33 ` [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure Dan Williams
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jérôme Glisse, Christoph Hellwig,
	Jan Kara, david, linux-fsdevel, linux-xfs, linux-mm, jack, hch

In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
are going to a device driver to manage bounce pages in device-memory in
the filesystem-dax case we will call back to filesystem specified
callback.

Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.

Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c   |   21 +++++++++++----------
 drivers/nvdimm/pmem.c |    3 ++-
 fs/ext2/super.c       |    6 +++---
 fs/ext4/super.c       |    6 +++---
 fs/xfs/xfs_super.c    |   20 ++++++++++----------
 include/linux/dax.h   |   14 ++++++++------
 6 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e62a64b9c9fb..e4864f319e16 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -575,6 +565,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..fc1a1ab25e9e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+			&pmem->pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index de1694512f1f..421c7d4bed39 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 185f7e61f4cf..3e5d0f9e8772 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -954,7 +954,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3407,7 +3407,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4429,7 +4429,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d71424052917..f53f8a47a526 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -724,7 +724,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -732,10 +732,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -753,9 +753,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -765,7 +765,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -779,7 +779,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -813,14 +813,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index af02f93c943a..fe322d67856e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -33,6 +33,8 @@ extern struct attribute_group dax_attribute_group;
 struct dax_device *dax_get_by_host(const char *host);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
@@ -51,6 +53,12 @@ static inline struct dax_device *alloc_dax(void *private, const char *host,
 	 */
 	return NULL;
 }
+static inline struct dax_device *alloc_dax_devmap(void *private,
+		const char *host, const struct dax_operations *ops,
+		struct dev_pagemap *pgmap)
+{
+	return NULL;
+}
 static inline void put_dax(struct dax_device *dax_dev)
 {
 }
@@ -85,7 +93,6 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 	put_dax(dax_dev);
 }
 
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
@@ -123,11 +130,6 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 {
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	return NULL;
-}
-
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc)
 {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
  2018-04-24 23:33 ` [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure Dan Williams
  2018-04-24 23:33 ` [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-09 10:29   ` Jan Kara
  2018-04-24 23:33 ` [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Christoph Hellwig, Jérôme Glisse,
	Ross Zwisler, david, linux-fsdevel, linux-xfs, linux-mm, jack,
	hch

Currently, kernel/memremap.c contains generic code for supporting
memremap() (CONFIG_HAS_IOMEM) and devm_memremap_pages()
(CONFIG_ZONE_DEVICE). This causes ongoing build maintenance problems as
additions to memremap.c, especially for the ZONE_DEVICE case, need to be
careful about being placed in ifdef guards. Remove the need for these
ifdef guards by moving the ZONE_DEVICE support functions to their own
compilation unit.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/Makefile   |    3 +
 kernel/iomem.c    |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c |  178 +----------------------------------------------------
 3 files changed, 171 insertions(+), 177 deletions(-)
 create mode 100644 kernel/iomem.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f85ae5dfa474..9b9241361311 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -112,7 +112,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
-obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_HAS_IOMEM) += iomem.o
+obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/iomem.c b/kernel/iomem.c
new file mode 100644
index 000000000000..f7525e14ebc6
--- /dev/null
+++ b/kernel/iomem.c
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/device.h>
+#include <linux/types.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+
+#ifndef ioremap_cache
+/* temporary while we convert existing ioremap_cache users to memremap */
+__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
+{
+	return ioremap(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_wb
+static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
+{
+	return (__force void *)ioremap_cache(offset, size);
+}
+#endif
+
+#ifndef arch_memremap_can_ram_remap
+static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
+					unsigned long flags)
+{
+	return true;
+}
+#endif
+
+static void *try_ram_remap(resource_size_t offset, size_t size,
+			   unsigned long flags)
+{
+	unsigned long pfn = PHYS_PFN(offset);
+
+	/* In the simple case just return the existing linear address */
+	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
+	    arch_memremap_can_ram_remap(offset, size, flags))
+		return __va(offset);
+
+	return NULL; /* fallback to arch_memremap_wb */
+}
+
+/**
+ * memremap() - remap an iomem_resource as cacheable memory
+ * @offset: iomem resource start address
+ * @size: size of remap
+ * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
+ *		  MEMREMAP_ENC, MEMREMAP_DEC
+ *
+ * memremap() is "ioremap" for cases where it is known that the resource
+ * being mapped does not have i/o side effects and the __iomem
+ * annotation is not applicable. In the case of multiple flags, the different
+ * mapping types will be attempted in the order listed below until one of
+ * them succeeds.
+ *
+ * MEMREMAP_WB - matches the default mapping for System RAM on
+ * the architecture.  This is usually a read-allocate write-back cache.
+ * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
+ * memremap() will bypass establishing a new mapping and instead return
+ * a pointer into the direct map.
+ *
+ * MEMREMAP_WT - establish a mapping whereby writes either bypass the
+ * cache or are written through to memory and never exist in a
+ * cache-dirty state with respect to program visibility.  Attempts to
+ * map System RAM with this mapping type will fail.
+ *
+ * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
+ * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
+ * uncached. Attempts to map System RAM with this mapping type will fail.
+ */
+void *memremap(resource_size_t offset, size_t size, unsigned long flags)
+{
+	int is_ram = region_intersects(offset, size,
+				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
+	void *addr = NULL;
+
+	if (!flags)
+		return NULL;
+
+	if (is_ram == REGION_MIXED) {
+		WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
+				&offset, (unsigned long) size);
+		return NULL;
+	}
+
+	/* Try all mapping types requested until one returns non-NULL */
+	if (flags & MEMREMAP_WB) {
+		/*
+		 * MEMREMAP_WB is special in that it can be satisifed
+		 * from the direct map.  Some archs depend on the
+		 * capability of memremap() to autodetect cases where
+		 * the requested range is potentially in System RAM.
+		 */
+		if (is_ram == REGION_INTERSECTS)
+			addr = try_ram_remap(offset, size, flags);
+		if (!addr)
+			addr = arch_memremap_wb(offset, size);
+	}
+
+	/*
+	 * If we don't have a mapping yet and other request flags are
+	 * present then we will be attempting to establish a new virtual
+	 * address mapping.  Enforce that this mapping is not aliasing
+	 * System RAM.
+	 */
+	if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
+		WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
+				&offset, (unsigned long) size);
+		return NULL;
+	}
+
+	if (!addr && (flags & MEMREMAP_WT))
+		addr = ioremap_wt(offset, size);
+
+	if (!addr && (flags & MEMREMAP_WC))
+		addr = ioremap_wc(offset, size);
+
+	return addr;
+}
+EXPORT_SYMBOL(memremap);
+
+void memunmap(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		iounmap((void __iomem *) addr);
+}
+EXPORT_SYMBOL(memunmap);
+
+static void devm_memremap_release(struct device *dev, void *res)
+{
+	memunmap(*(void **)res);
+}
+
+static int devm_memremap_match(struct device *dev, void *res, void *match_data)
+{
+	return *(void **)res == match_data;
+}
+
+void *devm_memremap(struct device *dev, resource_size_t offset,
+		size_t size, unsigned long flags)
+{
+	void **ptr, *addr;
+
+	ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
+			dev_to_node(dev));
+	if (!ptr)
+		return ERR_PTR(-ENOMEM);
+
+	addr = memremap(offset, size, flags);
+	if (addr) {
+		*ptr = addr;
+		devres_add(dev, ptr);
+	} else {
+		devres_free(ptr);
+		return ERR_PTR(-ENXIO);
+	}
+
+	return addr;
+}
+EXPORT_SYMBOL(devm_memremap);
+
+void devm_memunmap(struct device *dev, void *addr)
+{
+	WARN_ON(devres_release(dev, devm_memremap_release,
+				devm_memremap_match, addr));
+}
+EXPORT_SYMBOL(devm_memunmap);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 895e6b76b25e..37a9604133f6 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -1,15 +1,5 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- */
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2015 Intel Corporation. All rights reserved. */
 #include <linux/radix-tree.h>
 #include <linux/device.h>
 #include <linux/types.h>
@@ -20,169 +10,6 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
-#ifndef ioremap_cache
-/* temporary while we convert existing ioremap_cache users to memremap */
-__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
-{
-	return ioremap(offset, size);
-}
-#endif
-
-#ifndef arch_memremap_wb
-static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
-{
-	return (__force void *)ioremap_cache(offset, size);
-}
-#endif
-
-#ifndef arch_memremap_can_ram_remap
-static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
-					unsigned long flags)
-{
-	return true;
-}
-#endif
-
-static void *try_ram_remap(resource_size_t offset, size_t size,
-			   unsigned long flags)
-{
-	unsigned long pfn = PHYS_PFN(offset);
-
-	/* In the simple case just return the existing linear address */
-	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
-	    arch_memremap_can_ram_remap(offset, size, flags))
-		return __va(offset);
-
-	return NULL; /* fallback to arch_memremap_wb */
-}
-
-/**
- * memremap() - remap an iomem_resource as cacheable memory
- * @offset: iomem resource start address
- * @size: size of remap
- * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
- *		  MEMREMAP_ENC, MEMREMAP_DEC
- *
- * memremap() is "ioremap" for cases where it is known that the resource
- * being mapped does not have i/o side effects and the __iomem
- * annotation is not applicable. In the case of multiple flags, the different
- * mapping types will be attempted in the order listed below until one of
- * them succeeds.
- *
- * MEMREMAP_WB - matches the default mapping for System RAM on
- * the architecture.  This is usually a read-allocate write-back cache.
- * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
- * memremap() will bypass establishing a new mapping and instead return
- * a pointer into the direct map.
- *
- * MEMREMAP_WT - establish a mapping whereby writes either bypass the
- * cache or are written through to memory and never exist in a
- * cache-dirty state with respect to program visibility.  Attempts to
- * map System RAM with this mapping type will fail.
- *
- * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
- * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
- * uncached. Attempts to map System RAM with this mapping type will fail.
- */
-void *memremap(resource_size_t offset, size_t size, unsigned long flags)
-{
-	int is_ram = region_intersects(offset, size,
-				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
-	void *addr = NULL;
-
-	if (!flags)
-		return NULL;
-
-	if (is_ram == REGION_MIXED) {
-		WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
-				&offset, (unsigned long) size);
-		return NULL;
-	}
-
-	/* Try all mapping types requested until one returns non-NULL */
-	if (flags & MEMREMAP_WB) {
-		/*
-		 * MEMREMAP_WB is special in that it can be satisifed
-		 * from the direct map.  Some archs depend on the
-		 * capability of memremap() to autodetect cases where
-		 * the requested range is potentially in System RAM.
-		 */
-		if (is_ram == REGION_INTERSECTS)
-			addr = try_ram_remap(offset, size, flags);
-		if (!addr)
-			addr = arch_memremap_wb(offset, size);
-	}
-
-	/*
-	 * If we don't have a mapping yet and other request flags are
-	 * present then we will be attempting to establish a new virtual
-	 * address mapping.  Enforce that this mapping is not aliasing
-	 * System RAM.
-	 */
-	if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
-		WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
-				&offset, (unsigned long) size);
-		return NULL;
-	}
-
-	if (!addr && (flags & MEMREMAP_WT))
-		addr = ioremap_wt(offset, size);
-
-	if (!addr && (flags & MEMREMAP_WC))
-		addr = ioremap_wc(offset, size);
-
-	return addr;
-}
-EXPORT_SYMBOL(memremap);
-
-void memunmap(void *addr)
-{
-	if (is_vmalloc_addr(addr))
-		iounmap((void __iomem *) addr);
-}
-EXPORT_SYMBOL(memunmap);
-
-static void devm_memremap_release(struct device *dev, void *res)
-{
-	memunmap(*(void **)res);
-}
-
-static int devm_memremap_match(struct device *dev, void *res, void *match_data)
-{
-	return *(void **)res == match_data;
-}
-
-void *devm_memremap(struct device *dev, resource_size_t offset,
-		size_t size, unsigned long flags)
-{
-	void **ptr, *addr;
-
-	ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
-			dev_to_node(dev));
-	if (!ptr)
-		return ERR_PTR(-ENOMEM);
-
-	addr = memremap(offset, size, flags);
-	if (addr) {
-		*ptr = addr;
-		devres_add(dev, ptr);
-	} else {
-		devres_free(ptr);
-		return ERR_PTR(-ENXIO);
-	}
-
-	return addr;
-}
-EXPORT_SYMBOL(devm_memremap);
-
-void devm_memunmap(struct device *dev, void *addr)
-{
-	WARN_ON(devres_release(dev, devm_memremap_release,
-				devm_memremap_match, addr));
-}
-EXPORT_SYMBOL(devm_memunmap);
-
-#ifdef CONFIG_ZONE_DEVICE
 static DEFINE_MUTEX(pgmap_lock);
 static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
@@ -473,7 +300,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 	return pgmap;
 }
-#endif /* CONFIG_ZONE_DEVICE */
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 void put_zone_device_private_or_public_page(struct page *page)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (2 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-04-24 23:33 ` [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap Dan Williams
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Martin Schwidefsky, Heiko Carstens, Michal Hocko,
	kbuild test robot, Thomas Meyer, Christoph Hellwig,
	Jérôme Glisse, Jan Kara, david, linux-fsdevel,
	linux-xfs, linux-mm, jack, hch

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    4 ++-
 fs/Kconfig               |    1 +
 include/linux/dax.h      |   48 +++++++++++++++++--------------
 include/linux/memremap.h |   17 -----------
 include/linux/mm.h       |   71 ++++++++++++++++++++++++++++++++++------------
 kernel/memremap.c        |   30 +++++++++++++++++--
 mm/Kconfig               |    5 +++
 mm/hmm.c                 |   13 +-------
 mm/swap.c                |    3 +-
 9 files changed, 118 insertions(+), 74 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e4864f319e16..86b3806ea35b 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -164,7 +164,7 @@ struct dax_device {
 	const struct dax_operations *ops;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
 	/* TODO: wakeup page-idle waiters */
@@ -191,6 +191,7 @@ static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,
 		return NULL;
 	}
 
+	dev_pagemap_get_ops();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -223,6 +224,7 @@ static void __fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	dev_pagemap_put_ops();
 	mutex_unlock(&devmap_lock);
 }
 
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1e050e012eb9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fe322d67856e..f1d6a8366e4b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,6 +4,7 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/genhd.h>
 #include <linux/blkdev.h>
 #include <linux/radix-tree.h>
 #include <asm/pgtable.h>
@@ -95,26 +96,6 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
-struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
-#ifdef CONFIG_BLOCK
-static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
-		void *owner)
-{
-	struct dax_device *dax_dev;
-
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
-	return fs_dax_claim(dax_dev, owner);
-}
-#else
-static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
-		void *owner)
-{
-	return NULL;
-}
-#endif
-void fs_dax_release(struct dax_device *dax_dev, void *owner);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -135,13 +116,35 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+#endif
 
-static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
+#if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
+struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
+#ifdef CONFIG_BLOCK
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
+{
+	struct dax_device *dax_dev;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	return fs_dax_claim(dax_dev, owner);
+}
+#else
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
 		void *owner)
 {
 	return NULL;
 }
-
+#endif
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
+#else
+static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
+		void *owner)
+{
+	return dax_dev;
+}
 #ifdef CONFIG_BLOCK
 static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
 		void *owner)
@@ -157,6 +160,7 @@ static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
 #endif
 static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
+	fs_put_dax(dax_dev);
 }
 #endif
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..8cc619fe347b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -137,8 +136,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -169,20 +166,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..6e19265ee8f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,27 +821,65 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
+}
+
+static inline bool is_device_private_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
-#define IS_HMM_ENABLED 0
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline void dev_pagemap_get_ops(void)
+{
+}
+
+static inline void dev_pagemap_put_ops(void)
+{
+}
+
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	return false;
+}
+
 static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -859,16 +897,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 37a9604133f6..faceb3359c23 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -301,8 +301,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	return pgmap;
 }
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -322,5 +344,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index d5004d82a1d6..bf9d6366bced 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -692,6 +692,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -712,6 +715,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -722,6 +726,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 486dc394a5a3..de7b6bf77201 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -1167,7 +1158,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1261,7 +1252,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 3dd518832096..26fc9b5f1b6c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -743,7 +744,7 @@ void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (3 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-09 10:46   ` Jan Kara
  2018-04-24 23:33 ` [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: stable, Jan Kara, david, hch, linux-fsdevel, linux-xfs, linux-mm,
	jack, hch

get_user_pages_fast() for device pages is missing the typical validation
that all page references have been taken while the mapping was valid.
Without this validation truncate operations can not reliably coordinate
against new page reference events like O_DIRECT.

Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/gup.c |   36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 76af4cfeaf68..84dd2063ca3d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1456,32 +1456,48 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 	return 1;
 }
 
-static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
+	int nr_start = *nr;
+
+	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+		return 0;
 
-	fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		undo_dev_pagemap(nr, nr_start, pages);
+		return 0;
+	}
+	return 1;
 }
 
-static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
+static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
+	int nr_start = *nr;
+
+	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+		return 0;
 
-	fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		undo_dev_pagemap(nr, nr_start, pages);
+		return 0;
+	}
+	return 1;
 }
 #else
-static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
+static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
-static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
+static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
 	BUILD_BUG();
@@ -1499,7 +1515,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	if (pmd_devmap(orig))
-		return __gup_device_huge_pmd(orig, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 
 	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@ -1537,7 +1553,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 
 	if (pud_devmap(orig))
-		return __gup_device_huge_pud(orig, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 
 	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (4 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-09 10:56   ` Jan Kara
  2018-04-24 23:33 ` [PATCH v9 7/9] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Jeff Moyer, Dave Chinner, Matthew Wilcox,
	Alexander Viro, Darrick J. Wong, Ross Zwisler, Dave Hansen,
	Andrew Morton, Christoph Hellwig, Christoph Hellwig,
	linux-fsdevel, linux-xfs, linux-mm, jack, hch

Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesytem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient, if that assumption is violated the system would
have likely hung from the uncompleted I/O.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    2 +
 fs/Kconfig          |    1 +
 fs/dax.c            |   97 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |    7 ++++
 mm/gup.c            |    1 +
 5 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 86b3806ea35b..89f21bd9da10 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_var(&page->_refcount);
 }
 
 static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,
diff --git a/fs/Kconfig b/fs/Kconfig
index 1e050e012eb9..c9acbf695ddd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -40,6 +40,7 @@ config FS_DAX
 	depends on !(ARM || MIPS || SPARC)
 	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
 	select FS_IOMAP
+	select SRCU
 	select DAX
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..e8f61ea690f7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -351,6 +351,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+	unsigned long pfn;
+
+	for_each_mapped_pfn(entry, pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -492,6 +505,90 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	pgoff_t	indices[PAGEVEC_SIZE];
+	struct page *page = NULL;
+	struct pagevec pvec;
+	pgoff_t	index, end;
+	unsigned i;
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+		return NULL;
+
+	pagevec_init(&pvec);
+	index = 0;
+	end = -1;
+
+	/*
+	 * If we race get_user_pages_fast() here either we'll see the
+	 * elevated page count in the pagevec_lookup and wait, or
+	 * get_user_pages_fast() will see that the page it took a reference
+	 * against is no longer mapped in the page tables and bail to the
+	 * get_user_pages() slow path.  The slow path is protected by
+	 * pte_lock() and pmd_lock(). New references are not taken without
+	 * holding those locks, and unmap_mapping_range() will not zero the
+	 * pte or pmd without holding the respective lock, so we are
+	 * guaranteed to either see new references or prevent new
+	 * references from being established.
+	 */
+	unmap_mapping_range(mapping, 0, 0, 1);
+
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			xa_lock_irq(&mapping->i_pages);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dax_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			xa_unlock_irq(&mapping->i_pages);
+			if (page)
+				break;
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+
+		if (page)
+			break;
+	}
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f1d6a8366e4b..3369aeb180e8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -96,6 +96,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+
+struct page *dax_layout_busy_page(struct address_space *mapping);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -116,6 +118,11 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+	return NULL;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
diff --git a/mm/gup.c b/mm/gup.c
index 84dd2063ca3d..75ade7ebddb2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 7/9] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (5 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-04-24 23:33 ` [PATCH v9 8/9] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Darrick J. Wong, Ross Zwisler, Dave Chinner, Christoph Hellwig,
	Christoph Hellwig, Darrick J. Wong, linux-fsdevel, linux-xfs,
	linux-mm, jack, hch

In preparation for adding coordination between extent unmap operations
and busy dax-pages, update xfs_break_layouts() to permit it to be called
with the mmap lock held. This lock scheme will be required for
coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
pages mapped into the file's address space). Breaking dax layouts will
be added to xfs_break_layouts() in a future patch, for now this preps
the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
xfs_break_layouts().

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |    5 +----
 fs/xfs/xfs_ioctl.c |    5 +----
 fs/xfs/xfs_iops.c  |   10 +++++++---
 fs/xfs/xfs_pnfs.c  |    3 ++-
 4 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 299aee4b7b0b..35309bd046be 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -734,7 +734,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = false;
 
@@ -748,9 +748,6 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..4151fade4bb1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -614,7 +614,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -648,9 +648,6 @@ xfs_ioc_space(
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a3ed3c811dfa..138fb36ca875 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1031,13 +1031,17 @@ xfs_vn_setattr(
 
 	if (iattr->ia_valid & ATTR_SIZE) {
 		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
-		uint			iolock = XFS_IOLOCK_EXCL;
+		uint			iolock;
+
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
 		error = xfs_break_layouts(d_inode(dentry), &iolock);
-		if (error)
+		if (error) {
+			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;
+		}
 
-		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		error = xfs_vn_setattr_size(dentry, iattr);
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index aa6c5c193f45..6ea7b0b55d02 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -43,7 +43,8 @@ xfs_break_layouts(
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
 		error = break_layout(inode, true);
-		*iolock = XFS_IOLOCK_EXCL;
+		*iolock &= ~XFS_IOLOCK_SHARED;
+		*iolock |= XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 8/9] xfs: prepare xfs_break_layouts() for another layout type
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (6 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 7/9] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-04-24 23:33 ` [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  2018-05-03 23:53 ` [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
  9 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Ross Zwisler, Darrick J. Wong, Dave Chinner, Christoph Hellwig,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm, jack, hch

When xfs is operating as the back-end of a pNFS block server, it
prevents collisions between local and remote operations by requiring a
lease to be held for remotely accessed blocks. Local filesystem
operations break those leases before writing or mutating the extent map
of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
layout breaking.

Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
do not collide with local writes. Additionally, layouts are broken in
the BREAK_UNMAP case to make sure the layout-holder has a consistent
view of the file's extent map. While BREAK_WRITE breaks can be satisfied
be recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.

After this refactoring xfs_break_layouts() becomes the entry point for
coordinating both types of breaks. Finally, xfs_break_leased_layouts()
becomes just the BREAK_WRITE handler.

Note that the unlock tracking is needed in a follow on change. That will
coordinate retrying either break handler until both successfully test
for a lease break while maintaining the lock state.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   30 ++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.h |   16 ++++++++++++++++
 fs/xfs/xfs_ioctl.c |    3 +--
 fs/xfs/xfs_iops.c  |    6 +++---
 fs/xfs/xfs_pnfs.c  |   13 +++++++------
 fs/xfs/xfs_pnfs.h  |    6 ++++--
 6 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 35309bd046be..1a5176b21803 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -312,7 +312,7 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
-	error = xfs_break_layouts(inode, iolock);
+	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
 	if (error)
 		return error;
 
@@ -718,6 +718,32 @@ xfs_file_write_iter(
 	return ret;
 }
 
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock,
+	enum layout_break_reason reason)
+{
+	bool			retry = false;
+	int			error = 0;
+
+	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+	switch (reason) {
+	case BREAK_UNMAP:
+		ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+		/* fall through */
+	case BREAK_WRITE:
+		error = xfs_break_leased_layouts(inode, iolock, &retry);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
+	return error;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
@@ -744,7 +770,7 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1eebc53df7d7..31401cc9d0e6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,20 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 					>> XFS_ILOCK_SHIFT)
 
 /*
+ * Layouts are broken in the BREAK_WRITE case to ensure that
+ * layout-holders do not collide with local writes. Additionally,
+ * layouts are broken in the BREAK_UNMAP case to make sure the
+ * layout-holder has a consistent view of the file's extent map. While
+ * BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases,
+ * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
+ * go idle.
+ */
+enum layout_break_reason {
+        BREAK_WRITE,
+        BREAK_UNMAP,
+};
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -443,6 +457,8 @@ enum xfs_prealloc_flags {
 
 int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 				  enum xfs_prealloc_flags flags);
+int	xfs_break_layouts(struct inode *inode, uint *iolock,
+		enum layout_break_reason reason);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4151fade4bb1..91e73d663099 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,6 @@
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
-#include "xfs_pnfs.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
@@ -644,7 +643,7 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 138fb36ca875..ce0c1f9466a8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 
 #include <linux/capability.h>
@@ -1030,13 +1029,14 @@ xfs_vn_setattr(
 	int			error;
 
 	if (iattr->ia_valid & ATTR_SIZE) {
-		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
+		struct inode		*inode = d_inode(dentry);
+		struct xfs_inode	*ip = XFS_I(inode);
 		uint			iolock;
 
 		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-		error = xfs_break_layouts(d_inode(dentry), &iolock);
+		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 		if (error) {
 			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 			return error;
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 6ea7b0b55d02..40e69edb7e2e 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -31,17 +31,18 @@
  * rules in the page fault path we don't bother.
  */
 int
-xfs_break_layouts(
+xfs_break_leased_layouts(
 	struct inode		*inode,
-	uint			*iolock)
+	uint			*iolock,
+	bool			*did_unlock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error;
 
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
+	*did_unlock = false;
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		*did_unlock = true;
 		error = break_layout(inode, true);
 		*iolock &= ~XFS_IOLOCK_SHARED;
 		*iolock |= XFS_IOLOCK_EXCL;
@@ -121,8 +122,8 @@ xfs_fs_map_blocks(
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more
-	 * complicated.  See the comment near xfs_break_layouts for a detailed
-	 * explanation.
+	 * complicated.  See the comment near xfs_break_leased_layouts
+	 * for a detailed explanation.
 	 */
 	xfs_ilock(ip, XFS_IOLOCK_EXCL);
 
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index bf45951e28fe..0f2f51037064 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -9,11 +9,13 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
 
-int xfs_break_layouts(struct inode *inode, uint *iolock);
+int xfs_break_leased_layouts(struct inode *inode, uint *iolock,
+		bool *did_unlock);
 #else
 static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
+xfs_break_leased_layouts(struct inode *inode, uint *iolock, bool *did_unlock)
 {
+	*did_unlock = false;
 	return 0;
 }
 #endif /* CONFIG_EXPORTFS_BLOCK_OPS */

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts()
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (7 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 8/9] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
@ 2018-04-24 23:33 ` Dan Williams
  2018-05-09 12:27   ` Jan Kara
  2018-05-09 14:38   ` Darrick J. Wong
  2018-05-03 23:53 ` [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
  9 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2018-04-24 23:33 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Chinner, Darrick J. Wong, Ross Zwisler, Jan Kara,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm, jack, hch

xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1, see generic_dax_pagefree()).

While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
deadlock the process that might be trying to elevate the page count of
more pages before arranging for any of them to go idle. I.e. the typical
case of submitting I/O is that iov_iter_get_pages() elevates the
reference count of all pages in the I/O before starting I/O on the first
page. The process of elevating the reference count of all pages involved
in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.

Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
held while sleeping. We need this to prevent starvation of the truncate
path as continuous submission of direct-I/O could starve the truncate
path indefinitely if the lock is dropped.

Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c |   59 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1a5176b21803..4e98d0dcc035 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -718,6 +718,37 @@ xfs_file_write_iter(
 	return ret;
 }
 
+static void
+xfs_wait_dax_page(
+	struct inode		*inode,
+	bool			*did_unlock)
+{
+	struct xfs_inode        *ip = XFS_I(inode);
+
+	*did_unlock = true;
+	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+	schedule();
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+}
+
+static int
+xfs_break_dax_layouts(
+	struct inode		*inode,
+	uint			iolock,
+	bool			*did_unlock)
+{
+	struct page		*page;
+
+	*did_unlock = false;
+	page = dax_layout_busy_page(inode->i_mapping);
+	if (!page)
+		return 0;
+
+	return ___wait_var_event(&page->_refcount,
+			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+			0, 0, xfs_wait_dax_page(inode, did_unlock));
+}
+
 int
 xfs_break_layouts(
 	struct inode		*inode,
@@ -729,17 +760,23 @@ xfs_break_layouts(
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
-	switch (reason) {
-	case BREAK_UNMAP:
-		ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
-		/* fall through */
-	case BREAK_WRITE:
-		error = xfs_break_leased_layouts(inode, iolock, &retry);
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		return -EINVAL;
-	}
+	do {
+		switch (reason) {
+		case BREAK_UNMAP:
+			ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
+
+			error = xfs_break_dax_layouts(inode, *iolock, &retry);
+			/* fall through */
+		case BREAK_WRITE:
+			if (error || retry)
+				break;
+			error = xfs_break_leased_layouts(inode, iolock, &retry);
+			break;
+		default:
+			WARN_ON_ONCE(1);
+			return -EINVAL;
+		}
+	} while (error == 0 && retry);
 
 	return error;
 }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
  2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (8 preceding siblings ...)
  2018-04-24 23:33 ` [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
@ 2018-05-03 23:53 ` Dan Williams
  2018-05-08  0:16   ` Darrick J. Wong
  9 siblings, 1 reply; 21+ messages in thread
From: Dan Williams @ 2018-05-03 23:53 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jan Kara, Mike Snitzer, Dave Hansen, Dave Chinner,
	Linux MM, Christoph Hellwig, Thomas Meyer, kbuild test robot,
	Darrick J. Wong, Alasdair Kergon, Heiko Carstens,
	Martin Schwidefsky, Jérôme Glisse, Alexander Viro,
	Matthew Wilcox, stable, linux-xfs, linux-fsdevel, Andrew Morton

On Tue, Apr 24, 2018 at 4:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Changes since v8 [1]:
> * Rebase on v4.17-rc2
>
> * Fix get_user_pages_fast() for ZONE_DEVICE pages to revalidate the pte,
>   pmd, pud after taking references (Jan)
>
> * Kill dax_layout_lock(). With get_user_pages_fast() for ZONE_DEVICE
>   fixed we can then rely on the {pte,pmd}_lock to synchronize
>   dax_layout_busy_page() vs new page references (Jan)
>
> * Hold the iolock over repeated invocations of dax_layout_busy_page() to
>   enable truncate/hole-punch to make forward progress in the presence of
>   a constant stream of new direct-I/O requests (Jan).
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/015058.html

I'll push this for soak time in -next if there are no further comments...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
  2018-05-03 23:53 ` [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2018-05-08  0:16   ` Darrick J. Wong
  0 siblings, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2018-05-08  0:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, Jan Kara, Mike Snitzer, Dave Hansen,
	Dave Chinner, Linux MM, Christoph Hellwig, Thomas Meyer,
	kbuild test robot, Alasdair Kergon, Heiko Carstens,
	Martin Schwidefsky, Jérôme Glisse, Alexander Viro,
	Matthew Wilcox, stable, linux-xfs, linux-fsdevel, Andrew Morton

On Thu, May 03, 2018 at 04:53:18PM -0700, Dan Williams wrote:
> On Tue, Apr 24, 2018 at 4:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > Changes since v8 [1]:
> > * Rebase on v4.17-rc2
> >
> > * Fix get_user_pages_fast() for ZONE_DEVICE pages to revalidate the pte,
> >   pmd, pud after taking references (Jan)
> >
> > * Kill dax_layout_lock(). With get_user_pages_fast() for ZONE_DEVICE
> >   fixed we can then rely on the {pte,pmd}_lock to synchronize
> >   dax_layout_busy_page() vs new page references (Jan)
> >
> > * Hold the iolock over repeated invocations of dax_layout_busy_page() to
> >   enable truncate/hole-punch to make forward progress in the presence of
> >   a constant stream of new direct-I/O requests (Jan).
> >
> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/015058.html
> 
> I'll push this for soak time in -next if there are no further comments...

I don't have any. :D

--D

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure
  2018-04-24 23:33 ` [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure Dan Williams
@ 2018-05-09 10:29   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-05-09 10:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Christoph Hellwig,
	Jérôme Glisse, Ross Zwisler, david, linux-fsdevel,
	linux-xfs, linux-mm

On Tue 24-04-18 16:33:19, Dan Williams wrote:
> Currently, kernel/memremap.c contains generic code for supporting
> memremap() (CONFIG_HAS_IOMEM) and devm_memremap_pages()
> (CONFIG_ZONE_DEVICE). This causes ongoing build maintenance problems as
> additions to memremap.c, especially for the ZONE_DEVICE case, need to be
> careful about being placed in ifdef guards. Remove the need for these
> ifdef guards by moving the ZONE_DEVICE support functions to their own
> compilation unit.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "J�r�me Glisse" <jglisse@redhat.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Good idea. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  kernel/Makefile   |    3 +
>  kernel/iomem.c    |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c |  178 +----------------------------------------------------
>  3 files changed, 171 insertions(+), 177 deletions(-)
>  create mode 100644 kernel/iomem.c
> 
> diff --git a/kernel/Makefile b/kernel/Makefile
> index f85ae5dfa474..9b9241361311 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -112,7 +112,8 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
>  
> -obj-$(CONFIG_HAS_IOMEM) += memremap.o
> +obj-$(CONFIG_HAS_IOMEM) += iomem.o
> +obj-$(CONFIG_ZONE_DEVICE) += memremap.o
>  
>  $(obj)/configs.o: $(obj)/config_data.h
>  
> diff --git a/kernel/iomem.c b/kernel/iomem.c
> new file mode 100644
> index 000000000000..f7525e14ebc6
> --- /dev/null
> +++ b/kernel/iomem.c
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/device.h>
> +#include <linux/types.h>
> +#include <linux/io.h>
> +#include <linux/mm.h>
> +
> +#ifndef ioremap_cache
> +/* temporary while we convert existing ioremap_cache users to memremap */
> +__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
> +{
> +	return ioremap(offset, size);
> +}
> +#endif
> +
> +#ifndef arch_memremap_wb
> +static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
> +{
> +	return (__force void *)ioremap_cache(offset, size);
> +}
> +#endif
> +
> +#ifndef arch_memremap_can_ram_remap
> +static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
> +					unsigned long flags)
> +{
> +	return true;
> +}
> +#endif
> +
> +static void *try_ram_remap(resource_size_t offset, size_t size,
> +			   unsigned long flags)
> +{
> +	unsigned long pfn = PHYS_PFN(offset);
> +
> +	/* In the simple case just return the existing linear address */
> +	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
> +	    arch_memremap_can_ram_remap(offset, size, flags))
> +		return __va(offset);
> +
> +	return NULL; /* fallback to arch_memremap_wb */
> +}
> +
> +/**
> + * memremap() - remap an iomem_resource as cacheable memory
> + * @offset: iomem resource start address
> + * @size: size of remap
> + * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
> + *		  MEMREMAP_ENC, MEMREMAP_DEC
> + *
> + * memremap() is "ioremap" for cases where it is known that the resource
> + * being mapped does not have i/o side effects and the __iomem
> + * annotation is not applicable. In the case of multiple flags, the different
> + * mapping types will be attempted in the order listed below until one of
> + * them succeeds.
> + *
> + * MEMREMAP_WB - matches the default mapping for System RAM on
> + * the architecture.  This is usually a read-allocate write-back cache.
> + * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
> + * memremap() will bypass establishing a new mapping and instead return
> + * a pointer into the direct map.
> + *
> + * MEMREMAP_WT - establish a mapping whereby writes either bypass the
> + * cache or are written through to memory and never exist in a
> + * cache-dirty state with respect to program visibility.  Attempts to
> + * map System RAM with this mapping type will fail.
> + *
> + * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
> + * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
> + * uncached. Attempts to map System RAM with this mapping type will fail.
> + */
> +void *memremap(resource_size_t offset, size_t size, unsigned long flags)
> +{
> +	int is_ram = region_intersects(offset, size,
> +				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
> +	void *addr = NULL;
> +
> +	if (!flags)
> +		return NULL;
> +
> +	if (is_ram == REGION_MIXED) {
> +		WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
> +				&offset, (unsigned long) size);
> +		return NULL;
> +	}
> +
> +	/* Try all mapping types requested until one returns non-NULL */
> +	if (flags & MEMREMAP_WB) {
> +		/*
> +		 * MEMREMAP_WB is special in that it can be satisifed
> +		 * from the direct map.  Some archs depend on the
> +		 * capability of memremap() to autodetect cases where
> +		 * the requested range is potentially in System RAM.
> +		 */
> +		if (is_ram == REGION_INTERSECTS)
> +			addr = try_ram_remap(offset, size, flags);
> +		if (!addr)
> +			addr = arch_memremap_wb(offset, size);
> +	}
> +
> +	/*
> +	 * If we don't have a mapping yet and other request flags are
> +	 * present then we will be attempting to establish a new virtual
> +	 * address mapping.  Enforce that this mapping is not aliasing
> +	 * System RAM.
> +	 */
> +	if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
> +		WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
> +				&offset, (unsigned long) size);
> +		return NULL;
> +	}
> +
> +	if (!addr && (flags & MEMREMAP_WT))
> +		addr = ioremap_wt(offset, size);
> +
> +	if (!addr && (flags & MEMREMAP_WC))
> +		addr = ioremap_wc(offset, size);
> +
> +	return addr;
> +}
> +EXPORT_SYMBOL(memremap);
> +
> +void memunmap(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		iounmap((void __iomem *) addr);
> +}
> +EXPORT_SYMBOL(memunmap);
> +
> +static void devm_memremap_release(struct device *dev, void *res)
> +{
> +	memunmap(*(void **)res);
> +}
> +
> +static int devm_memremap_match(struct device *dev, void *res, void *match_data)
> +{
> +	return *(void **)res == match_data;
> +}
> +
> +void *devm_memremap(struct device *dev, resource_size_t offset,
> +		size_t size, unsigned long flags)
> +{
> +	void **ptr, *addr;
> +
> +	ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
> +			dev_to_node(dev));
> +	if (!ptr)
> +		return ERR_PTR(-ENOMEM);
> +
> +	addr = memremap(offset, size, flags);
> +	if (addr) {
> +		*ptr = addr;
> +		devres_add(dev, ptr);
> +	} else {
> +		devres_free(ptr);
> +		return ERR_PTR(-ENXIO);
> +	}
> +
> +	return addr;
> +}
> +EXPORT_SYMBOL(devm_memremap);
> +
> +void devm_memunmap(struct device *dev, void *addr)
> +{
> +	WARN_ON(devres_release(dev, devm_memremap_release,
> +				devm_memremap_match, addr));
> +}
> +EXPORT_SYMBOL(devm_memunmap);
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 895e6b76b25e..37a9604133f6 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -1,15 +1,5 @@
> -/*
> - * Copyright(c) 2015 Intel Corporation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of version 2 of the GNU General Public License as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License for more details.
> - */
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2015 Intel Corporation. All rights reserved. */
>  #include <linux/radix-tree.h>
>  #include <linux/device.h>
>  #include <linux/types.h>
> @@ -20,169 +10,6 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  
> -#ifndef ioremap_cache
> -/* temporary while we convert existing ioremap_cache users to memremap */
> -__weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)
> -{
> -	return ioremap(offset, size);
> -}
> -#endif
> -
> -#ifndef arch_memremap_wb
> -static void *arch_memremap_wb(resource_size_t offset, unsigned long size)
> -{
> -	return (__force void *)ioremap_cache(offset, size);
> -}
> -#endif
> -
> -#ifndef arch_memremap_can_ram_remap
> -static bool arch_memremap_can_ram_remap(resource_size_t offset, size_t size,
> -					unsigned long flags)
> -{
> -	return true;
> -}
> -#endif
> -
> -static void *try_ram_remap(resource_size_t offset, size_t size,
> -			   unsigned long flags)
> -{
> -	unsigned long pfn = PHYS_PFN(offset);
> -
> -	/* In the simple case just return the existing linear address */
> -	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
> -	    arch_memremap_can_ram_remap(offset, size, flags))
> -		return __va(offset);
> -
> -	return NULL; /* fallback to arch_memremap_wb */
> -}
> -
> -/**
> - * memremap() - remap an iomem_resource as cacheable memory
> - * @offset: iomem resource start address
> - * @size: size of remap
> - * @flags: any of MEMREMAP_WB, MEMREMAP_WT, MEMREMAP_WC,
> - *		  MEMREMAP_ENC, MEMREMAP_DEC
> - *
> - * memremap() is "ioremap" for cases where it is known that the resource
> - * being mapped does not have i/o side effects and the __iomem
> - * annotation is not applicable. In the case of multiple flags, the different
> - * mapping types will be attempted in the order listed below until one of
> - * them succeeds.
> - *
> - * MEMREMAP_WB - matches the default mapping for System RAM on
> - * the architecture.  This is usually a read-allocate write-back cache.
> - * Morever, if MEMREMAP_WB is specified and the requested remap region is RAM
> - * memremap() will bypass establishing a new mapping and instead return
> - * a pointer into the direct map.
> - *
> - * MEMREMAP_WT - establish a mapping whereby writes either bypass the
> - * cache or are written through to memory and never exist in a
> - * cache-dirty state with respect to program visibility.  Attempts to
> - * map System RAM with this mapping type will fail.
> - *
> - * MEMREMAP_WC - establish a writecombine mapping, whereby writes may
> - * be coalesced together (e.g. in the CPU's write buffers), but is otherwise
> - * uncached. Attempts to map System RAM with this mapping type will fail.
> - */
> -void *memremap(resource_size_t offset, size_t size, unsigned long flags)
> -{
> -	int is_ram = region_intersects(offset, size,
> -				       IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
> -	void *addr = NULL;
> -
> -	if (!flags)
> -		return NULL;
> -
> -	if (is_ram == REGION_MIXED) {
> -		WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
> -				&offset, (unsigned long) size);
> -		return NULL;
> -	}
> -
> -	/* Try all mapping types requested until one returns non-NULL */
> -	if (flags & MEMREMAP_WB) {
> -		/*
> -		 * MEMREMAP_WB is special in that it can be satisifed
> -		 * from the direct map.  Some archs depend on the
> -		 * capability of memremap() to autodetect cases where
> -		 * the requested range is potentially in System RAM.
> -		 */
> -		if (is_ram == REGION_INTERSECTS)
> -			addr = try_ram_remap(offset, size, flags);
> -		if (!addr)
> -			addr = arch_memremap_wb(offset, size);
> -	}
> -
> -	/*
> -	 * If we don't have a mapping yet and other request flags are
> -	 * present then we will be attempting to establish a new virtual
> -	 * address mapping.  Enforce that this mapping is not aliasing
> -	 * System RAM.
> -	 */
> -	if (!addr && is_ram == REGION_INTERSECTS && flags != MEMREMAP_WB) {
> -		WARN_ONCE(1, "memremap attempted on ram %pa size: %#lx\n",
> -				&offset, (unsigned long) size);
> -		return NULL;
> -	}
> -
> -	if (!addr && (flags & MEMREMAP_WT))
> -		addr = ioremap_wt(offset, size);
> -
> -	if (!addr && (flags & MEMREMAP_WC))
> -		addr = ioremap_wc(offset, size);
> -
> -	return addr;
> -}
> -EXPORT_SYMBOL(memremap);
> -
> -void memunmap(void *addr)
> -{
> -	if (is_vmalloc_addr(addr))
> -		iounmap((void __iomem *) addr);
> -}
> -EXPORT_SYMBOL(memunmap);
> -
> -static void devm_memremap_release(struct device *dev, void *res)
> -{
> -	memunmap(*(void **)res);
> -}
> -
> -static int devm_memremap_match(struct device *dev, void *res, void *match_data)
> -{
> -	return *(void **)res == match_data;
> -}
> -
> -void *devm_memremap(struct device *dev, resource_size_t offset,
> -		size_t size, unsigned long flags)
> -{
> -	void **ptr, *addr;
> -
> -	ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
> -			dev_to_node(dev));
> -	if (!ptr)
> -		return ERR_PTR(-ENOMEM);
> -
> -	addr = memremap(offset, size, flags);
> -	if (addr) {
> -		*ptr = addr;
> -		devres_add(dev, ptr);
> -	} else {
> -		devres_free(ptr);
> -		return ERR_PTR(-ENXIO);
> -	}
> -
> -	return addr;
> -}
> -EXPORT_SYMBOL(devm_memremap);
> -
> -void devm_memunmap(struct device *dev, void *addr)
> -{
> -	WARN_ON(devres_release(dev, devm_memremap_release,
> -				devm_memremap_match, addr));
> -}
> -EXPORT_SYMBOL(devm_memunmap);
> -
> -#ifdef CONFIG_ZONE_DEVICE
>  static DEFINE_MUTEX(pgmap_lock);
>  static RADIX_TREE(pgmap_radix, GFP_KERNEL);
>  #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
> @@ -473,7 +300,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  
>  	return pgmap;
>  }
> -#endif /* CONFIG_ZONE_DEVICE */
>  
>  #if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
>  void put_zone_device_private_or_public_page(struct page *page)
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure
  2018-04-24 23:33 ` [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure Dan Williams
@ 2018-05-09 10:37   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-05-09 10:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Alasdair Kergon, Matthew Wilcox, Ross Zwisler,
	Jérôme Glisse, Christoph Hellwig, Jan Kara,
	Mike Snitzer, david, linux-fsdevel, linux-xfs, linux-mm

On Tue 24-04-18 16:33:07, Dan Williams wrote:
> In preparation for allowing filesystems to augment the dev_pagemap
> associated with a dax_device, add an ->fs_claim() callback. The
> ->fs_claim() callback is leveraged by the device-mapper dax
> implementation to iterate all member devices in the map and repeat the
> claim operation across the array.
> 
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
> The ->fs_claim() operation arranges for this callback to be registered,
> although that implementation is saved for a later patch.
> 
> Cc: Alasdair Kergon <agk@redhat.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: "J�r�me Glisse" <jglisse@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jan Kara <jack@suse.cz>
> Acked-by: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c      |   76 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/dm.c          |   57 +++++++++++++++++++++++++++++++++++
>  include/linux/dax.h      |   48 +++++++++++++++++++++++++++++
>  include/linux/memremap.h |    8 +++++
>  4 files changed, 189 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 2b2332b605e4..e62a64b9c9fb 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
>  static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
>  static struct super_block *dax_superblock __read_mostly;
> +static DEFINE_MUTEX(devmap_lock);
>  
>  #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
>  static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> @@ -169,9 +170,84 @@ struct dax_device {
>  	const char *host;
>  	void *private;
>  	unsigned long flags;
> +	struct dev_pagemap *pgmap;
>  	const struct dax_operations *ops;
>  };
>  
> +#if IS_ENABLED(CONFIG_FS_DAX)
> +static void generic_dax_pagefree(struct page *page, void *data)
> +{
> +	/* TODO: wakeup page-idle waiters */
> +}
> +
> +static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,
> +		void *owner)
> +{
> +	struct dev_pagemap *pgmap;
> +
> +	if (!dax_dev->pgmap)
> +		return dax_dev;
> +	pgmap = dax_dev->pgmap;
> +
> +	mutex_lock(&devmap_lock);
> +	if (pgmap->data && pgmap->data == owner) {
> +		/* dm might try to claim the same device more than once... */
> +		mutex_unlock(&devmap_lock);
> +		return dax_dev;
> +	} else if (pgmap->page_free || pgmap->page_fault
> +			|| pgmap->type != MEMORY_DEVICE_HOST) {
> +		put_dax(dax_dev);
> +		mutex_unlock(&devmap_lock);
> +		return NULL;
> +	}
> +
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = generic_dax_pagefree;
> +	pgmap->data = owner;
> +	mutex_unlock(&devmap_lock);
> +
> +	return dax_dev;
> +}
> +
> +struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner)
> +{
> +	if (dax_dev->ops->fs_claim)
> +		return dax_dev->ops->fs_claim(dax_dev, owner);
> +	else
> +		return __fs_dax_claim(dax_dev, owner);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_claim);
> +
> +static void __fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +	struct dev_pagemap *pgmap = dax_dev->pgmap;
> +
> +	put_dax(dax_dev);
> +	if (!pgmap)
> +		return;
> +	if (!pgmap->data)
> +		return;
> +
> +	mutex_lock(&devmap_lock);
> +	WARN_ON(pgmap->data != owner);
> +	pgmap->type = MEMORY_DEVICE_HOST;
> +	pgmap->page_free = NULL;
> +	pgmap->data = NULL;
> +	mutex_unlock(&devmap_lock);
> +}
> +
> +void fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +	if (!dax_dev)
> +		return;
> +	if (dax_dev->ops->fs_release)
> +		dax_dev->ops->fs_release(dax_dev, owner);
> +	else
> +		__fs_dax_release(dax_dev, owner);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_release);
> +#endif
> +
>  static ssize_t write_cache_show(struct device *dev,
>  		struct device_attribute *attr, char *buf)
>  {
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 4ea404dbcf0b..468aff8d9694 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1088,6 +1088,61 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  	return ret;
>  }
>  
> +static int dm_dax_dev_claim(struct dm_target *ti, struct dm_dev *dev,
> +		sector_t start, sector_t len, void *owner)
> +{
> +	if (fs_dax_claim(dev->dax_dev, owner))
> +		return 0;
> +	/*
> +	 * Outside of a kernel bug there is no reason a dax_dev should
> +	 * fail a claim attempt. Device-mapper should have exclusive
> +	 * ownership of the dm_dev and the filesystem should have
> +	 * exclusive ownership of the dm_target.
> +	 */
> +	WARN_ON_ONCE(1);
> +	return -ENXIO;
> +}
> +
> +static int dm_dax_dev_release(struct dm_target *ti, struct dm_dev *dev,
> +		sector_t start, sector_t len, void *owner)
> +{
> +	fs_dax_release(dev->dax_dev, owner);
> +	return 0;
> +}
> +
> +static void dm_dax_iterate_devices(struct dax_device *dax_dev,
> +		iterate_devices_callout_fn fn, void *arg)
> +{
> +	struct mapped_device *md = dax_get_private(dax_dev);
> +	struct dm_table *map;
> +	struct dm_target *ti;
> +	int i, srcu_idx;
> +
> +	map = dm_get_live_table(md, &srcu_idx);
> +
> +	for (i = 0; i < dm_table_get_num_targets(map); i++) {
> +		ti = dm_table_get_target(map, i);
> +
> +		if (ti->type->iterate_devices)
> +			ti->type->iterate_devices(ti, fn, arg);
> +	}
> +
> +	dm_put_live_table(md, srcu_idx);
> +}
> +
> +static struct dax_device *dm_dax_fs_claim(struct dax_device *dax_dev,
> +		void *owner)
> +{
> +	dm_dax_iterate_devices(dax_dev, dm_dax_dev_claim, owner);
> +	/* see comment in dm_dax_dev_claim about this unconditional return */
> +	return dax_dev;
> +}
> +
> +static void dm_dax_fs_release(struct dax_device *dax_dev, void *owner)
> +{
> +	dm_dax_iterate_devices(dax_dev, dm_dax_dev_release, owner);
> +}
> +
>  /*
>   * A target may call dm_accept_partial_bio only from the map routine.  It is
>   * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
> @@ -3133,6 +3188,8 @@ static const struct block_device_operations dm_blk_dops = {
>  static const struct dax_operations dm_dax_ops = {
>  	.direct_access = dm_dax_direct_access,
>  	.copy_from_iter = dm_dax_copy_from_iter,
> +	.fs_claim = dm_dax_fs_claim,
> +	.fs_release = dm_dax_fs_release,
>  };
>  
>  /*
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index f9eb22ad341e..af02f93c943a 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> +#include <linux/blkdev.h>
>  #include <linux/radix-tree.h>
>  #include <asm/pgtable.h>
>  
> @@ -20,6 +21,10 @@ struct dax_operations {
>  	/* copy_from_iter: required operation for fs-dax direct-i/o */
>  	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
>  			struct iov_iter *);
> +	/* fs_claim: setup filesytem parameters for the device's dev_pagemap */
> +	struct dax_device *(*fs_claim)(struct dax_device *, void *);
> +	/* fs_release: restore device's dev_pagemap to its default state */
> +	void (*fs_release)(struct dax_device *, void *);
>  };
>  
>  extern struct attribute_group dax_attribute_group;
> @@ -83,6 +88,26 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct block_device *bdev, struct writeback_control *wbc);
> +struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
> +#ifdef CONFIG_BLOCK
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
> +{
> +	struct dax_device *dax_dev;
> +
> +	if (!blk_queue_dax(bdev->bd_queue))
> +		return NULL;
> +	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
> +	return fs_dax_claim(dax_dev, owner);
> +}
> +#else
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
> +{
> +	return NULL;
> +}
> +#endif
> +void fs_dax_release(struct dax_device *dax_dev, void *owner);
>  #else
>  static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
>  {
> @@ -108,6 +133,29 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
>  {
>  	return -EOPNOTSUPP;
>  }
> +
> +static inline struct dax_device *fs_dax_claim(struct dax_device *dax_dev,
> +		void *owner)
> +{
> +	return NULL;
> +}
> +
> +#ifdef CONFIG_BLOCK
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
> +{
> +	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> +}
> +#else
> +static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
> +		void *owner)
> +{
> +	return NULL;
> +}
> +#endif
> +static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
> +{
> +}
>  #endif
>  
>  int dax_read_lock(void);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 7b4899c06f49..02d6d042ee7f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -53,11 +53,19 @@ struct vmem_altmap {
>   * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>   * type. Any page of a process can be migrated to such memory. However no one
>   * should be allow to pin such memory so that it can always be evicted.
> + *
> + * MEMORY_DEVICE_FS_DAX:
> + * When MEMORY_DEVICE_HOST memory is represented by a device that can
> + * host a filesystem, for example /dev/pmem0, that filesystem can
> + * register for a callback when a page is idled. For the filesystem-dax
> + * case page idle callbacks are used to coordinate DMA vs
> + * hole-punch/truncate.
>   */
>  enum memory_type {
>  	MEMORY_DEVICE_HOST = 0,
>  	MEMORY_DEVICE_PRIVATE,
>  	MEMORY_DEVICE_PUBLIC,
> +	MEMORY_DEVICE_FS_DAX,
>  };
>  
>  /*
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap
  2018-04-24 23:33 ` [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap Dan Williams
@ 2018-05-09 10:46   ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2018-05-09 10:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, stable, Jan Kara, david, hch, linux-fsdevel,
	linux-xfs, linux-mm

On Tue 24-04-18 16:33:29, Dan Williams wrote:
> get_user_pages_fast() for device pages is missing the typical validation
> that all page references have been taken while the mapping was valid.
> Without this validation truncate operations can not reliably coordinate
> against new page reference events like O_DIRECT.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
> Reported-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  mm/gup.c |   36 ++++++++++++++++++++++++++----------
>  1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 76af4cfeaf68..84dd2063ca3d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1456,32 +1456,48 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  	return 1;
>  }
>  
> -static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
> +static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr)
>  {
>  	unsigned long fault_pfn;
> +	int nr_start = *nr;
> +
> +	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> +		return 0;
>  
> -	fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		undo_dev_pagemap(nr, nr_start, pages);
> +		return 0;
> +	}
> +	return 1;
>  }
>  
> -static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
> +static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr)
>  {
>  	unsigned long fault_pfn;
> +	int nr_start = *nr;
> +
> +	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> +		return 0;
>  
> -	fault_pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
> +	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> +		undo_dev_pagemap(nr, nr_start, pages);
> +		return 0;
> +	}
> +	return 1;
>  }
>  #else
> -static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
> +static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr)
>  {
>  	BUILD_BUG();
>  	return 0;
>  }
>  
> -static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
> +static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr)
>  {
>  	BUILD_BUG();
> @@ -1499,7 +1515,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		return 0;
>  
>  	if (pmd_devmap(orig))
> -		return __gup_device_huge_pmd(orig, addr, end, pages, nr);
> +		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
>  
>  	refs = 0;
>  	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> @@ -1537,7 +1553,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		return 0;
>  
>  	if (pud_devmap(orig))
> -		return __gup_device_huge_pud(orig, addr, end, pages, nr);
> +		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
>  
>  	refs = 0;
>  	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-04-24 23:33 ` [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
@ 2018-05-09 10:56   ` Jan Kara
  2018-05-09 22:06     ` Dan Williams
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2018-05-09 10:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Jeff Moyer, Dave Chinner, Matthew Wilcox,
	Alexander Viro, Darrick J. Wong, Ross Zwisler, Dave Hansen,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, linux-xfs,
	linux-mm

On Tue 24-04-18 16:33:35, Dan Williams wrote:
> Background:
> 
> get_user_pages() in the filesystem pins file backed memory pages for
> access by devices performing dma. However, it only pins the memory pages
> not the page-to-file offset association. If a file is truncated the
> pages are mapped out of the file and dma may continue indefinitely into
> a page that is owned by a device driver. This breaks coherency of the
> file vs dma, but the assumption is that if userspace wants the
> file-space truncated it does not matter what data is inbound from the
> device, it is not relevant anymore. The only expectation is that dma can
> safely continue while the filesystem reallocates the block(s).
> 
> Problem:
> 
> This expectation that dma can safely continue while the filesystem
> changes the block map is broken by dax. With dax the target dma page
> *is* the filesystem block. The model of leaving the page pinned for dma,
> but truncating the file block out of the file, means that the filesytem
> is free to reallocate a block under active dma to another file and now
> the expected data-incoherency situation has turned into active
> data-corruption.
> 
> Solution:
> 
> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
> file while any page/block in the file is under active dma. This solution
> assumes that dma is transient. Cases where dma operations are known to
> not be transient, like RDMA, have been explicitly disabled via
> commits like 5f1d43de5416 "IB/core: disable memory registration of
> filesystem-dax vmas".
> 
> The dax_layout_busy_page() routine is called by filesystems with a lock
> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
> The process of looking up a busy page invalidates all mappings
> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
> The filesystem continues to call dax_layout_busy_page() until it finally
> returns no more active pages. This approach assumes that the page
> pinning is transient, if that assumption is violated the system would
> have likely hung from the uncompleted I/O.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

A few nits below. After fixing those feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 86b3806ea35b..89f21bd9da10 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -167,7 +167,7 @@ struct dax_device {
>  #if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
>  static void generic_dax_pagefree(struct page *page, void *data)
>  {
> -	/* TODO: wakeup page-idle waiters */
> +	wake_up_var(&page->_refcount);
>  }
>  
>  static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,

Why is this hunk in this patch? We don't wait for page refcount here. OTOH
I agree I don't see much better patch to fold this into.

> diff --git a/fs/Kconfig b/fs/Kconfig
> index 1e050e012eb9..c9acbf695ddd 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -40,6 +40,7 @@ config FS_DAX
>  	depends on !(ARM || MIPS || SPARC)
>  	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>  	select FS_IOMAP
> +	select SRCU

No need for this anymore I guess.

> diff --git a/mm/gup.c b/mm/gup.c
> index 84dd2063ca3d..75ade7ebddb2 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -13,6 +13,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/rwsem.h>
>  #include <linux/hugetlb.h>
> +#include <linux/dax.h>
>  
>  #include <asm/mmu_context.h>
>  #include <asm/pgtable.h>

Why is this hunk here?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts()
  2018-04-24 23:33 ` [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
@ 2018-05-09 12:27   ` Jan Kara
  2018-05-09 22:54     ` Dan Williams
  2018-05-09 14:38   ` Darrick J. Wong
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kara @ 2018-05-09 12:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Dave Chinner, Darrick J. Wong, Ross Zwisler,
	Jan Kara, Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm

On Tue 24-04-18 16:33:50, Dan Williams wrote:
> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
> for busy / pinned dax pages and waits for those pages to go idle before
> any potential extent unmap operation.
> 
> dax_layout_busy_page() handles synchronizing against new page-busy
> events (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the xfs inode
> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
> busy page it returns it for xfs to wait for the page-idle event that
> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
> pages are idle at count 1, see generic_dax_pagefree()).
> 
> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
> deadlock the process that might be trying to elevate the page count of
> more pages before arranging for any of them to go idle. I.e. the typical
> case of submitting I/O is that iov_iter_get_pages() elevates the
> reference count of all pages in the I/O before starting I/O on the first
> page. The process of elevating the reference count of all pages involved
> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
> 
> Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
> held while sleeping. We need this to prevent starvation of the truncate
> path as continuous submission of direct-I/O could starve the truncate
> path indefinitely if the lock is dropped.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me except some nits below. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

for as much as it is worth with XFS code ;)

> +static int
> +xfs_break_dax_layouts(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct page		*page;
> +
> +	*did_unlock = false;
> +	page = dax_layout_busy_page(inode->i_mapping);
> +	if (!page)
> +		return 0;
> +
> +	return ___wait_var_event(&page->_refcount,
> +			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> +			0, 0, xfs_wait_dax_page(inode, did_unlock));
> +}
> +
>  int
>  xfs_break_layouts(
>  	struct inode		*inode,
> @@ -729,17 +760,23 @@ xfs_break_layouts(
>  
>  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>  
> -	switch (reason) {
> -	case BREAK_UNMAP:
> -		ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> -		/* fall through */
> -	case BREAK_WRITE:
> -		error = xfs_break_leased_layouts(inode, iolock, &retry);
> -		break;
> -	default:
> -		WARN_ON_ONCE(1);
> -		return -EINVAL;
> -	}
> +	do {
> +		switch (reason) {
> +		case BREAK_UNMAP:
> +			ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));

Maybe move the assertion to xfs_break_dax_layouts()?

> +
> +			error = xfs_break_dax_layouts(inode, *iolock, &retry);
> +			/* fall through */
> +		case BREAK_WRITE:
> +			if (error || retry)
> +				break;

The error handling IMHO belongs above the 'fall through' comment above.

> +			error = xfs_break_leased_layouts(inode, iolock, &retry);
> +			break;
> +		default:
> +			WARN_ON_ONCE(1);
> +			return -EINVAL;
> +		}
> +	} while (error == 0 && retry);

As a general 'taste' comment, I prefer if the 'retry' is always initialized
to 'false' at the beginning of the loop body in these kinds of loops. That
way it is obvious we are doing the right thing when looking at the loop
body and we don't have to verify that each case statement initializes
'retry' properly (in fact I'd remove the initialization from
xfs_break_dax_layouts() and xfs_break_leased_layouts()). But this is more a
matter of taste and consistency with other code in the area so I defer to
XFS maintainers for a final opinion. Darrick?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts()
  2018-04-24 23:33 ` [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
  2018-05-09 12:27   ` Jan Kara
@ 2018-05-09 14:38   ` Darrick J. Wong
  1 sibling, 0 replies; 21+ messages in thread
From: Darrick J. Wong @ 2018-05-09 14:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Dave Chinner, Ross Zwisler, Jan Kara,
	Christoph Hellwig, linux-fsdevel, linux-xfs, linux-mm

On Tue, Apr 24, 2018 at 04:33:50PM -0700, Dan Williams wrote:
> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
> for busy / pinned dax pages and waits for those pages to go idle before
> any potential extent unmap operation.
> 
> dax_layout_busy_page() handles synchronizing against new page-busy
> events (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the xfs inode
> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
> busy page it returns it for xfs to wait for the page-idle event that
> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
> pages are idle at count 1, see generic_dax_pagefree()).
> 
> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
> deadlock the process that might be trying to elevate the page count of
> more pages before arranging for any of them to go idle. I.e. the typical
> case of submitting I/O is that iov_iter_get_pages() elevates the
> reference count of all pages in the I/O before starting I/O on the first
> page. The process of elevating the reference count of all pages involved
> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
> 
> Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
> held while sleeping. We need this to prevent starvation of the truncate
> path as continuous submission of direct-I/O could starve the truncate
> path indefinitely if the lock is dropped.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I should've acked this explicitly since it's xfs code,
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>

The rest of it looks fine enough to me too, but there's no
Acked-by-goober tag to put on them. :P

--D

> ---
>  fs/xfs/xfs_file.c |   59 +++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 48 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1a5176b21803..4e98d0dcc035 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -718,6 +718,37 @@ xfs_file_write_iter(
>  	return ret;
>  }
>  
> +static void
> +xfs_wait_dax_page(
> +	struct inode		*inode,
> +	bool			*did_unlock)
> +{
> +	struct xfs_inode        *ip = XFS_I(inode);
> +
> +	*did_unlock = true;
> +	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +	schedule();
> +	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +}
> +
> +static int
> +xfs_break_dax_layouts(
> +	struct inode		*inode,
> +	uint			iolock,
> +	bool			*did_unlock)
> +{
> +	struct page		*page;
> +
> +	*did_unlock = false;
> +	page = dax_layout_busy_page(inode->i_mapping);
> +	if (!page)
> +		return 0;
> +
> +	return ___wait_var_event(&page->_refcount,
> +			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> +			0, 0, xfs_wait_dax_page(inode, did_unlock));
> +}
> +
>  int
>  xfs_break_layouts(
>  	struct inode		*inode,
> @@ -729,17 +760,23 @@ xfs_break_layouts(
>  
>  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>  
> -	switch (reason) {
> -	case BREAK_UNMAP:
> -		ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> -		/* fall through */
> -	case BREAK_WRITE:
> -		error = xfs_break_leased_layouts(inode, iolock, &retry);
> -		break;
> -	default:
> -		WARN_ON_ONCE(1);
> -		return -EINVAL;
> -	}
> +	do {
> +		switch (reason) {
> +		case BREAK_UNMAP:
> +			ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
> +
> +			error = xfs_break_dax_layouts(inode, *iolock, &retry);
> +			/* fall through */
> +		case BREAK_WRITE:
> +			if (error || retry)
> +				break;
> +			error = xfs_break_leased_layouts(inode, iolock, &retry);
> +			break;
> +		default:
> +			WARN_ON_ONCE(1);
> +			return -EINVAL;
> +		}
> +	} while (error == 0 && retry);
>  
>  	return error;
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings
  2018-05-09 10:56   ` Jan Kara
@ 2018-05-09 22:06     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-09 22:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Jeff Moyer, Dave Chinner, Matthew Wilcox,
	Alexander Viro, Darrick J. Wong, Ross Zwisler, Dave Hansen,
	Andrew Morton, Christoph Hellwig, linux-fsdevel, linux-xfs,
	Linux MM

On Wed, May 9, 2018 at 3:56 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 24-04-18 16:33:35, Dan Williams wrote:
>> Background:
>>
>> get_user_pages() in the filesystem pins file backed memory pages for
>> access by devices performing dma. However, it only pins the memory pages
>> not the page-to-file offset association. If a file is truncated the
>> pages are mapped out of the file and dma may continue indefinitely into
>> a page that is owned by a device driver. This breaks coherency of the
>> file vs dma, but the assumption is that if userspace wants the
>> file-space truncated it does not matter what data is inbound from the
>> device, it is not relevant anymore. The only expectation is that dma can
>> safely continue while the filesystem reallocates the block(s).
>>
>> Problem:
>>
>> This expectation that dma can safely continue while the filesystem
>> changes the block map is broken by dax. With dax the target dma page
>> *is* the filesystem block. The model of leaving the page pinned for dma,
>> but truncating the file block out of the file, means that the filesytem
>> is free to reallocate a block under active dma to another file and now
>> the expected data-incoherency situation has turned into active
>> data-corruption.
>>
>> Solution:
>>
>> Defer all filesystem operations (fallocate(), truncate()) on a dax mode
>> file while any page/block in the file is under active dma. This solution
>> assumes that dma is transient. Cases where dma operations are known to
>> not be transient, like RDMA, have been explicitly disabled via
>> commits like 5f1d43de5416 "IB/core: disable memory registration of
>> filesystem-dax vmas".
>>
>> The dax_layout_busy_page() routine is called by filesystems with a lock
>> held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
>> The process of looking up a busy page invalidates all mappings
>> to trigger any subsequent get_user_pages() to block on i_mmap_lock.
>> The filesystem continues to call dax_layout_busy_page() until it finally
>> returns no more active pages. This approach assumes that the page
>> pinning is transient, if that assumption is violated the system would
>> have likely hung from the uncompleted I/O.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> A few nits below. After fixing those feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 86b3806ea35b..89f21bd9da10 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -167,7 +167,7 @@ struct dax_device {
>>  #if IS_ENABLED(CONFIG_FS_DAX) && IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS)
>>  static void generic_dax_pagefree(struct page *page, void *data)
>>  {
>> -     /* TODO: wakeup page-idle waiters */
>> +     wake_up_var(&page->_refcount);
>>  }
>>
>>  static struct dax_device *__fs_dax_claim(struct dax_device *dax_dev,
>
> Why is this hunk in this patch? We don't wait for page refcount here. OTOH
> I agree I don't see much better patch to fold this into.

I had it here because this patch is the enabling point where
filesystems can start using dax_layout_busy_page(). Otherwise I could
move it to the first patch that introduces a wait_var_event() for this
wake-up, but that's an xfs patch and seems out of place. In other
words, theoretically someone could backport just to this point and go
enable another filesystem without worrying about the xfs changes.

>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 1e050e012eb9..c9acbf695ddd 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -40,6 +40,7 @@ config FS_DAX
>>       depends on !(ARM || MIPS || SPARC)
>>       select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>>       select FS_IOMAP
>> +     select SRCU
>
> No need for this anymore I guess.

Yup, stale, removed.

>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 84dd2063ca3d..75ade7ebddb2 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -13,6 +13,7 @@
>>  #include <linux/sched/signal.h>
>>  #include <linux/rwsem.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/dax.h>
>>
>>  #include <asm/mmu_context.h>
>>  #include <asm/pgtable.h>
>
> Why is this hunk here?

Also stale, and removed. It was there for the now removed dax_layout_lock().

Good catches, thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts()
  2018-05-09 12:27   ` Jan Kara
@ 2018-05-09 22:54     ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-09 22:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Dave Chinner, Darrick J. Wong, Ross Zwisler,
	Christoph Hellwig, linux-fsdevel, linux-xfs, Linux MM

On Wed, May 9, 2018 at 5:27 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 24-04-18 16:33:50, Dan Williams wrote:
>> xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
>> for busy / pinned dax pages and waits for those pages to go idle before
>> any potential extent unmap operation.
>>
>> dax_layout_busy_page() handles synchronizing against new page-busy
>> events (get_user_pages). It invalidates all mappings to trigger the
>> get_user_pages slow path which will eventually block on the xfs inode
>> lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
>> busy page it returns it for xfs to wait for the page-idle event that
>> will fire when the page reference count reaches 1 (recall ZONE_DEVICE
>> pages are idle at count 1, see generic_dax_pagefree()).
>>
>> While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
>> deadlock the process that might be trying to elevate the page count of
>> more pages before arranging for any of them to go idle. I.e. the typical
>> case of submitting I/O is that iov_iter_get_pages() elevates the
>> reference count of all pages in the I/O before starting I/O on the first
>> page. The process of elevating the reference count of all pages involved
>> in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
>>
>> Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
>> held while sleeping. We need this to prevent starvation of the truncate
>> path as continuous submission of direct-I/O could starve the truncate
>> path indefinitely if the lock is dropped.
>>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reported-by: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Looks good to me except some nits below. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> for as much as it is worth with XFS code ;)
>
>> +static int
>> +xfs_break_dax_layouts(
>> +     struct inode            *inode,
>> +     uint                    iolock,
>> +     bool                    *did_unlock)
>> +{
>> +     struct page             *page;
>> +
>> +     *did_unlock = false;
>> +     page = dax_layout_busy_page(inode->i_mapping);
>> +     if (!page)
>> +             return 0;
>> +
>> +     return ___wait_var_event(&page->_refcount,
>> +                     atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
>> +                     0, 0, xfs_wait_dax_page(inode, did_unlock));
>> +}
>> +
>>  int
>>  xfs_break_layouts(
>>       struct inode            *inode,
>> @@ -729,17 +760,23 @@ xfs_break_layouts(
>>
>>       ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>>
>> -     switch (reason) {
>> -     case BREAK_UNMAP:
>> -             ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
>> -             /* fall through */
>> -     case BREAK_WRITE:
>> -             error = xfs_break_leased_layouts(inode, iolock, &retry);
>> -             break;
>> -     default:
>> -             WARN_ON_ONCE(1);
>> -             return -EINVAL;
>> -     }
>> +     do {
>> +             switch (reason) {
>> +             case BREAK_UNMAP:
>> +                     ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
>
> Maybe move the assertion to xfs_break_dax_layouts()?

Makes sense, sure.

>
>> +
>> +                     error = xfs_break_dax_layouts(inode, *iolock, &retry);
>> +                     /* fall through */
>> +             case BREAK_WRITE:
>> +                     if (error || retry)
>> +                             break;
>
> The error handling IMHO belongs above the 'fall through' comment above.

Ok, yes I think this location is a stale holdover.

>
>> +                     error = xfs_break_leased_layouts(inode, iolock, &retry);
>> +                     break;
>> +             default:
>> +                     WARN_ON_ONCE(1);
>> +                     return -EINVAL;
>> +             }
>> +     } while (error == 0 && retry);
>
> As a general 'taste' comment, I prefer if the 'retry' is always initialized
> to 'false' at the beginning of the loop body in these kinds of loops. That
> way it is obvious we are doing the right thing when looking at the loop
> body and we don't have to verify that each case statement initializes
> 'retry' properly (in fact I'd remove the initialization from
> xfs_break_dax_layouts() and xfs_break_leased_layouts()). But this is more a
> matter of taste and consistency with other code in the area so I defer to
> XFS maintainers for a final opinion. Darrick?

I'm fine with making this change, I'll proceed unless / until Darrick says no.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2018-04-24 23:33 ` [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-05-16  7:20   ` Dan Williams
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Williams @ 2018-05-16  7:20 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jan Kara, david, linux-xfs, Linux MM,
	Jérôme Glisse, linux-fsdevel, Christoph Hellwig,
	Dave Jiang

On Tue, Apr 24, 2018 at 4:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> In order to resolve collisions between filesystem operations and DMA to
> DAX mapped pages we need a callback when DMA completes. With a callback
> we can hold off filesystem operations while DMA is in-flight and then
> resume those operations when the last put_page() occurs on a DMA page.
>
> Recall that the 'struct page' entries for DAX memory are created with
> devm_memremap_pages(). That routine arranges for the pages to be
> allocated, but never onlined, so a DAX page is DMA-idle when its
> reference count reaches one.
>
> Also recall that the HMM sub-system added infrastructure to trap the
> page-idle (2-to-1 reference count) transition of the pages allocated by
> devm_memremap_pages() and trigger a callback via the 'struct
> dev_pagemap' associated with the page range. Whereas the HMM callbacks
> are going to a device driver to manage bounce pages in device-memory in
> the filesystem-dax case we will call back to filesystem specified
> callback.
>
> Since the callback is not known at devm_memremap_pages() time we arrange
> for the filesystem to install it at mount time. No functional changes
> are expected as this only registers a nop handler for the ->page_free()
> event for device-mapped pages.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Ugh, so it turns out this approach is broken in the face of
partitions. Thanks to Dave J for noticing this with xfstests, I was
only testing with whole devices.

Thankfully, the fix is straightforward. Just delete this and patch1
and move generic_dax_pagefree() into the pmem driver directly. It's a
1:1 relationship between pgmap and the dax_device, so there is no need
for this dynamic claim infrastructure. I'll send out the patches that
got modified after rebasing on the removal of this patch.

>  drivers/dax/super.c   |   21 +++++++++++----------
>  drivers/nvdimm/pmem.c |    3 ++-
>  fs/ext2/super.c       |    6 +++---
>  fs/ext4/super.c       |    6 +++---
>  fs/xfs/xfs_super.c    |   20 ++++++++++----------
>  include/linux/dax.h   |   14 ++++++++------
>  6 files changed, 37 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e62a64b9c9fb..e4864f319e16 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -63,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
>  }
>  EXPORT_SYMBOL(bdev_dax_pgoff);
>
> -#if IS_ENABLED(CONFIG_FS_DAX)
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -       if (!blk_queue_dax(bdev->bd_queue))
> -               return NULL;
> -       return fs_dax_get_by_host(bdev->bd_disk->disk_name);
> -}
> -EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -#endif
> -
>  /**
>   * __bdev_dax_supported() - Check if the device supports dax for filesystem
>   * @sb: The superblock of the device
> @@ -575,6 +565,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
>  }
>  EXPORT_SYMBOL_GPL(alloc_dax);
>
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +               const struct dax_operations *ops, struct dev_pagemap *pgmap)
> +{
> +       struct dax_device *dax_dev = alloc_dax(private, host, ops);
> +
> +       if (dax_dev)
> +               dax_dev->pgmap = pgmap;
> +       return dax_dev;
> +}
> +EXPORT_SYMBOL_GPL(alloc_dax_devmap);
> +
>  void put_dax(struct dax_device *dax_dev)
>  {
>         if (!dax_dev)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 9d714926ecf5..fc1a1ab25e9e 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -408,7 +408,8 @@ static int pmem_attach_disk(struct device *dev,
>         nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
>         disk->bb = &pmem->bb;
>
> -       dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
> +       dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
> +                       &pmem->pgmap);
>         if (!dax_dev) {
>                 put_disk(disk);
>                 return -ENOMEM;
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index de1694512f1f..421c7d4bed39 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -172,7 +172,7 @@ static void ext2_put_super (struct super_block * sb)
>         brelse (sbi->s_sbh);
>         sb->s_fs_info = NULL;
>         kfree(sbi->s_blockgroup_lock);
> -       fs_put_dax(sbi->s_daxdev);
> +       fs_dax_release(sbi->s_daxdev, sb);
>         kfree(sbi);
>  }
>
> @@ -817,7 +817,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
>
>  static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -       struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +       struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>         struct buffer_head * bh;
>         struct ext2_sb_info * sbi;
>         struct ext2_super_block * es;
> @@ -1213,7 +1213,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>         kfree(sbi->s_blockgroup_lock);
>         kfree(sbi);
>  failed:
> -       fs_put_dax(dax_dev);
> +       fs_dax_release(dax_dev, sb);
>         return ret;
>  }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 185f7e61f4cf..3e5d0f9e8772 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -954,7 +954,7 @@ static void ext4_put_super(struct super_block *sb)
>         if (sbi->s_chksum_driver)
>                 crypto_free_shash(sbi->s_chksum_driver);
>         kfree(sbi->s_blockgroup_lock);
> -       fs_put_dax(sbi->s_daxdev);
> +       fs_dax_release(sbi->s_daxdev, sb);
>         kfree(sbi);
>  }
>
> @@ -3407,7 +3407,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
>
>  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  {
> -       struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
> +       struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
>         char *orig_data = kstrdup(data, GFP_KERNEL);
>         struct buffer_head *bh;
>         struct ext4_super_block *es = NULL;
> @@ -4429,7 +4429,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  out_free_base:
>         kfree(sbi);
>         kfree(orig_data);
> -       fs_put_dax(dax_dev);
> +       fs_dax_release(dax_dev, sb);
>         return err ? err : ret;
>  }
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d71424052917..f53f8a47a526 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -724,7 +724,7 @@ xfs_close_devices(
>
>                 xfs_free_buftarg(mp->m_logdev_targp);
>                 xfs_blkdev_put(logdev);
> -               fs_put_dax(dax_logdev);
> +               fs_dax_release(dax_logdev, mp);
>         }
>         if (mp->m_rtdev_targp) {
>                 struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
> @@ -732,10 +732,10 @@ xfs_close_devices(
>
>                 xfs_free_buftarg(mp->m_rtdev_targp);
>                 xfs_blkdev_put(rtdev);
> -               fs_put_dax(dax_rtdev);
> +               fs_dax_release(dax_rtdev, mp);
>         }
>         xfs_free_buftarg(mp->m_ddev_targp);
> -       fs_put_dax(dax_ddev);
> +       fs_dax_release(dax_ddev, mp);
>  }
>
>  /*
> @@ -753,9 +753,9 @@ xfs_open_devices(
>         struct xfs_mount        *mp)
>  {
>         struct block_device     *ddev = mp->m_super->s_bdev;
> -       struct dax_device       *dax_ddev = fs_dax_get_by_bdev(ddev);
> -       struct dax_device       *dax_logdev = NULL, *dax_rtdev = NULL;
> +       struct dax_device       *dax_ddev = fs_dax_claim_bdev(ddev, mp);
>         struct block_device     *logdev = NULL, *rtdev = NULL;
> +       struct dax_device       *dax_logdev = NULL, *dax_rtdev = NULL;
>         int                     error;
>
>         /*
> @@ -765,7 +765,7 @@ xfs_open_devices(
>                 error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
>                 if (error)
>                         goto out;
> -               dax_logdev = fs_dax_get_by_bdev(logdev);
> +               dax_logdev = fs_dax_claim_bdev(logdev, mp);
>         }
>
>         if (mp->m_rtname) {
> @@ -779,7 +779,7 @@ xfs_open_devices(
>                         error = -EINVAL;
>                         goto out_close_rtdev;
>                 }
> -               dax_rtdev = fs_dax_get_by_bdev(rtdev);
> +               dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
>         }
>
>         /*
> @@ -813,14 +813,14 @@ xfs_open_devices(
>         xfs_free_buftarg(mp->m_ddev_targp);
>   out_close_rtdev:
>         xfs_blkdev_put(rtdev);
> -       fs_put_dax(dax_rtdev);
> +       fs_dax_release(dax_rtdev, mp);
>   out_close_logdev:
>         if (logdev && logdev != ddev) {
>                 xfs_blkdev_put(logdev);
> -               fs_put_dax(dax_logdev);
> +               fs_dax_release(dax_logdev, mp);
>         }
>   out:
> -       fs_put_dax(dax_ddev);
> +       fs_dax_release(dax_ddev, mp);
>         return error;
>  }
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index af02f93c943a..fe322d67856e 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -33,6 +33,8 @@ extern struct attribute_group dax_attribute_group;
>  struct dax_device *dax_get_by_host(const char *host);
>  struct dax_device *alloc_dax(void *private, const char *host,
>                 const struct dax_operations *ops);
> +struct dax_device *alloc_dax_devmap(void *private, const char *host,
> +               const struct dax_operations *ops, struct dev_pagemap *pgmap);
>  void put_dax(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
> @@ -51,6 +53,12 @@ static inline struct dax_device *alloc_dax(void *private, const char *host,
>          */
>         return NULL;
>  }
> +static inline struct dax_device *alloc_dax_devmap(void *private,
> +               const char *host, const struct dax_operations *ops,
> +               struct dev_pagemap *pgmap)
> +{
> +       return NULL;
> +}
>  static inline void put_dax(struct dax_device *dax_dev)
>  {
>  }
> @@ -85,7 +93,6 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
>         put_dax(dax_dev);
>  }
>
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
>                 struct block_device *bdev, struct writeback_control *wbc);
>  struct dax_device *fs_dax_claim(struct dax_device *dax_dev, void *owner);
> @@ -123,11 +130,6 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
>  {
>  }
>
> -static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -       return NULL;
> -}
> -
>  static inline int dax_writeback_mapping_range(struct address_space *mapping,
>                 struct block_device *bdev, struct writeback_control *wbc)
>  {
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-05-16  7:20 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-24 23:33 [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
2018-04-24 23:33 ` [PATCH v9 1/9] dax, dm: introduce ->fs_{claim, release}() dax_device infrastructure Dan Williams
2018-05-09 10:37   ` Jan Kara
2018-04-24 23:33 ` [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
2018-05-16  7:20   ` Dan Williams
2018-04-24 23:33 ` [PATCH v9 3/9] memremap: split devm_memremap_pages() and memremap() infrastructure Dan Williams
2018-05-09 10:29   ` Jan Kara
2018-04-24 23:33 ` [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-04-24 23:33 ` [PATCH v9 5/9] mm: fix __gup_device_huge vs unmap Dan Williams
2018-05-09 10:46   ` Jan Kara
2018-04-24 23:33 ` [PATCH v9 6/9] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
2018-05-09 10:56   ` Jan Kara
2018-05-09 22:06     ` Dan Williams
2018-04-24 23:33 ` [PATCH v9 7/9] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
2018-04-24 23:33 ` [PATCH v9 8/9] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
2018-04-24 23:33 ` [PATCH v9 9/9] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
2018-05-09 12:27   ` Jan Kara
2018-05-09 22:54     ` Dan Williams
2018-05-09 14:38   ` Darrick J. Wong
2018-05-03 23:53 ` [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch Dan Williams
2018-05-08  0:16   ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).