nvdimm.lists.linux.dev archive mirror
* [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch
@ 2017-12-24  0:56 Dan Williams
  2017-12-24  0:56 ` [PATCH v4 01/18] mm, dax: introduce pfn_t_special() Dan Williams
                   ` (18 more replies)
  0 siblings, 19 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, jack, Peter Zijlstra, Benjamin Herrenschmidt,
	Dave Hansen, Heiko Carstens, Andreas Dilger, hch, Matthew Wilcox,
	Michael Ellerman, Ingo Molnar, Martin Schwidefsky, linux-ext4,
	Dave Chinner, linux-nvdimm, Jérôme Glisse,
	Alexander Viro, Gerald Schaefer, Theodore Ts'o,
	Darrick J. Wong, linux-xfs, Jan Kara, linux-fsdevel,
	Paul Mackerras, Kirill A. Shutemov

Changes since v3 [1]:
* Kill the i_daxdma_lock, and do not impose any new locking constraints
  on filesystem implementations (Dave)

* Reuse the existing i_mmap_lock for synchronizing against
  get_user_pages() by unmapping and causing punch-hole/truncate to
  re-fault the page before get_user_pages() can elevate the page reference
  count (Jan)

* Create a dax-specific address_space_operations instance for each
  filesystem. This allows page->mapping to be set for dax pages (Jan).

* Change the ext4 and ext2 policy for 'mount -o dax' vs a device that
  does not support dax. This converts any environments that may have
  been using 'page-less' dax back to using the page cache.

* Rename wait_on_devmap_idle() to wait_on_atomic_one(), a generic
  facility for waiting for an atomic counter to reach a value of '1'.

[1]: https://lwn.net/Articles/737273/

---

Background:

get_user_pages() pins file-backed memory pages for access by dma
devices. However, it only pins the memory pages, not the page-to-file
offset association. If a file is truncated, the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency of the file vs dma, but the
assumption is that if userspace wants the file-space truncated, it does
not matter what data is inbound from the device; it is no longer
relevant. The only expectation is that dma can safely continue while the
filesystem reallocates the block(s).
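
For illustration only (not part of the patch set), the collision can be
set up from userspace roughly as follows: a MAP_SHARED mapping of a dax
file is handed to an O_DIRECT read on another file descriptor, which
makes the kernel pin the dax pages via get_user_pages() for the
duration of the I/O, while a second thread truncates the dax file. The
paths, sizes, and the build-with -pthread assumption below are made up
for the sketch.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (2UL << 20)

static int dax_fd;

static void *do_truncate(void *arg)
{
	/* shrink the dax file while the O_DIRECT read may still be in flight */
	ftruncate(dax_fd, 0);
	return NULL;
}

int main(void)
{
	pthread_t t;
	void *buf;
	int src_fd;

	dax_fd = open("/mnt/dax/data", O_RDWR | O_CREAT, 0644);
	src_fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
	if (dax_fd < 0 || src_fd < 0) {
		perror("open");
		return 1;
	}
	ftruncate(dax_fd, BUF_SIZE);

	/* with "-o dax" this mapping points directly at filesystem blocks */
	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
			dax_fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pthread_create(&t, NULL, do_truncate, NULL);

	/* direct-I/O pins the dax pages behind 'buf' until it completes */
	if (read(src_fd, buf, BUF_SIZE) < 0)
		perror("read");

	pthread_join(t, NULL);
	return 0;
}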

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax, the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the block out of the file, means that the filesystem is
free to reallocate a block under active dma to another file, and now the
expected data-incoherency situation has turned into active
data corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax-mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known not
to be transient, like RDMA, have been explicitly disabled via commits
like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_flush_dma() routine is called by filesystems with a lock held
against mm faults (i_mmap_lock). It then invalidates all mappings to
force any subsequent get_user_pages() to block on i_mmap_lock. Finally,
it scans/rescans all pages in the mapping until it observes that all
pages are idle.
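
As a rough orientation sketch of that sequence (names below such as
dax_mapping_busy_page() and dma_idle_wq are placeholders, not the real
interfaces; the actual implementation in patches 16 and 17 sleeps on the
page reference count via the new wait_on_atomic_one() facility and is
eventually woken from the dev_pagemap ->page_free() callback wired up in
patch 9):

#include <linux/mm.h>
#include <linux/wait.h>

/* placeholder: return any page in the mapping with an elevated refcount */
extern struct page *dax_mapping_busy_page(struct address_space *mapping);

static DECLARE_WAIT_QUEUE_HEAD(dma_idle_wq);

static void dax_flush_dma_sketch(struct address_space *mapping)
{
	struct page *page;

	/*
	 * Tear down all mappings; new get_user_pages() users must
	 * re-fault and will block on the lock the filesystem caller holds.
	 */
	unmap_mapping_range(mapping, 0, 0, 1);

	/* rescan until every page in the mapping is dma-idle (count == 1) */
	while ((page = dax_mapping_busy_page(mapping)) != NULL)
		wait_event(dma_idle_wq, page_ref_count(page) == 1);
}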

So far this solution only targets xfs, since it already implements
xfs_break_layouts() in all the locations that would need this
synchronization. It applies on top of the vmem_altmap / dev_pagemap
reworks from Christoph.

---

Dan Williams (18):
      mm, dax: introduce pfn_t_special()
      ext4: auto disable dax instead of failing mount
      ext2: auto disable dax instead of failing mount
      dax: require 'struct page' by default for filesystem dax
      dax: stop using VM_MIXEDMAP for dax
      dax: stop using VM_HUGEPAGE for dax
      dax: store pfns in the radix
      tools/testing/nvdimm: add 'bio_delay' mechanism
      mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
      mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
      fs, dax: introduce DEFINE_FSDAX_AOPS
      xfs: use DEFINE_FSDAX_AOPS
      ext4: use DEFINE_FSDAX_AOPS
      ext2: use DEFINE_FSDAX_AOPS
      mm, fs, dax: use page->mapping to warn if dma collides with truncate
      wait_bit: introduce {wait_on,wake_up}_atomic_one
      mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
      xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper


 arch/powerpc/platforms/Kconfig        |    1 
 arch/powerpc/sysdev/axonram.c         |    2 
 drivers/dax/device.c                  |    1 
 drivers/dax/super.c                   |  100 ++++++++++-
 drivers/nvdimm/pmem.c                 |    3 
 drivers/s390/block/Kconfig            |    1 
 drivers/s390/block/dcssblk.c          |    3 
 fs/Kconfig                            |    8 +
 fs/dax.c                              |  295 ++++++++++++++++++++++++++++-----
 fs/ext2/ext2.h                        |    1 
 fs/ext2/file.c                        |    1 
 fs/ext2/inode.c                       |   23 ++-
 fs/ext2/namei.c                       |   18 --
 fs/ext2/super.c                       |   13 +
 fs/ext4/file.c                        |    1 
 fs/ext4/inode.c                       |    6 +
 fs/ext4/super.c                       |   15 +-
 fs/xfs/Makefile                       |    3 
 fs/xfs/xfs_aops.c                     |    2 
 fs/xfs/xfs_aops.h                     |    1 
 fs/xfs/xfs_dma.c                      |   81 +++++++++
 fs/xfs/xfs_dma.h                      |   24 +++
 fs/xfs/xfs_file.c                     |    8 -
 fs/xfs/xfs_ioctl.c                    |    7 -
 fs/xfs/xfs_iops.c                     |   12 +
 fs/xfs/xfs_super.c                    |   20 +-
 include/linux/dax.h                   |   70 +++++++-
 include/linux/memremap.h              |   28 +--
 include/linux/mm.h                    |   62 +++++--
 include/linux/pfn_t.h                 |   13 +
 include/linux/vma.h                   |   23 +++
 include/linux/wait_bit.h              |   13 +
 kernel/memremap.c                     |   30 +++
 kernel/sched/wait_bit.c               |   59 ++++++-
 mm/Kconfig                            |    5 +
 mm/gup.c                              |    5 +
 mm/hmm.c                              |   13 -
 mm/huge_memory.c                      |    6 -
 mm/ksm.c                              |    3 
 mm/madvise.c                          |    2 
 mm/memory.c                           |   22 ++
 mm/migrate.c                          |    3 
 mm/mlock.c                            |    5 -
 mm/mmap.c                             |    8 -
 mm/swap.c                             |    3 
 tools/testing/nvdimm/Kbuild           |    1 
 tools/testing/nvdimm/test/iomap.c     |   62 +++++++
 tools/testing/nvdimm/test/nfit.c      |   34 ++++
 tools/testing/nvdimm/test/nfit_test.h |    1 
 49 files changed, 918 insertions(+), 203 deletions(-)
 create mode 100644 fs/xfs/xfs_dma.c
 create mode 100644 fs/xfs/xfs_dma.h
 create mode 100644 include/linux/vma.h

* [PATCH v4 01/18] mm, dax: introduce pfn_t_special()
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-04  8:16   ` Christoph Hellwig
  2017-12-24  0:56 ` [PATCH v4 02/18] ext4: auto disable dax instead of failing mount Dan Williams
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, hch

In support of removing the VM_MIXEDMAP indication from DAX VMAs,
introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
should be used for DAX ptes. This also helps identify drivers like
dcssblk that only want to use DAX in a read-only fashion without
get_user_pages() support.

Ideally we could delete axonram and dcssblk DAX support, but if we need
to keep it, better to make it explicit that axonram and dcssblk only
support a subset of DAX due to missing _PAGE_DEVMAP support.
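
For contrast with the conversions below (paraphrased sketch, not code
from this patch; 'phys', 'pgoff' and the helper name are illustrative),
a devmap-capable driver such as pmem returns pfns flagged
PFN_DEV|PFN_MAP from ->direct_access(), which is what allows
get_user_pages() to work, while the limited drivers converted here mark
their pfns PFN_DEV|PFN_SPECIAL:

#include <linux/mm.h>
#include <linux/pfn_t.h>

static pfn_t dax_pfn_example(phys_addr_t phys, pgoff_t pgoff, bool devmap)
{
	resource_size_t offset = (resource_size_t)pgoff * PAGE_SIZE;

	if (devmap)	/* backed by devm_memremap_pages(): gup-capable */
		return phys_to_pfn_t(phys + offset, PFN_DEV | PFN_MAP);

	/* limited dax: _PAGE_SPECIAL ptes, no get_user_pages() support */
	return phys_to_pfn_t(phys + offset, PFN_DEV | PFN_SPECIAL);
}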

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    2 +-
 drivers/s390/block/dcssblk.c  |    3 ++-
 include/linux/pfn_t.h         |   13 +++++++++++++
 mm/memory.c                   |   16 +++++++++++++++-
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1b307c80b401..cdbb0e59b3d3 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -151,7 +151,7 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_page
 	resource_size_t offset = pgoff * PAGE_SIZE;
 
 	*kaddr = (void *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV|PFN_SPECIAL);
 	return (bank->size - offset) / PAGE_SIZE;
 }
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 6aaefb780436..9cae08b36b80 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -916,7 +916,8 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
 
 	dev_sz = dev_info->end - dev_info->start + 1;
 	*kaddr = (void *) dev_info->start + offset;
-	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset),
+			PFN_DEV|PFN_SPECIAL);
 
 	return (dev_sz - offset) / PAGE_SIZE;
 }
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 43b1d7648e82..a03c2642a87c 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -15,8 +15,10 @@
 #define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
+#define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))
 
 #define PFN_FLAGS_TRACE \
+	{ PFN_SPECIAL,	"SPECIAL" }, \
 	{ PFN_SG_CHAIN,	"SG_CHAIN" }, \
 	{ PFN_SG_LAST,	"SG_LAST" }, \
 	{ PFN_DEV,	"DEV" }, \
@@ -120,4 +122,15 @@ pud_t pud_mkdevmap(pud_t pud);
 #endif
 #endif /* __HAVE_ARCH_PTE_DEVMAP */
 
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static inline bool pfn_t_special(pfn_t pfn)
+{
+	return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
+}
+#else
+static inline bool pfn_t_special(pfn_t pfn)
+{
+	return false;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
 #endif /* _LINUX_PFN_T_H_ */
diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..48a13473b401 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1897,12 +1897,26 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+{
+	/* these checks mirror the abort conditions in vm_normal_page */
+	if (vma->vm_flags & VM_MIXEDMAP)
+		return true;
+	if (pfn_t_devmap(pfn))
+		return true;
+	if (pfn_t_special(pfn))
+		return true;
+	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+		return true;
+	return false;
+}
+
 static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
-	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+	BUG_ON(!vm_mixed_ok(vma, pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;


* [PATCH v4 02/18] ext4: auto disable dax instead of failing mount
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
  2017-12-24  0:56 ` [PATCH v4 01/18] mm, dax: introduce pfn_t_special() Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-03 14:20   ` Jan Kara
  2017-12-24  0:56 ` [PATCH v4 03/18] ext2: " Dan Williams
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, linux-xfs, linux-fsdevel, hch

Bring the ext4 filesystem in line with xfs, which only warns and
continues when the "-o dax" option is specified at mount time and the
backing device does not support dax. This is in preparation for removing
dax support from devices that do not enable get_user_pages() operations
on dax mappings. In other words, 'gup' support is required, and
configurations that were using so-called 'page-less' dax will be
converted back to using the page cache.

Removing the broken 'page-less' dax support is a prerequisite for
removing the "EXPERIMENTAL" warning when mounting a filesystem in dax
mode.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/super.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7c46693a14d7..18873ea89e08 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3710,11 +3710,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		if (ext4_has_feature_inline_data(sb)) {
 			ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
 					" that may contain inline data");
-			goto failed_mount;
+			sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
 		}
 		err = bdev_dax_supported(sb, blocksize);
-		if (err)
-			goto failed_mount;
+		if (err) {
+			ext4_msg(sb, KERN_ERR,
+				"DAX unsupported by block device. Turning off DAX.");
+			sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
+		}
 	}
 
 	if (ext4_has_feature_encrypt(sb) && es->s_encryption_level) {


* [PATCH v4 03/18] ext2: auto disable dax instead of failing mount
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
  2017-12-24  0:56 ` [PATCH v4 01/18] mm, dax: introduce pfn_t_special() Dan Williams
  2017-12-24  0:56 ` [PATCH v4 02/18] ext4: auto disable dax instead of failing mount Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-03 14:21   ` Jan Kara
  2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, linux-xfs, linux-fsdevel, hch

Bring the ext2 filesystem in line with xfs, which only warns and
continues when the "-o dax" option is specified at mount time and the
backing device does not support dax. This is in preparation for removing
dax support from devices that do not enable get_user_pages() operations
on dax mappings. In other words, 'gup' support is required, and
configurations that were using so-called 'page-less' dax will be
converted back to using the page cache.

Removing the broken 'page-less' dax support is a prerequisite for
removing the "EXPERIMENTAL" warning when mounting a filesystem in dax
mode.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/super.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..38f9222606ee 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -959,8 +959,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 
 	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
 		err = bdev_dax_supported(sb, blocksize);
-		if (err)
-			goto failed_mount;
+		if (err) {
+			ext2_msg(sb, KERN_ERR,
+				"DAX unsupported by block device. Turning off DAX.");
+			sbi->s_mount_opt &= ~EXT2_MOUNT_DAX;
+		}
 	}
 
 	/* If the blocksize doesn't match, re-read the thing.. */


* [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (2 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 03/18] ext2: " Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-03 15:29   ` Jan Kara
                     ` (2 more replies)
  2017-12-24  0:56 ` [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax Dan Williams
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, hch, Gerald Schaefer

If a dax buffer from a device that does not map pages is passed to
read(2) or write(2) as a target for direct-I/O, it triggers SIGBUS. If
gdb attempts to examine the contents of a dax buffer from a device that
does not map pages, it triggers SIGBUS. If fork(2) is called on a
process with a dax mapping from a device that does not map pages, it
triggers SIGBUS. 'struct page' is required; otherwise several kernel
code paths break in surprising ways. Disable filesystem-dax on devices
that do not map pages.

In addition to needing pfn_to_page() to be valid, we also require devmap
pages. We need this to detect dax pages in the get_user_pages_fast()
path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
drivers that have not supported get_user_pages() to date, we allow them
to opt in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
option, which requires ->direct_access() to return pfn_t_special() pfns.
This leaves DAX support in brd disabled and scheduled for removal.
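
As one concrete illustration of the direct-I/O case (an illustrative
fragment, not code from this patch; the helper name is made up): the
block layer tracks I/O in units of struct page, so the direct-I/O path
needs get_user_pages() to hand back real pages that can be attached to
a bio, and a page-less dax mapping has nothing to pass to a call like
this:

#include <linux/bio.h>

/* attach one pinned dax page to a bio; returns 0 if the bio is full */
static int dax_page_to_bio(struct bio *bio, struct page *page,
		unsigned int len, unsigned int offset)
{
	return bio_add_page(bio, page, len, offset);
}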

Note that when the initial dax support was being merged a few years
back, there was concern that struct page was unsuitable for use with
next-generation persistent memory devices. The theoretical concern was
that struct page access, being such a hotly used data structure in the
kernel, would lead to media wear-out. While that was a reasonable,
conservative starting position, it has not held true in practice. We
have long since committed to using devm_memremap_pages() to support
higher-order kernel functionality that needs get_user_pages() and
pfn_to_page().

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/platforms/Kconfig |    1 +
 drivers/dax/super.c            |   10 ++++++++++
 drivers/s390/block/Kconfig     |    1 +
 fs/Kconfig                     |    7 +++++++
 4 files changed, 19 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 5a96a2763e4a..2ce89b42a9f4 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -297,6 +297,7 @@ config AXON_RAM
 	tristate "Axon DDR2 memory device driver"
 	depends on PPC_IBM_CELL_BLADE && BLOCK
 	select DAX
+	select FS_DAX_LIMITED
 	default m
 	help
 	  It registers one block device per Axon's DDR2 memory bank found
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 3ec804672601..473af694ad1c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/genhd.h>
+#include <linux/pfn_t.h>
 #include <linux/cdev.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
@@ -123,6 +124,15 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
+	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
+			|| pfn_t_devmap(pfn))
+		/* pass */;
+	else {
+		pr_debug("VFS (%s): error: dax support not enabled\n",
+				sb->s_id);
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__bdev_dax_supported);
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 31f014b57bfc..594ae5fc8e9d 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -15,6 +15,7 @@ config BLK_DEV_XPRAM
 config DCSSBLK
 	def_tristate m
 	select DAX
+	select FS_DAX_LIMITED
 	prompt "DCSSBLK support"
 	depends on S390 && BLOCK
 	help
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..b40128bf6d1a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,13 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+# Selected by DAX drivers that do not expect filesystem DAX to support
+# get_user_pages() of DAX mappings. I.e. "limited" indicates no support
+# for fork() of processes with MAP_SHARED mappings or support for
+# direct-I/O to a DAX mapping.
+config FS_DAX_LIMITED
+	bool
+
 endif # BLOCK
 
 # Posix ACL utility routines


* [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (3 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-03 15:27   ` Jan Kara
  2017-12-24  0:56 ` [PATCH v4 06/18] dax: stop using VM_HUGEPAGE " Dan Williams
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, jack, linux-nvdimm, linux-xfs, linux-fsdevel, hch,
	Kirill A. Shutemov

VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
the memory page they are dealing with is not typical memory from the
linear map. The get_user_pages_fast() path, since it does not resolve
the vma, is already using {pte,pmd}_devmap() as a stand-in for
VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some
locations. In the cases where there is no pte to consult, we fall back
to using vma_is_dax() to detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX, we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    2 +-
 fs/ext2/file.c       |    1 -
 fs/ext4/file.c       |    2 +-
 fs/xfs/xfs_file.c    |    2 +-
 include/linux/mm.h   |    1 +
 include/linux/vma.h  |   23 +++++++++++++++++++++++
 mm/huge_memory.c     |    6 ++----
 mm/ksm.c             |    3 +++
 mm/madvise.c         |    2 +-
 mm/memory.c          |    8 ++++++--
 mm/migrate.c         |    3 ++-
 mm/mlock.c           |    5 +++--
 mm/mmap.c            |    8 ++++----
 13 files changed, 48 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/vma.h

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 7b0bf825c4e7..c514ad48ff73 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -463,7 +463,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2da67699dc33..62c12c75b788 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -126,7 +126,6 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index a0ae27b1bc66..983cee466a89 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -367,7 +367,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..1d6d4a3ecd42 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1130,7 +1130,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09637c353de0..dc124b278173 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2405,6 +2405,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
 int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/include/linux/vma.h b/include/linux/vma.h
new file mode 100644
index 000000000000..e71487e8c5f0
--- /dev/null
+++ b/include/linux/vma.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2017 Intel Corporation. All rights reserved. */
+#ifndef __VMA_H__
+#define __VMA_H__
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/hugetlb_inline.h>
+
+/*
+ * There are several vma types that have special handling in the
+ * get_user_pages() path and other core mm paths that must not assume
+ * normal pages. vma_is_special() consolidates some common checks for
+ * VM_SPECIAL, hugetlb and dax vmas, but note that there are 'special'
+ * vmas and circumstances beyond these types. In other words this helper
+ * is not exhaustive for example this does not replace VM_PFNMAP checks.
+ */
+static inline bool vma_is_special(struct vm_area_struct *vma)
+{
+	return vma && (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)
+			|| vma_is_dax(vma));
+}
+#endif /* __VMA_H__ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..d1b891f27675 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,11 +765,10 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & VM_PFNMAP) || vm_mixed_ok(vma, pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
@@ -824,11 +823,10 @@ int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & VM_PFNMAP) || vm_mixed_ok(vma, pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/ksm.c b/mm/ksm.c
index be8f4576f842..0bd1fda485fd 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2372,6 +2372,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 				 VM_HUGETLB | VM_MIXEDMAP))
 			return 0;		/* just ignore the advice */
 
+		if (vma_is_dax(vma))
+			return 0;
+
 #ifdef VM_SAO
 		if (*vm_flags & VM_SAO)
 			return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 751e97aa2210..eff3ec1e2574 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -96,7 +96,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (new_flags & VM_SPECIAL) {
+		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
 			error = -EINVAL;
 			goto out;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 48a13473b401..1efb005e8fab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -831,6 +831,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
+		if (pte_devmap(pte))
+			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
 
@@ -918,6 +920,8 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
+	if (pmd_devmap(pmd))
+		return NULL;
 	if (is_zero_pfn(pfn))
 		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn))
@@ -1228,7 +1232,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * efficient than faulting.
 	 */
 	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-			!vma->anon_vma)
+			!vma->anon_vma && !vma_is_dax(vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(vma))
@@ -1897,7 +1901,7 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 {
 	/* these checks mirror the abort conditions in vm_normal_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..624d43a455be 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -46,6 +46,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/vma.h>
 
 #include <asm/tlbflush.h>
 
@@ -2938,7 +2939,7 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 	/* Sanity check the arguments */
 	start &= PAGE_MASK;
 	end &= PAGE_MASK;
-	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+	if (!vma || vma_is_special(vma))
 		return -EINVAL;
 	if (start < vma->vm_start || start >= vma->vm_end)
 		return -EINVAL;
diff --git a/mm/mlock.c b/mm/mlock.c
index 30472d438794..a10580f77c84 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/vma.h>
 
 #include "internal.h"
 
@@ -520,8 +521,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int lock = !!(newflags & VM_LOCKED);
 	vm_flags_t old_flags = vma->vm_flags;
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || vma_is_special(vma)
+			|| vma == get_gate_vma(current->mm))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index a4d546821214..b063e363cf27 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -45,6 +45,7 @@
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
 #include <linux/oom.h>
+#include <linux/vma.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1737,11 +1738,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
-		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
-			mm->locked_vm += (len >> PAGE_SHIFT);
-		else
+		if (vma_is_special(vma) || vma == get_gate_vma(current->mm))
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+		else
+			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
 
 	if (file)


* [PATCH v4 06/18] dax: stop using VM_HUGEPAGE for dax
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (4 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2017-12-24  0:56 ` [PATCH v4 07/18] dax: store pfns in the radix Dan Williams
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, linux-xfs, linux-fsdevel, hch

This flag is deprecated in favor of the vma_is_dax() check in
transparent_hugepage_enabled(), added in commit baabda261424 "mm: always
enable thp for dax mappings".

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    1 -
 fs/ext4/file.c       |    1 -
 fs/xfs/xfs_file.c    |    2 --
 3 files changed, 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index c514ad48ff73..4b663bd35f53 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -463,7 +463,6 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 983cee466a89..5eba87000b7f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -367,7 +367,6 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1d6d4a3ecd42..6df0c133a61e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1129,8 +1129,6 @@ xfs_file_mmap(
 
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
-	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 


* [PATCH v4 07/18] dax: store pfns in the radix
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (5 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 06/18] dax: stop using VM_HUGEPAGE " Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2017-12-27  0:17   ` Ross Zwisler
  2018-01-03 15:39   ` Jan Kara
  2017-12-24  0:56 ` [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, Matthew Wilcox, linux-xfs, linux-fsdevel, hch

In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |   15 ++++++++--
 fs/dax.c            |   75 ++++++++++++++++++---------------------------------
 2 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 473af694ad1c..516124ae1ccf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
-	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-			|| pfn_t_devmap(pfn))
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+		/*
+		 * An arch that has enabled the pmem api should also
+		 * have its drivers support pfn_t_devmap()
+		 *
+		 * This is a developer warning and should not trigger in
+		 * production. dax_flush() will crash since it depends
+		 * on being able to do (page_address(pfn_to_page())).
+		 */
+		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+	} else if (pfn_t_devmap(pfn)) {
 		/* pass */;
-	else {
+	} else {
 		pr_debug("VFS (%s): error: dax support not enabled\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 78b72c48374e..54071cd27e8c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -72,16 +72,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -525,12 +524,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags, bool dirty)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -547,7 +547,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -659,17 +659,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -688,7 +685,7 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * compare sectors as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -718,29 +715,11 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 	 * worry about partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -751,8 +730,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -810,8 +787,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -879,6 +856,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -886,14 +864,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						 dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						 0, write && !sync);
 		if (IS_ERR(entry)) {
 			error = PTR_ERR(entry);
@@ -1286,13 +1264,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
 	if (IS_ERR(ret))
 		goto fallback;
@@ -1415,8 +1395,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
-		entry = dax_insert_mapping_entry(mapping, vmf, entry,
-						dax_iomap_sector(&iomap, pos),
+		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 						RADIX_DAX_PMD, write && !sync);
 		if (IS_ERR(entry))
 			goto finish_iomap;


* [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (6 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 07/18] dax: store pfns in the radix Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2017-12-27 18:08   ` Ross Zwisler
  2018-01-02 21:44   ` Dave Chinner
  2017-12-24  0:56 ` [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, linux-xfs, linux-fsdevel, hch

In support of testing truncate colliding with dma, add a mechanism that
delays the completion of block I/O requests by a programmable number of
seconds. This allows a truncate operation to be issued while page
references are held for direct-I/O.
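
A test would arm the delay, then start the direct-I/O and issue the
truncate inside the delay window. A minimal sketch of the user-space
side, assuming the knob appears under the nfit_test platform driver's
sysfs directory (the exact path is an assumption of this example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* assumed sysfs location of the bio_delay driver attribute */
#define BIO_DELAY_ATTR "/sys/bus/platform/drivers/nfit_test/bio_delay"

static int arm_bio_delay(int seconds)
{
	char buf[16];
	int fd, len;

	fd = open(BIO_DELAY_ATTR, O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%d", seconds);
	if (write(fd, buf, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}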

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 tools/testing/nvdimm/Kbuild           |    1 +
 tools/testing/nvdimm/test/iomap.c     |   62 +++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit.c      |   34 ++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |    1 +
 4 files changed, 98 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index db33b28c5ef3..afc070c961cd 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -16,6 +16,7 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=bio_endio
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ff9d3a5825e1..dd90060c0004 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/workqueue.h>
 #include <linux/memremap.h>
 #include <linux/rculist.h>
 #include <linux/export.h>
@@ -18,6 +19,7 @@
 #include <linux/types.h>
 #include <linux/pfn_t.h>
 #include <linux/acpi.h>
+#include <linux/bio.h>
 #include <linux/io.h>
 #include <linux/mm.h>
 #include "nfit_test.h"
@@ -387,4 +389,64 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+static DEFINE_SPINLOCK(bio_lock);
+static struct bio *biolist;
+int bio_do_queue;
+
+static void run_bio(struct work_struct *work)
+{
+	struct delayed_work *dw = container_of(work, typeof(*dw), work);
+	struct bio *bio, *next;
+
+	pr_info("%s\n", __func__);
+	spin_lock(&bio_lock);
+	bio_do_queue = 0;
+	bio = biolist;
+	biolist = NULL;
+	spin_unlock(&bio_lock);
+
+	while (bio) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		bio_endio(bio);
+		bio = next;
+	}
+	kfree(dw);
+}
+
+void nfit_test_inject_bio_delay(int sec)
+{
+	struct delayed_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);
+
+	spin_lock(&bio_lock);
+	if (!bio_do_queue) {
+		pr_info("%s: %d seconds\n", __func__, sec);
+		INIT_DELAYED_WORK(dw, run_bio);
+		bio_do_queue = 1;
+		schedule_delayed_work(dw, sec * HZ);
+		dw = NULL;
+	}
+	spin_unlock(&bio_lock);
+}
+EXPORT_SYMBOL_GPL(nfit_test_inject_bio_delay);
+
+void __wrap_bio_endio(struct bio *bio)
+{
+	int did_q = 0;
+
+	spin_lock(&bio_lock);
+	if (bio_do_queue) {
+		bio->bi_next = biolist;
+		biolist = bio;
+		did_q = 1;
+	}
+	spin_unlock(&bio_lock);
+
+	if (did_q)
+		return;
+
+	bio_endio(bio);
+}
+EXPORT_SYMBOL_GPL(__wrap_bio_endio);
+
 MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index 7217b2b953b5..9362b01e9a8f 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -872,6 +872,39 @@ static const struct attribute_group *nfit_test_dimm_attribute_groups[] = {
 	NULL,
 };
 
+static ssize_t bio_delay_show(struct device_driver *drv, char *buf)
+{
+	return sprintf(buf, "0\n");
+}
+
+static ssize_t bio_delay_store(struct device_driver *drv, const char *buf,
+		size_t count)
+{
+	unsigned long delay;
+	int rc = kstrtoul(buf, 0, &delay);
+
+	if (rc < 0)
+		return rc;
+
+	nfit_test_inject_bio_delay(delay);
+	return count;
+}
+DRIVER_ATTR_RW(bio_delay);
+
+static struct attribute *nfit_test_driver_attributes[] = {
+	&driver_attr_bio_delay.attr,
+	NULL,
+};
+
+static struct attribute_group nfit_test_driver_attribute_group = {
+	.attrs = nfit_test_driver_attributes,
+};
+
+static const struct attribute_group *nfit_test_driver_attribute_groups[] = {
+	&nfit_test_driver_attribute_group,
+	NULL,
+};
+
 static int nfit_test0_alloc(struct nfit_test *t)
 {
 	size_t nfit_size = sizeof(struct acpi_nfit_system_address) * NUM_SPA
@@ -2151,6 +2184,7 @@ static struct platform_driver nfit_test_driver = {
 	.remove = nfit_test_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
+		.groups = nfit_test_driver_attribute_groups,
 	},
 	.id_table = nfit_test_id,
 };
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
index 113b44675a71..744740a76dee 100644
--- a/tools/testing/nvdimm/test/nfit_test.h
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -98,4 +98,5 @@ void nfit_test_setup(nfit_test_lookup_fn lookup,
 		nfit_test_evaluate_dsm_fn evaluate);
 void nfit_test_teardown(void);
 struct nfit_test_resource *get_nfit_res(resource_size_t resource);
+void nfit_test_inject_bio_delay(int sec);
 #endif


* [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (7 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-04  8:20   ` Christoph Hellwig
  2017-12-24  0:56 ` [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, jack, linux-nvdimm, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

In order to resolve collisions between filesystem operations and DMA to
DAX-mapped pages, we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
go to a device driver to manage bounce pages in device memory, in the
filesystem-dax case we will call back to a filesystem-specified
callback.

Since the callback is not known at devm_memremap_pages() time, we
arrange for the filesystem to install it at mount time. No functional
changes are expected, as this only registers a nop handler for the
->page_free() event for device-mapped pages.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |   79 ++++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c    |    3 +-
 fs/ext2/super.c          |    6 ++-
 fs/ext4/super.c          |    6 ++-
 fs/xfs/xfs_super.c       |   20 ++++++------
 include/linux/dax.h      |   17 +++++-----
 include/linux/memremap.h |    8 +++++
 7 files changed, 103 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 516124ae1ccf..e926e373a3a5 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,66 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+	/* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+			|| pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -505,6 +553,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index cf074b1ce219..bbe3044c1b26 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -407,7 +407,8 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops,
+			&pmem->pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 38f9222606ee..b0d6d9954945 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -171,7 +171,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -814,7 +814,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1210,7 +1210,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 18873ea89e08..238cad596733 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -950,7 +950,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3396,7 +3396,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4406,7 +4406,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5122d3021117..8ff821f3bcfb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -722,7 +722,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -730,10 +730,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -751,9 +751,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -763,7 +763,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -777,7 +777,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -811,14 +811,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5258346c558c..1c6ed44fe9fc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -51,12 +51,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return dax_get_by_host(host);
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -68,13 +64,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return NULL;
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
 {
+	return NULL;
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
-	return NULL;
 }
 #endif
 
@@ -82,6 +79,8 @@ int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..02d6d042ee7f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,19 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * When MEMORY_DEVICE_HOST memory is represented by a device that can
+ * host a filesystem, for example /dev/pmem0, that filesystem can
+ * register for a callback when a page is idled. For the filesystem-dax
+ * case page idle callbacks are used to coordinate DMA vs
+ * hole-punch/truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*


* [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (8 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2018-01-04  8:25   ` Christoph Hellwig
  2017-12-24  0:56 ` [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS Dan Williams
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, jack, linux-nvdimm, linux-xfs,
	Jérôme Glisse, linux-fsdevel, hch

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
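For reference, a minimal sketch (not part of this patch) of how a
subsystem that installs a dev_pagemap ->page_free callback is expected
to pair the new enable/disable calls; the example_* names are
hypothetical and the rest of the pgmap setup is elided:

#include <linux/memremap.h>

/* hypothetical callback run when a devmap-managed page goes idle */
static void example_page_free(struct page *page, void *data)
{
	/* notify whoever is waiting on this page */
}

static void example_install(struct dev_pagemap *pgmap, void *owner)
{
	/* opt in to the put_page() slow path via the static key */
	dev_pagemap_enable_ops();
	/* put_page() only honors ->page_free for these pgmap types */
	pgmap->type = MEMORY_DEVICE_FS_DAX;
	pgmap->page_free = example_page_free;
	pgmap->data = owner;
}

static void example_uninstall(struct dev_pagemap *pgmap)
{
	pgmap->type = MEMORY_DEVICE_HOST;
	pgmap->page_free = NULL;
	pgmap->data = NULL;
	/* drop this user's vote for the static key */
	dev_pagemap_disable_ops();
}
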
 drivers/dax/super.c      |    2 ++
 fs/Kconfig               |    1 +
 include/linux/memremap.h |   20 ++-------------
 include/linux/mm.h       |   61 ++++++++++++++++++++++++++++++++--------------
 kernel/memremap.c        |   30 ++++++++++++++++++++---
 mm/Kconfig               |    5 ++++
 mm/hmm.c                 |   13 ++--------
 mm/swap.c                |    3 ++
 8 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e926e373a3a5..0352a098b099 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
 		return NULL;
 	}
 
+	dev_pagemap_enable_ops();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	dev_pagemap_disable_ops();
 	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index b40128bf6d1a..73fe372c00db 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..60c7608379ca 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -130,6 +129,9 @@ struct dev_pagemap {
 	enum memory_type type;
 };
 
+void dev_pagemap_enable_ops(void);
+void dev_pagemap_disable_ops(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
@@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc124b278173..fda3d7dcddc3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
 }
-#define IS_HMM_ENABLED 0
+
 static inline bool is_device_private_page(const struct page *page)
 {
-	return false;
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool put_devmap_managed_page(struct page *page)
+{
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -850,16 +878,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch the refcount transition
+	 * from 2 to 1; when the refcount reaches one it means the page is free
+	 * and we need to inform the device driver through a callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index c04000361664..bad64045b546 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -467,8 +467,30 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_enable_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL(dev_pagemap_enable_ops);
+
+void dev_pagemap_disable_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL(dev_pagemap_disable_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -488,5 +510,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index 03ff7703d322..f8cf32411e1a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -705,6 +705,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -725,6 +728,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -735,6 +739,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index ee75b2923dde..a27b4ae12823 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -998,7 +989,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_enable_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1092,7 +1083,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_enable_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 38e1b6374a97..f2acaa93637b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -772,7 +773,7 @@ void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 


* [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (9 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2017-12-24  0:56 ` Dan Williams
  2017-12-27  5:29   ` Matthew Wilcox
  2018-01-02 21:41   ` Dave Chinner
  2017-12-24  0:57 ` [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS Dan Williams
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:56 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, Matthew Wilcox, linux-xfs, linux-fsdevel, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

A DEFINE_FSDAX_AOPS macro helper is provided instead of a global 'struct
address_space_operations fs_dax_aops' instance, because ->writepages
needs to be an fs-specific implementation.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
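For illustration only (the xfs, ext4 and ext2 patches later in the
series do this for real), a sketch of how a filesystem might consume
the macro; the foofs_* names are hypothetical:

#include <linux/dax.h>
#include <linux/fs.h>

extern const struct address_space_operations foofs_aops;
extern int foofs_writepages(struct address_space *mapping,
		struct writeback_control *wbc);

/* expands to 'const struct address_space_operations foofs_dax_aops' */
static DEFINE_FSDAX_AOPS(foofs_dax_aops, foofs_writepages);

static void foofs_set_aops(struct inode *inode)
{
	if (IS_DAX(inode))
		inode->i_mapping->a_ops = &foofs_dax_aops;
	else
		inode->i_mapping->a_ops = &foofs_aops;
}
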
 fs/dax.c            |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   32 ++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 54071cd27e8c..fadc1b13838b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -45,6 +45,75 @@
 /* The 'colour' (ie low bits) within a PMD of a page offset.  */
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 
+int dax_set_page_dirty(struct page *page)
+{
+	/*
+	 * Unlike __set_page_dirty_no_writeback, dax does all dirty
+	 * tracking in the radix in response to mkwrite faults.
+	 */
+	return 0;
+}
+EXPORT_SYMBOL(dax_set_page_dirty);
+
+ssize_t dax_direct_IO(struct kiocb *kiocb, struct iov_iter *iter)
+{
+	/*
+	 * The expectation is that filesystems that implement DAX
+	 * support also arrange for ->read_iter and ->write_iter to
+	 * bypass ->direct_IO.
+	 */
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_direct_IO);
+
+int dax_writepage(struct page *page, struct writeback_control *wbc)
+{
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_writepage);
+
+int dax_readpage(struct file *filp, struct page *page)
+{
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_readpage);
+
+int dax_readpages(struct file *filp, struct address_space *mapping,
+		struct list_head *pages, unsigned nr_pages)
+{
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_readpages);
+
+int dax_write_begin(struct file *filp, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_write_begin);
+
+int dax_write_end(struct file *filp, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	WARN_ONCE(1, "dax: incomplete fs implementation\n");
+	return -EINVAL;
+}
+EXPORT_SYMBOL(dax_write_end);
+
+void dax_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length)
+{
+	/* nothing to do for dax */
+}
+EXPORT_SYMBOL(dax_invalidatepage);
+
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 1c6ed44fe9fc..3502abcbea31 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -53,6 +53,34 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
 void fs_dax_release(struct dax_device *dax_dev, void *owner);
+int dax_set_page_dirty(struct page *page);
+ssize_t dax_direct_IO(struct kiocb *kiocb, struct iov_iter *iter);
+int dax_writepage(struct page *page, struct writeback_control *wbc);
+int dax_readpage(struct file *filp, struct page *page);
+int dax_readpages(struct file *filp, struct address_space *mapping,
+		struct list_head *pages, unsigned nr_pages);
+int dax_write_begin(struct file *filp, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata);
+int dax_write_end(struct file *filp, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata);
+void dax_invalidatepage(struct page *page, unsigned int offset,
+		unsigned int length);
+
+#define DEFINE_FSDAX_AOPS(name, writepages_fn)	\
+const struct address_space_operations name = {	\
+	.set_page_dirty = dax_set_page_dirty,	\
+	.direct_IO = dax_direct_IO,	\
+	.writepage = dax_writepage,	\
+	.readpage = dax_readpage,	\
+	.writepages = writepages_fn,	\
+	.readpages = dax_readpages,	\
+	.write_begin = dax_write_begin,	\
+	.write_end = dax_write_end,	\
+	.invalidatepage = dax_invalidatepage, \
+}
+
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -73,6 +101,10 @@ static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
 static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
 }
+
+#define DEFINE_FSDAX_AOPS(name, writepages_fn)	\
+const struct address_space_operations name = { 0 }
+
 #endif
 
 int dax_read_lock(void);


* [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (10 preceding siblings ...)
  2017-12-24  0:56 ` [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-02 21:15   ` Darrick J. Wong
  2018-01-04  8:28   ` Christoph Hellwig
  2017-12-24  0:57 ` [PATCH v4 13/18] ext4: " Dan Williams
                   ` (6 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm; +Cc: jack, Darrick J. Wong, linux-nvdimm, linux-xfs, linux-fsdevel, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_aops.c |    2 ++
 fs/xfs/xfs_aops.h |    1 +
 fs/xfs/xfs_iops.c |    5 ++++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 21e2d70884e1..361915d53cef 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1492,3 +1492,5 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
 };
+
+DEFINE_FSDAX_AOPS(xfs_dax_address_space_operations, xfs_vm_writepages);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..a6ffbb5fe379 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_address_space_operations;
 
 int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..67bd97edc73b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		inode->i_mapping->a_ops = &xfs_address_space_operations;
+		if (IS_DAX(inode))
+			inode->i_mapping->a_ops = &xfs_dax_address_space_operations;
+		else
+			inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))


* [PATCH v4 13/18] ext4: use DEFINE_FSDAX_AOPS
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (11 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-04  8:29   ` Christoph Hellwig
  2017-12-24  0:57 ` [PATCH v4 14/18] ext2: " Dan Williams
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm
  Cc: jack, linux-nvdimm, linux-xfs, Andreas Dilger, linux-fsdevel,
	Theodore Ts'o, linux-ext4, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..065d11f43cb3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3928,6 +3928,8 @@ static const struct address_space_operations ext4_da_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static DEFINE_FSDAX_AOPS(ext4_dax_aops, ext4_writepages);
+
 void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
@@ -3940,7 +3942,9 @@ void ext4_set_aops(struct inode *inode)
 	default:
 		BUG();
 	}
-	if (test_opt(inode->i_sb, DELALLOC))
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext4_dax_aops;
+	else if (test_opt(inode->i_sb, DELALLOC))
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;


* [PATCH v4 14/18] ext2: use DEFINE_FSDAX_AOPS
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (12 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 13/18] ext4: " Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-04  8:29   ` Christoph Hellwig
  2017-12-24  0:57 ` [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate Dan Williams
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, linux-xfs, Jan Kara, linux-fsdevel, hch

In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext2/ext2.h  |    1 +
 fs/ext2/inode.c |   23 +++++++++++++++--------
 fs/ext2/namei.c |   18 ++----------------
 3 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..d1f4546e0028 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -990,6 +990,8 @@ const struct address_space_operations ext2_nobh_aops = {
 	.error_remove_page	= generic_error_remove_page,
 };
 
+static DEFINE_FSDAX_AOPS(ext2_dax_aops, ext2_writepages);
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1390,18 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+	inode->i_op = &ext2_file_inode_operations;
+	inode->i_fop = &ext2_file_operations;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &ext2_dax_aops;
+	else if (test_opt(inode->i_sb, NOBH))
+		inode->i_mapping->a_ops = &ext2_nobh_aops;
+	else
+		inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
 	struct ext2_inode_info *ei;
@@ -1480,14 +1494,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 		ei->i_data[n] = raw_inode->i_block[n];
 
 	if (S_ISREG(inode->i_mode)) {
-		inode->i_op = &ext2_file_inode_operations;
-		if (test_opt(inode->i_sb, NOBH)) {
-			inode->i_mapping->a_ops = &ext2_nobh_aops;
-			inode->i_fop = &ext2_file_operations;
-		} else {
-			inode->i_mapping->a_ops = &ext2_aops;
-			inode->i_fop = &ext2_file_operations;
-		}
+		ext2_set_file_ops(inode);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext2_dir_inode_operations;
 		inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	return ext2_add_nondir(dentry, inode);
 }
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (IS_ERR(inode))
 		return PTR_ERR(inode);
 
-	inode->i_op = &ext2_file_inode_operations;
-	if (test_opt(inode->i_sb, NOBH)) {
-		inode->i_mapping->a_ops = &ext2_nobh_aops;
-		inode->i_fop = &ext2_file_operations;
-	} else {
-		inode->i_mapping->a_ops = &ext2_aops;
-		inode->i_fop = &ext2_file_operations;
-	}
+	ext2_set_file_ops(inode);
 	mark_inode_dirty(inode);
 	d_tmpfile(dentry, inode);
 	unlock_new_inode(inode);


* [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (13 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 14/18] ext2: " Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-04  8:30   ` Christoph Hellwig
  2018-01-04  9:39   ` Jan Kara
  2017-12-24  0:57 ` [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
                   ` (3 subsequent siblings)
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm; +Cc: jack, linux-nvdimm, Matthew Wilcox, linux-xfs, linux-fsdevel, hch

Catch cases where truncate encounters pages that are still under active
dma. This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
perform i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
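For context, a rough userspace sketch (not the in-tree reproducer) of
the kind of collision this warning catches: a mapping of a dax file is
pinned by get_user_pages() as the target buffer of an O_DIRECT read
while another thread punches a hole over the same range. Paths, sizes
and error handling are made up / elided:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <linux/falloc.h>
#include <unistd.h>

static char *buf;	/* mmap of a file on a '-o dax' mount */
static int data_fd;	/* O_DIRECT fd on some other file */

static void *do_dio(void *arg)
{
	/* the dio path pins the dax pages backing 'buf' */
	pread(data_fd, buf, 1 << 20, 0);
	return NULL;
}

int main(void)
{
	int dax_fd = open("/mnt/dax/file", O_RDWR);
	pthread_t t;

	data_fd = open("/mnt/data/file", O_RDONLY | O_DIRECT);
	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED,
			dax_fd, 0);
	pthread_create(&t, NULL, do_dio, NULL);
	/* racing hole-punch over the pinned range fires the new WARN */
	fallocate(dax_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			0, 1 << 20);
	pthread_join(t, NULL);
	return 0;
}
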
 fs/dax.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index fadc1b13838b..7d9fff8a1195 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -366,6 +366,56 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->mapping);
+		page->mapping = mapping;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		page->mapping = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -472,6 +522,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -521,6 +572,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -617,6 +669,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, mapping, false);
+		dax_associate_entry(new_entry, mapping);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*


* [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (14 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-04  8:30   ` Christoph Hellwig
  2017-12-24  0:57 ` [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions Dan Williams
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm
  Cc: jack, linux-nvdimm, Peter Zijlstra, linux-xfs, Ingo Molnar,
	linux-fsdevel, hch

Add a generic facility for awaiting an atomic_t to reach a value of one.

Page reference counts typically need to reach zero to be considered a
free / inactive page. However, ZONE_DEVICE pages allocated via
devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
done at init time to assign pages to the page allocator is skipped.

These pages will have their reference count elevated > 1 by
get_user_pages() when they are under DMA. In order to coordinate DMA to
these pages vs filesystem operations like hole-punch and truncate, the
filesystem-dax implementation needs to capture the DMA-idle event (the
2 to 1 count transition).

For now, this implementation does not introduce any functional behavior
change; follow-on patches will add waiters for these page-idle events.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
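A minimal sketch of how the waiter and waker sides are expected to pair
up in the follow-on patches: the waiter sleeps until page->_refcount
drops back to one, and the dev_pagemap ->page_free callback (which fires
at that transition) issues the wake-up. atomic_t_wait() is used here as
the simplest possible action; real callers supply their own:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/wait_bit.h>

/* block until no DMA references remain (ZONE_DEVICE idle == count 1) */
static int example_wait_for_page_idle(struct page *page)
{
	return wait_on_atomic_one(&page->_refcount, atomic_t_wait,
			TASK_INTERRUPTIBLE);
}

/* dev_pagemap ->page_free callback, i.e. the waker side */
static void example_page_idle(struct page *page, void *data)
{
	wake_up_atomic_one(&page->_refcount);
}
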
 drivers/dax/super.c      |    2 +-
 include/linux/wait_bit.h |   13 ++++++++++
 kernel/sched/wait_bit.c  |   59 +++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0352a098b099..85a56f849b0c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-	/* TODO: wakeup page-idle waiters */
+	wake_up_atomic_one(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 61b39eaf7cad..564c9a0141cd 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -33,10 +33,15 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+static inline void wake_up_atomic_one(atomic_t *p)
+{
+	wake_up_atomic_t(p);
+}
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
+int out_of_line_wait_on_atomic_one(atomic_t *p, wait_atomic_t_action_f action, unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -262,4 +267,12 @@ int wait_on_atomic_t(atomic_t *val, wait_atomic_t_action_f action, unsigned mode
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_atomic_one(atomic_t *val, wait_atomic_t_action_f action, unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_atomic_one(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 84cb3acd9260..8739b1e50df5 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,28 +162,47 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int __wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg, int target)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
 	    wait_bit->key.bit_nr != key->bit_nr ||
-	    atomic_read(val) != 0)
+	    atomic_read(val) != target)
 		return 0;
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 0);
+}
+
+static int wake_atomic_one_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	return __wake_atomic_t_function(wq_entry, mode, sync, arg, 1);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       wait_atomic_t_action_f action, unsigned int mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		wait_atomic_t_action_f action, unsigned int mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +210,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val, mode);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,6 +229,17 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_ATOMIC_ONE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_atomic_one_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 					 wait_atomic_t_action_f action,
 					 unsigned int mode)
@@ -217,7 +247,7 @@ __sched int out_of_line_wait_on_atomic_t(atomic_t *p,
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
@@ -230,6 +260,17 @@ __sched int atomic_t_wait(atomic_t *counter, unsigned int mode)
 }
 EXPORT_SYMBOL(atomic_t_wait);
 
+__sched int out_of_line_wait_on_atomic_one(atomic_t *p,
+					   wait_atomic_t_action_f action,
+					   unsigned int mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_ATOMIC_ONE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_atomic_one);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address


* [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (15 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-04  8:31   ` Christoph Hellwig
  2018-01-04 11:12   ` Jan Kara
  2017-12-24  0:57 ` [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper Dan Williams
  2018-01-04  8:17 ` [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Christoph Hellwig
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm
  Cc: jack, Matthew Wilcox, linux-nvdimm, Dave Hansen, Dave Chinner,
	hch, linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong

Background:

get_user_pages() pins file backed memory pages for access by dma
devices. However, it only pins the memory pages not the page-to-file
offset association. If a file is truncated the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency of the file vs dma, but the
assumption is that if userspace wants the file-space truncated it does
not matter what data is inbound from the device, it is not relevant
anymore. The only expectation is that dma can safely continue while the
filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_flush_dma() routine is intended to be called by filesystems with
locks held against mm faults (i_mmap_lock). It then invalidates all
mappings to trigger any subsequent get_user_pages() to block on
i_mmap_lock. Finally it scans/rescans all pages in the mapping until it
observes them all idle.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
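A sketch of the expected calling convention (patch 18 wires this up for
xfs); the foofs_* lock helpers are hypothetical stand-ins for the
filesystem's mmap lock, and error handling is elided:

#include <linux/dax.h>
#include <linux/sched/signal.h>

extern void foofs_mmap_lock(void);
extern void foofs_mmap_unlock(void);

/*
 * Action callback for dax_flush_dma(): drop the fs mmap lock so faults
 * and get_user_pages() can make progress, sleep until the page-idle
 * wake-up, retake the lock, and return 1 so dax_flush_dma() rescans.
 */
static int foofs_wait_dax_page(atomic_t *count, unsigned int mode)
{
	struct page *page = refcount_to_page(count);

	if (page_ref_count(page) == 1)
		return 0;	/* already idle, locks were not dropped */

	foofs_mmap_unlock();
	schedule();
	foofs_mmap_lock();

	if (fatal_signal_pending(current))
		return -EINTR;
	return 1;
}

/* called with the fs mmap lock held against new faults */
static int foofs_break_dax_dma(struct address_space *mapping)
{
	return dax_flush_dma(mapping, foofs_wait_dax_page);
}
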
 fs/dax.c            |   95 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   27 ++++++++++++++
 mm/gup.c            |    5 +++
 3 files changed, 127 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 7d9fff8a1195..eed589bf833e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -416,6 +416,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -557,6 +570,87 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action)
+{
+	pgoff_t	indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	pgoff_t	index, end;
+	unsigned i;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return 0;
+
+	if (!dax_mapping(mapping))
+		return 0;
+
+	if (mapping->nrexceptional == 0)
+		return 0;
+
+retry:
+	pagevec_init(&pvec);
+	index = 0;
+	end = -1;
+	unmap_mapping_range(mapping, 0, 0, 1);
+	/*
+	 * Flush dax_dma_lock() sections to ensure all possible page
+	 * references have been taken, or will block on the fs
+	 * 'mmap_lock'.
+	 */
+	synchronize_rcu();
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		int rc = 0;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (!page)
+				continue;
+			rc = wait_on_atomic_one(&page->_refcount, action,
+					TASK_INTERRUPTIBLE);
+			if (rc == 0)
+				continue;
+			break;
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+
+		if (rc < 0)
+			return rc;
+		if (rc == 0) {
+			cond_resched();
+			continue;
+		}
+
+		/*
+		 * We have dropped fs locks, so we need to revalidate
+		 * that previously seen idle pages are still idle.
+		 */
+		goto retry;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_flush_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -581,6 +675,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3502abcbea31..ccd6aed90f95 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -81,6 +81,15 @@ const struct address_space_operations name = {	\
 	.invalidatepage = dax_invalidatepage, \
 }
 
+static inline void dax_dma_lock(void)
+{
+	rcu_read_lock();
+}
+
+static inline void dax_dma_unlock(void)
+{
+	rcu_read_unlock();
+}
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -105,6 +114,13 @@ static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 #define DEFINE_FSDAX_AOPS(name, writepages_fn)	\
 const struct address_space_operations name = { 0 }
 
+static inline void dax_dma_lock(void)
+{
+}
+
+static inline void dax_dma_unlock(void)
+{
+}
 #endif
 
 int dax_read_lock(void);
@@ -134,11 +150,22 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
+static inline struct page *refcount_to_page(atomic_t *c)
+{
+	return container_of(c, struct page, _refcount);
+}
+
 #ifdef CONFIG_FS_DAX
+int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline int dax_flush_dma(struct address_space *mapping,
+		wait_atomic_t_action_f action)
+{
+	return 0;
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/mm/gup.c b/mm/gup.c
index 9d142eb9e2e9..a8f5e13f7d17 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,7 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/dax.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -693,7 +694,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		dax_dma_lock();
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_dma_unlock();
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,
@@ -1825,7 +1828,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
+		dax_dma_lock();
 		gup_pgd_range(addr, end, write, pages, &nr);
+		dax_dma_unlock();
 		local_irq_enable();
 		ret = nr;
 	}


* [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (16 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions Dan Williams
@ 2017-12-24  0:57 ` Dan Williams
  2018-01-02 21:07   ` Darrick J. Wong
  2018-01-02 23:00   ` Dave Chinner
  2018-01-04  8:17 ` [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Christoph Hellwig
  18 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-12-24  0:57 UTC (permalink / raw)
  To: akpm
  Cc: jack, linux-nvdimm, Darrick J. Wong, Dave Chinner, linux-xfs,
	linux-fsdevel, hch

xfs_break_layouts() scans for active pNFS layouts, drops locks and
rescans for those layouts to be broken. xfs_sync_dma performs
xfs_break_layouts and also scans for active dax-dma pages, drops locks
and rescans for those pages to go idle.

dax_flush_dma handles synchronizing against new page-busy events
(get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the
XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
callback that will fire when the page reference count reaches 1 (recall
ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
locks so we do not deadlock the process that might be trying to elevate
the page count of more pages before arranging for any of them to go idle,
as is typically the case with iov_iter_get_pages.

dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
(xfs_wait_dax_page) to handle evaluating the page reference count and
dropping locks when waiting.
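
For reference, a simplified sketch of the retry loop that consumes this
callback (not the actual fs/dax.c implementation; dax_busy_page() is a
made-up helper standing in for the pagevec scan, and the
wait_on_atomic_one() signature is assumed to mirror wait_on_atomic_t()):

int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action)
{
	struct page *page;
	int rc;

retry:
	/* force new get_user_pages() attempts into the slow path */
	unmap_mapping_range(mapping, 0, 0, 1);
	while ((page = dax_busy_page(mapping)) != NULL) {
		/* sleep via 'action' until page->_refcount drops to 1 */
		rc = wait_on_atomic_one(&page->_refcount, action,
				TASK_INTERRUPTIBLE);
		if (rc < 0)
			return rc;	/* e.g. -EINTR from the callback */
		if (rc > 0)
			goto retry;	/* locks were dropped, rescan */
	}
	return 0;
}

The important property is that any positive return from the callback
forces a full rescan, because the filesystem locks were dropped while
sleeping.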

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/Makefile    |    3 ++
 fs/xfs/xfs_dma.c   |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_dma.h   |   24 +++++++++++++++
 fs/xfs/xfs_file.c  |    6 ++--
 fs/xfs/xfs_ioctl.c |    7 ++--
 fs/xfs/xfs_iops.c  |    7 +++-
 6 files changed, 118 insertions(+), 10 deletions(-)
 create mode 100644 fs/xfs/xfs_dma.c
 create mode 100644 fs/xfs/xfs_dma.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 7ceb41a9786a..f2cdc5a3eb6c 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -129,6 +129,9 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
 				   xfs_qm.o \
 				   xfs_quotaops.o
 
+# dax dma
+xfs-$(CONFIG_FS_DAX)		+= xfs_dma.o
+
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
 
diff --git a/fs/xfs/xfs_dma.c b/fs/xfs/xfs_dma.c
new file mode 100644
index 000000000000..3df1a51a76c4
--- /dev/null
+++ b/fs/xfs/xfs_dma.c
@@ -0,0 +1,81 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ */
+#include <linux/dax.h>
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_shared.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_pnfs.h"
+
+/*
+ * xfs_wait_dax_page - helper for dax_flush_dma to drop locks and sleep
+ * waiting for a page idle event. Returns 0 if the locks did not need to
+ * be dropped and the page is idle, returns -EINTR if the sleep was
+ * interrupted, and returns 1 when it slept. dax_flush_dma()
+ * retries/rescans all mappings when the lock is dropped.
+ */
+static int xfs_wait_dax_page(
+	atomic_t		*count,
+	unsigned int		mode)
+{
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
+	struct page 		*page = refcount_to_page(count);
+	struct address_space	*mapping = page->mapping;
+	struct inode		*inode = mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	xfs_iunlock(ip, iolock);
+	schedule();
+	xfs_ilock(ip, iolock);
+
+	if (signal_pending_state(mode, current))
+		return -EINTR;
+	return 1;
+}
+
+/*
+ * Synchronize [R]DMA before changing the file's block map. For pNFS,
+ * recall all layouts. For DAX, wait for transient DMA to complete. All
+ * other DMA is handled by pinning page cache pages.
+ *
+ * iolock must be held as XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
+ * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
+ */
+int xfs_sync_dma(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	while (true) {
+		error = xfs_break_layouts(inode, iolock);
+		if (error)
+			break;
+
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		*iolock |= XFS_MMAPLOCK_EXCL;
+
+		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
+		if (error <= 0)
+			break;
+		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+		*iolock &= ~XFS_MMAPLOCK_EXCL;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_dma.h b/fs/xfs/xfs_dma.h
new file mode 100644
index 000000000000..29635639b073
--- /dev/null
+++ b/fs/xfs/xfs_dma.h
@@ -0,0 +1,24 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ */
+#ifndef __XFS_DMA__
+#define __XFS_DMA__
+#ifdef CONFIG_FS_DAX
+int xfs_sync_dma(struct inode *inode, uint *iolock);
+#else
+#include "xfs_pnfs.h"
+
+static inline int xfs_sync_dma(struct inode *inode, uint *iolock)
+{
+	int error = xfs_break_layouts(inode, iolock);
+
+	if (error)
+		return error;
+
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_EXCL);
+	*iolock |= XFS_MMAPLOCK_EXCL;
+	return 0;
+}
+#endif /* CONFIG_FS_DAX */
+#endif /* __XFS_DMA__ */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6df0c133a61e..84fc178da656 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -37,6 +37,7 @@
 #include "xfs_log.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include "xfs_dma.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
@@ -778,12 +779,11 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_sync_dma(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
+	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
 
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 20dc65fef6a4..4340bef658b0 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -39,7 +39,7 @@
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
 #include "xfs_trans.h"
-#include "xfs_pnfs.h"
+#include "xfs_dma.h"
 #include "xfs_acl.h"
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
@@ -643,12 +643,11 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_sync_dma(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
+	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 67bd97edc73b..c1055337b233 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -37,7 +37,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
-#include "xfs_pnfs.h"
+#include "xfs_dma.h"
 #include "xfs_iomap.h"
 
 #include <linux/capability.h>
@@ -1030,11 +1030,12 @@ xfs_vn_setattr(
 		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
 		uint			iolock = XFS_IOLOCK_EXCL;
 
-		error = xfs_break_layouts(d_inode(dentry), &iolock);
+		error = xfs_sync_dma(d_inode(dentry), &iolock);
 		if (error)
 			return error;
 
-		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
+
 		error = xfs_vn_setattr_size(dentry, iattr);
 		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
 	} else {

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 07/18] dax: store pfns in the radix
  2017-12-24  0:56 ` [PATCH v4 07/18] dax: store pfns in the radix Dan Williams
@ 2017-12-27  0:17   ` Ross Zwisler
  2018-01-02 20:15     ` Dan Williams
  2018-01-03 15:39   ` Jan Kara
  1 sibling, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-12-27  0:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:38PM -0800, Dan Williams wrote:
> In preparation for examining the busy state of dax pages in the truncate
> path, switch from sectors to pfns in the radix.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/super.c |   15 ++++++++--
>  fs/dax.c            |   75 ++++++++++++++++++---------------------------------
>  2 files changed, 39 insertions(+), 51 deletions(-)
<>
> @@ -688,7 +685,7 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * compare sectors as we must not bail out due to difference in lockbit
>  	 * or entry type.
>  	 */

Can you please also fix the comment above this test so it talks about pfns
instead of sectors?

> -	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
> +	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
>  		goto put_unlocked;
>  	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
>  				dax_is_zero_entry(entry))) {
> @@ -718,29 +715,11 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
>  	 * worry about partial PMD writebacks.
>  	 */

Ditto for this comment ^^^

> -	sector = dax_radix_sector(entry);
> +	pfn = dax_radix_pfn(entry);
>  	size = PAGE_SIZE << dax_radix_order(entry);
>  
> -	id = dax_read_lock();
> -	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
> -	if (ret)
> -		goto dax_unlock;
> -
> -	/*
> -	 * dax_direct_access() may sleep, so cannot hold tree_lock over
> -	 * its invocation.
> -	 */
> -	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
> -	if (ret < 0)
> -		goto dax_unlock;
> -
> -	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
> -		ret = -EIO;
> -		goto dax_unlock;
> -	}
> -
> -	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
> -	dax_flush(dax_dev, kaddr, size);
> +	dax_mapping_entry_mkclean(mapping, index, pfn);
> +	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
>  	/*
>  	 * After we have flushed the cache, we can clear the dirty tag. There
>  	 * cannot be new dirty data in the pfn after the flush has completed as
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2017-12-24  0:56 ` [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS Dan Williams
@ 2017-12-27  5:29   ` Matthew Wilcox
  2018-01-02 20:21     ` Dan Williams
  2018-01-02 21:41   ` Dave Chinner
  1 sibling, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2017-12-27  5:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:59PM -0800, Dan Williams wrote:
> +int dax_set_page_dirty(struct page *page)
> +{
> +	/*
> +	 * Unlike __set_page_dirty_no_writeback, dax does all dirty
> +	 * tracking in the radix in response to mkwrite faults.

Please stop saying "in the radix".  I think you mean "in the page cache".

> +EXPORT_SYMBOL(dax_set_page_dirty);
> +EXPORT_SYMBOL(dax_direct_IO);
> +EXPORT_SYMBOL(dax_writepage);
> +EXPORT_SYMBOL(dax_readpage);
> +EXPORT_SYMBOL(dax_readpages);
> +EXPORT_SYMBOL(dax_write_begin);
> +EXPORT_SYMBOL(dax_write_end);
> +EXPORT_SYMBOL(dax_invalidatepage);

Exporting all these symbols to modules isn't exactly free.  Are you sure it
doesn't make more sense to put tests for dax in the existing aops?
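
For illustration, the kind of check being suggested might look roughly
like this (hypothetical sketch only; the fallback call is a stand-in,
not the existing xfs_vm_set_page_dirty() body):

static int xfs_vm_set_page_dirty(struct page *page)
{
	/* dax tracks dirty state in the radix tree, not in struct page */
	if (IS_DAX(page->mapping->host))
		return 0;

	/* otherwise fall through to the existing page cache handling */
	return __set_page_dirty_nobuffers(page);
}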

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-12-24  0:56 ` [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
@ 2017-12-27 18:08   ` Ross Zwisler
  2018-01-02 20:35     ` Dan Williams
  2018-01-02 21:44   ` Dave Chinner
  1 sibling, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-12-27 18:08 UTC (permalink / raw)
  To: Dan Williams; +Cc: jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
> In support of testing truncate colliding with dma add a mechanism that
> delays the completion of block I/O requests by a programmable number of
> seconds. This allows a truncate operation to be issued while page
> references are held for direct-I/O.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

> @@ -387,4 +389,64 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
>  }
>  EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
>  
> +static DEFINE_SPINLOCK(bio_lock);
> +static struct bio *biolist;
> +int bio_do_queue;
> +
> +static void run_bio(struct work_struct *work)
> +{
> +	struct delayed_work *dw = container_of(work, typeof(*dw), work);
> +	struct bio *bio, *next;
> +
> +	pr_info("%s\n", __func__);

Did you mean to leave this print in, or was it part of your debug while
developing?  I don't see any other prints in the rest of the nvdimm testing
code?

> +	spin_lock(&bio_lock);
> +	bio_do_queue = 0;
> +	bio = biolist;
> +	biolist = NULL;
> +	spin_unlock(&bio_lock);
> +
> +	while (bio) {
> +		next = bio->bi_next;
> +		bio->bi_next = NULL;
> +		bio_endio(bio);
> +		bio = next;
> +	}
> +	kfree(dw);
> +}
> +
> +void nfit_test_inject_bio_delay(int sec)
> +{
> +	struct delayed_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);
> +
> +	spin_lock(&bio_lock);
> +	if (!bio_do_queue) {
> +		pr_info("%s: %d seconds\n", __func__, sec);

Ditto with this print - did you mean to leave it in?

> +		INIT_DELAYED_WORK(dw, run_bio);
> +		bio_do_queue = 1;
> +		schedule_delayed_work(dw, sec * HZ);
> +		dw = NULL;

Why set dw = NULL here?  In the else case we leak dw - was this dw=NULL meant
to allow a kfree(dw) after we get out of the if() (and probably after we drop
the spinlock)?

> +	}
> +	spin_unlock(&bio_lock);
> +}
> +EXPORT_SYMBOL_GPL(nfit_test_inject_bio_delay);
> +

> diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
> index 7217b2b953b5..9362b01e9a8f 100644
> --- a/tools/testing/nvdimm/test/nfit.c
> +++ b/tools/testing/nvdimm/test/nfit.c
> @@ -872,6 +872,39 @@ static const struct attribute_group *nfit_test_dimm_attribute_groups[] = {
>  	NULL,
>  };
>  
> +static ssize_t bio_delay_show(struct device_driver *drv, char *buf)
> +{
> +	return sprintf(buf, "0\n");
> +}

It doesn't seem like this _show() routine adds much?  We could have it print
out the value of 'bio_do_queue' so we can see if we are currently queueing
bios in a workqueue element, but that suffers pretty badly from a TOCTOU race.

Otherwise we could just omit the _show() altogether and just use
DRIVER_ATTR_WO(bio_delay).
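
Something along these lines (sketch; bio_delay_store() stays exactly as
it is in the patch):

static ssize_t bio_delay_store(struct device_driver *drv, const char *buf,
		size_t count)
{
	unsigned long delay;
	int rc = kstrtoul(buf, 0, &delay);

	if (rc < 0)
		return rc;

	nfit_test_inject_bio_delay(delay);
	return count;
}
/* write-only attribute, no _show() required */
DRIVER_ATTR_WO(bio_delay);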

> +
> +static ssize_t bio_delay_store(struct device_driver *drv, const char *buf,
> +		size_t count)
> +{
> +	unsigned long delay;
> +	int rc = kstrtoul(buf, 0, &delay);
> +
> +	if (rc < 0)
> +		return rc;
> +
> +	nfit_test_inject_bio_delay(delay);
> +	return count;
> +}
> +DRIVER_ATTR_RW(bio_delay);

   DRIVER_ATTR_WO(bio_delay);  ?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 07/18] dax: store pfns in the radix
  2017-12-27  0:17   ` Ross Zwisler
@ 2018-01-02 20:15     ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2018-01-02 20:15 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-nvdimm, linux-xfs, Jeff Moyer,
	linux-fsdevel, Christoph Hellwig

On Tue, Dec 26, 2017 at 4:17 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sat, Dec 23, 2017 at 04:56:38PM -0800, Dan Williams wrote:
>> In preparation for examining the busy state of dax pages in the truncate
>> path, switch from sectors to pfns in the radix.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/dax/super.c |   15 ++++++++--
>>  fs/dax.c            |   75 ++++++++++++++++++---------------------------------
>>  2 files changed, 39 insertions(+), 51 deletions(-)
> <>
>> @@ -688,7 +685,7 @@ static int dax_writeback_one(struct block_device *bdev,
>>        * compare sectors as we must not bail out due to difference in lockbit
>>        * or entry type.
>>        */
>
> Can you please also fix the comment above this test so it talks about pfns
> instead of sectors?
>
>> -     if (dax_radix_sector(entry2) != dax_radix_sector(entry))
>> +     if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
>>               goto put_unlocked;
>>       if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
>>                               dax_is_zero_entry(entry))) {
>> @@ -718,29 +715,11 @@ static int dax_writeback_one(struct block_device *bdev,
>>        * 'entry'.  This allows us to flush for PMD_SIZE and not have to
>>        * worry about partial PMD writebacks.
>>        */
>
> Ditto for this comment ^^^
>

Sure, will do.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2017-12-27  5:29   ` Matthew Wilcox
@ 2018-01-02 20:21     ` Dan Williams
  2018-01-03 16:05       ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-01-02 20:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-nvdimm, Matthew Wilcox, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Tue, Dec 26, 2017 at 9:29 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Dec 23, 2017 at 04:56:59PM -0800, Dan Williams wrote:
>> +int dax_set_page_dirty(struct page *page)
>> +{
>> +     /*
>> +      * Unlike __set_page_dirty_no_writeback, dax does all dirty
>> +      * tracking in the radix in response to mkwrite faults.
>
> Please stop saying "in the radix".  I think you mean "in the page cache".

Ok, I'll be more precise and mention the PAGECACHE_TAG_DIRTY vs
PageDirty distinction.

>
>> +EXPORT_SYMBOL(dax_set_page_dirty);
>> +EXPORT_SYMBOL(dax_direct_IO);
>> +EXPORT_SYMBOL(dax_writepage);
>> +EXPORT_SYMBOL(dax_readpage);
>> +EXPORT_SYMBOL(dax_readpages);
>> +EXPORT_SYMBOL(dax_write_begin);
>> +EXPORT_SYMBOL(dax_write_end);
>> +EXPORT_SYMBOL(dax_invalidatepage);
>
> Exporting all these symbols to modules isn't exactly free.  Are you sure it
> doesn't make more sense to put tests for dax in the existing aops?
>

I'd rather have just one global fs_dax_aops instance that all
filesystems could reference, but ->writepages() is fundamentally an
address_space_operation. Until we can rework that I'd prefer the
overhead of the extra exports over sprinkling more IS_DAX checks
around.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-12-27 18:08   ` Ross Zwisler
@ 2018-01-02 20:35     ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2018-01-02 20:35 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Andrew Morton, Jan Kara,
	linux-nvdimm, linux-xfs, linux-fsdevel, Christoph Hellwig

On Wed, Dec 27, 2017 at 10:08 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
>> In support of testing truncate colliding with dma add a mechanism that
>> delays the completion of block I/O requests by a programmable number of
>> seconds. This allows a truncate operation to be issued while page
>> references are held for direct-I/O.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
>> @@ -387,4 +389,64 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
>>  }
>>  EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
>>
>> +static DEFINE_SPINLOCK(bio_lock);
>> +static struct bio *biolist;
>> +int bio_do_queue;
>> +
>> +static void run_bio(struct work_struct *work)
>> +{
>> +     struct delayed_work *dw = container_of(work, typeof(*dw), work);
>> +     struct bio *bio, *next;
>> +
>> +     pr_info("%s\n", __func__);
>
> Did you mean to leave this print in, or was it part of your debug while
> developing?  I don't see any other prints in the rest of the nvdimm testing
> code?
>
>> +     spin_lock(&bio_lock);
>> +     bio_do_queue = 0;
>> +     bio = biolist;
>> +     biolist = NULL;
>> +     spin_unlock(&bio_lock);
>> +
>> +     while (bio) {
>> +             next = bio->bi_next;
>> +             bio->bi_next = NULL;
>> +             bio_endio(bio);
>> +             bio = next;
>> +     }
>> +     kfree(dw);
>> +}
>> +
>> +void nfit_test_inject_bio_delay(int sec)
>> +{
>> +     struct delayed_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);
>> +
>> +     spin_lock(&bio_lock);
>> +     if (!bio_do_queue) {
>> +             pr_info("%s: %d seconds\n", __func__, sec);
>
> Ditto with this print - did you mean to leave it in?

Yes, this one plus the previous one are in there deliberately so that
I can see the injection / completion of the delay relative to when the
test is performing direct-i/o.

>
>> +             INIT_DELAYED_WORK(dw, run_bio);
>> +             bio_do_queue = 1;
>> +             schedule_delayed_work(dw, sec * HZ);
>> +             dw = NULL;
>
> Why set dw = NULL here?  In the else case we leak dw - was this dw=NULL meant
> to allow a kfree(dw) after we get out of the if() (and probably after we drop
> the spinlock)?

Something like that, but now it's just a leftover from an initial
version of the code, will delete.

>> +     }
>> +     spin_unlock(&bio_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(nfit_test_inject_bio_delay);
>> +
>
>> diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
>> index 7217b2b953b5..9362b01e9a8f 100644
>> --- a/tools/testing/nvdimm/test/nfit.c
>> +++ b/tools/testing/nvdimm/test/nfit.c
>> @@ -872,6 +872,39 @@ static const struct attribute_group *nfit_test_dimm_attribute_groups[] = {
>>       NULL,
>>  };
>>
>> +static ssize_t bio_delay_show(struct device_driver *drv, char *buf)
>> +{
>> +     return sprintf(buf, "0\n");
>> +}
>
> It doesn't seem like this _show() routine adds much?  We could have it print
> out the value of 'bio_do_queue' so we can see if we are currently queueing
> bios in a workqueue element, but that suffers pretty badly from a TOCTOU race.
>
> Otherwise we could just omit the _show() altogether and just use
> DRIVER_ATTR_WO(bio_delay).

Sure.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2017-12-24  0:57 ` [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper Dan Williams
@ 2018-01-02 21:07   ` Darrick J. Wong
  2018-01-02 23:00   ` Dave Chinner
  1 sibling, 0 replies; 66+ messages in thread
From: Darrick J. Wong @ 2018-01-02 21:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Dave Chinner, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
> xfs_break_layouts() scans for active pNFS layouts, drops locks and
> rescans for those layouts to be broken. xfs_sync_dma performs
> xfs_break_layouts and also scans for active dax-dma pages, drops locks
> and rescans for those pages to go idle.
> 
> dax_flush_dma handles synchronizing against new page-busy events
> (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the
> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
> callback that will fire when the page reference count reaches 1 (recall
> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
> locks so we do not deadlock the process that might be trying to elevate
> the page count of more pages before arranging for any of them to go idle
> as is typically the case of iov_iter_get_pages.
> 
> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
> (xfs_wait_dax_page) to handle evaluating the page reference count and
> dropping locks when waiting.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/Makefile    |    3 ++
>  fs/xfs/xfs_dma.c   |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_dma.h   |   24 +++++++++++++++
>  fs/xfs/xfs_file.c  |    6 ++--
>  fs/xfs/xfs_ioctl.c |    7 ++--
>  fs/xfs/xfs_iops.c  |    7 +++-
>  6 files changed, 118 insertions(+), 10 deletions(-)
>  create mode 100644 fs/xfs/xfs_dma.c
>  create mode 100644 fs/xfs/xfs_dma.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 7ceb41a9786a..f2cdc5a3eb6c 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -129,6 +129,9 @@ xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
>  				   xfs_qm.o \
>  				   xfs_quotaops.o
>  
> +# dax dma
> +xfs-$(CONFIG_FS_DAX)		+= xfs_dma.o
> +
>  # xfs_rtbitmap is shared with libxfs
>  xfs-$(CONFIG_XFS_RT)		+= xfs_rtalloc.o
>  
> diff --git a/fs/xfs/xfs_dma.c b/fs/xfs/xfs_dma.c
> new file mode 100644
> index 000000000000..3df1a51a76c4
> --- /dev/null
> +++ b/fs/xfs/xfs_dma.c
> @@ -0,0 +1,81 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + */
> +#include <linux/dax.h>
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_shared.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_bit.h"
> +#include "xfs_sb.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_inode.h"
> +#include "xfs_pnfs.h"
> +
> +/*
> + * xfs_wait_dax_page - helper for dax_flush_dma to drop locks and sleep
> + * wait for a page idle event. Returns 1 if the locks did not need to be
> + * dropped and the page is idle, returns -EINTR if the sleep was
> + * interrupted and returns 1 when it slept. dax_flush_dma()
> + * retries/rescans all mappings when the lock is dropped.

What does the return 0 case signify?  I'm guessing "returns 1 if the
locks did not need to be dropped and the page is idle"?

> + */
> +static int xfs_wait_dax_page(
> +	atomic_t		*count,
> +	unsigned int		mode)
> +{

static int
xfs_wait_dax_page(
	atomic_t		*count,
	unsigned int		mode)
{

IOWs, the function name gets its own line.  The same applies to every
other function definition.

> +	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
> +	struct page 		*page = refcount_to_page(count);

Space after 'page' ^^^ but before tab...

> +	struct address_space	*mapping = page->mapping;
> +	struct inode		*inode = mapping->host;
> +	struct xfs_inode	*ip = XFS_I(inode);
> +
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
> +
> +	if (page_ref_count(page) == 1)
> +		return 0;
> +
> +	xfs_iunlock(ip, iolock);
> +	schedule();
> +	xfs_ilock(ip, iolock);
> +
> +	if (signal_pending_state(mode, current))
> +		return -EINTR;
> +	return 1;
> +}
> +
> +/*
> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
> + * recall all layouts. For DAX, wait for transient DMA to complete. All
> + * other DMA is handled by pinning page cache pages.
> + *
> + * iolock must be held as XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
> + * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.

Is it guaranteed that we never emerge from xfs_break_layouts with
IOLOCK_SHARED?  I /think/ the answer is yes, but this seems like a
subtlety that would be easy to screw up.

> + */
> +int xfs_sync_dma(
> +	struct inode		*inode,
> +	uint			*iolock)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error;
> +
> +	while (true) {
> +		error = xfs_break_layouts(inode, iolock);
> +		if (error)
> +			break;
> +
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock |= XFS_MMAPLOCK_EXCL;
> +
> +		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
> +		if (error <= 0)
> +			break;
> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock &= ~XFS_MMAPLOCK_EXCL;
> +	}
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_dma.h b/fs/xfs/xfs_dma.h
> new file mode 100644
> index 000000000000..29635639b073
> --- /dev/null
> +++ b/fs/xfs/xfs_dma.h
> @@ -0,0 +1,24 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + */
> +#ifndef __XFS_DMA__
> +#define __XFS_DMA__
> +#ifdef CONFIG_FS_DAX
> +int xfs_sync_dma(struct inode *inode, uint *iolock);
> +#else
> +#include "xfs_pnfs.h"
> +
> +static inline int xfs_sync_dma(struct inode *inode, uint *iolock)

I think we need to do this prior to reflinking into a file too, right?
Or at least we would if dax+reflink were a supported config.  I think
the reason we've never tripped over this is that neither pnfs nor dax
will have anything to do with reflinked files.

(brain creaking, trying to get back up to speed....)

--D

> +{
> +	int error = xfs_break_layouts(inode, iolock);
> +
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> +	*iolock |= XFS_MMAPLOCK_EXCL;
> +	return 0;
> +}
> +#endif /* CONFIG_FS_DAX */
> +#endif /* __XFS_DMA__ */
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6df0c133a61e..84fc178da656 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -37,6 +37,7 @@
>  #include "xfs_log.h"
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
> +#include "xfs_dma.h"
>  #include "xfs_iomap.h"
>  #include "xfs_reflink.h"
>  
> @@ -778,12 +779,11 @@ xfs_file_fallocate(
>  		return -EOPNOTSUPP;
>  
>  	xfs_ilock(ip, iolock);
> -	error = xfs_break_layouts(inode, &iolock);
> +	error = xfs_sync_dma(inode, &iolock);
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> +	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>  
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 20dc65fef6a4..4340bef658b0 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -39,7 +39,7 @@
>  #include "xfs_icache.h"
>  #include "xfs_symlink.h"
>  #include "xfs_trans.h"
> -#include "xfs_pnfs.h"
> +#include "xfs_dma.h"
>  #include "xfs_acl.h"
>  #include "xfs_btree.h"
>  #include <linux/fsmap.h>
> @@ -643,12 +643,11 @@ xfs_ioc_space(
>  		return error;
>  
>  	xfs_ilock(ip, iolock);
> -	error = xfs_break_layouts(inode, &iolock);
> +	error = xfs_sync_dma(inode, &iolock);
>  	if (error)
>  		goto out_unlock;
>  
> -	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> -	iolock |= XFS_MMAPLOCK_EXCL;
> +	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>  
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 67bd97edc73b..c1055337b233 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -37,7 +37,7 @@
>  #include "xfs_da_btree.h"
>  #include "xfs_dir2.h"
>  #include "xfs_trans_space.h"
> -#include "xfs_pnfs.h"
> +#include "xfs_dma.h"
>  #include "xfs_iomap.h"
>  
>  #include <linux/capability.h>
> @@ -1030,11 +1030,12 @@ xfs_vn_setattr(
>  		struct xfs_inode	*ip = XFS_I(d_inode(dentry));
>  		uint			iolock = XFS_IOLOCK_EXCL;
>  
> -		error = xfs_break_layouts(d_inode(dentry), &iolock);
> +		error = xfs_sync_dma(d_inode(dentry), &iolock);
>  		if (error)
>  			return error;
>  
> -		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
> +
>  		error = xfs_vn_setattr_size(dentry, iattr);
>  		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>  	} else {
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS
  2017-12-24  0:57 ` [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS Dan Williams
@ 2018-01-02 21:15   ` Darrick J. Wong
  2018-01-02 21:40     ` Dan Williams
  2018-01-04  8:28   ` Christoph Hellwig
  1 sibling, 1 reply; 66+ messages in thread
From: Darrick J. Wong @ 2018-01-02 21:15 UTC (permalink / raw)
  To: Dan Williams; +Cc: jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:57:04PM -0800, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.
> 
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: linux-xfs@vger.kernel.org
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_aops.c |    2 ++
>  fs/xfs/xfs_aops.h |    1 +
>  fs/xfs/xfs_iops.c |    5 ++++-
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 21e2d70884e1..361915d53cef 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1492,3 +1492,5 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.is_partially_uptodate  = block_is_partially_uptodate,
>  	.error_remove_page	= generic_error_remove_page,
>  };
> +
> +DEFINE_FSDAX_AOPS(xfs_dax_address_space_operations, xfs_vm_writepages);

Hmm, if we ever re-enable changing the DAX flag on the fly, will
mapping->a_ops have to change dynamically too?

How sure are we that we'll never have to set anything in the dax aops
other than ->writepages?

(I also kinda wonder why not just make the callers savvy vs. a bunch of
dummy aops, but maybe that's been tried in a previous iteration?)

--D

> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 88c85ea63da0..a6ffbb5fe379 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -54,6 +54,7 @@ struct xfs_ioend {
>  };
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> +extern const struct address_space_operations xfs_dax_address_space_operations;
>  
>  int	xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 56475fcd76f2..67bd97edc73b 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1272,7 +1272,10 @@ xfs_setup_iops(
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
>  		inode->i_fop = &xfs_file_operations;
> -		inode->i_mapping->a_ops = &xfs_address_space_operations;
> +		if (IS_DAX(inode))
> +			inode->i_mapping->a_ops = &xfs_dax_address_space_operations;
> +		else
> +			inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
>  		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS
  2018-01-02 21:15   ` Darrick J. Wong
@ 2018-01-02 21:40     ` Dan Williams
  2018-01-03 16:09       ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-01-02 21:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Andrew Morton

On Tue, Jan 2, 2018 at 1:15 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Sat, Dec 23, 2017 at 04:57:04PM -0800, Dan Williams wrote:
>> In preparation for the dax implementation to start associating dax pages
>> to inodes via page->mapping, we need to provide a 'struct
>> address_space_operations' instance for dax. Otherwise, direct-I/O
>> triggers incorrect page cache assumptions and warnings.
>>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: linux-xfs@vger.kernel.org
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/xfs_aops.c |    2 ++
>>  fs/xfs/xfs_aops.h |    1 +
>>  fs/xfs/xfs_iops.c |    5 ++++-
>>  3 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 21e2d70884e1..361915d53cef 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -1492,3 +1492,5 @@ const struct address_space_operations xfs_address_space_operations = {
>>       .is_partially_uptodate  = block_is_partially_uptodate,
>>       .error_remove_page      = generic_error_remove_page,
>>  };
>> +
>> +DEFINE_FSDAX_AOPS(xfs_dax_address_space_operations, xfs_vm_writepages);
>
> Hmm, if we ever re-enable changing the DAX flag on the fly, will
> mapping->a_ops have to change dynamically too?
>
> How sure are we that we'll never have to set anything in the dax aops
> other than ->writepages?
>
> (I also kinda wonder why not just make the callers savvy vs. a bunch of
> dummy aops, but maybe that's been tried in a previous iteration?)

Matthew had similar feedback. I pushed back that I think more IS_DAX
sprinkling increases the long term maintenance burden, but now that
you've independently asked for the same thing I'm not opposed to
changing my mind.

Either way, this need to switch the address_space_operations, or to
synchronize against in-flight address_space_operations, is going to
complicate the "dynamic toggle the dax mode" feature.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2017-12-24  0:56 ` [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS Dan Williams
  2017-12-27  5:29   ` Matthew Wilcox
@ 2018-01-02 21:41   ` Dave Chinner
  1 sibling, 0 replies; 66+ messages in thread
From: Dave Chinner @ 2018-01-02 21:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:59PM -0800, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings like the
> following:
> 
>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>  xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>  [..]
>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O 4.15.0-rc2+ #984
>  [..]
>  Call Trace:
>   set_page_dirty_lock+0x40/0x60
>   bio_set_pages_dirty+0x37/0x50
>   iomap_dio_actor+0x2b7/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   iomap_apply+0xa4/0x110
>   iomap_dio_rw+0x29e/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>   __vfs_read+0xf9/0x170
>   vfs_read+0xa6/0x150
>   SyS_pread64+0x93/0xb0
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> ...where the default set_page_dirty() handler assumes that dirty state
> is being tracked in 'struct page' flags.
> 
> A DEFINE_FSDAX_AOPS macro helper is provided instead of a global 'struct
> address_space_operations fs_dax_aops' instance, because ->writepages
> needs to be an fs-specific implementation.
....
>  static int __init init_dax_wait_table(void)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 1c6ed44fe9fc..3502abcbea31 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -53,6 +53,34 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
>  
>  struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
>  void fs_dax_release(struct dax_device *dax_dev, void *owner);
> +int dax_set_page_dirty(struct page *page);
> +ssize_t dax_direct_IO(struct kiocb *kiocb, struct iov_iter *iter);
> +int dax_writepage(struct page *page, struct writeback_control *wbc);
> +int dax_readpage(struct file *filp, struct page *page);
> +int dax_readpages(struct file *filp, struct address_space *mapping,
> +		struct list_head *pages, unsigned nr_pages);
> +int dax_write_begin(struct file *filp, struct address_space *mapping,
> +		loff_t pos, unsigned len, unsigned flags,
> +		struct page **pagep, void **fsdata);
> +int dax_write_end(struct file *filp, struct address_space *mapping,
> +		loff_t pos, unsigned len, unsigned copied,
> +		struct page *page, void *fsdata);
> +void dax_invalidatepage(struct page *page, unsigned int offset,
> +		unsigned int length);
> +
> +#define DEFINE_FSDAX_AOPS(name, writepages_fn)	\
> +const struct address_space_operations name = {	\
> +	.set_page_dirty = dax_set_page_dirty,	\
> +	.direct_IO = dax_direct_IO,	\
> +	.writepage = dax_writepage,	\
> +	.readpage = dax_readpage,	\
> +	.writepages = writepages_fn,	\
> +	.readpages = dax_readpages,	\
> +	.write_begin = dax_write_begin,	\
> +	.write_end = dax_write_end,	\
> +	.invalidatepage = dax_invalidatepage, \
> +}

Please don't hide ops structure definitions inside macros - it goes
completely against the convention used everywhere in filesystems.
i.e. we declare them in full for each filesystem that uses them so
that they can be modified for each individual filesystem as
necessary.

Also, ops structures aren't intended to be debugging aids. If the
filesystem doesn't implement something, the ops method should be
null and hence never called, not stubbed with a function that issues
warnings. If you really want to make sure we don't screw up, add a
debug only-check on the inode's aops vector when the DAX mmap range
is first being set up.

IOWs, this:

const struct address_space_operations xfs_dax_aops = {
	.writepages = xfs_vm_dax_writepage,
};

is all that should be defined for XFS, and similarly for other
filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-12-24  0:56 ` [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
  2017-12-27 18:08   ` Ross Zwisler
@ 2018-01-02 21:44   ` Dave Chinner
  2018-01-02 21:51     ` Dan Williams
  1 sibling, 1 reply; 66+ messages in thread
From: Dave Chinner @ 2018-01-02 21:44 UTC (permalink / raw)
  To: Dan Williams; +Cc: jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
> In support of testing truncate colliding with dma add a mechanism that
> delays the completion of block I/O requests by a programmable number of
> seconds. This allows a truncate operation to be issued while page
> references are held for direct-I/O.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Why not put this in the generic bio layer code and then write a
generic fstest to exercise this truncate vs direct IO completion
race condition on all types of storage and filesystems?

i.e. if it sits in a nvdimm test suite, it's never going to be run
by filesystem developers....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2018-01-02 21:44   ` Dave Chinner
@ 2018-01-02 21:51     ` Dan Williams
  2018-01-03 15:46       ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-01-02 21:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Andrew Morton

On Tue, Jan 2, 2018 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
>> In support of testing truncate colliding with dma add a mechanism that
>> delays the completion of block I/O requests by a programmable number of
>> seconds. This allows a truncate operation to be issued while page
>> references are held for direct-I/O.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> Why not put this in the generic bio layer code and then write a
> generic fstest to exercise this truncate vs direct IO completion
> race condition on all types of storage and filesystems?
>
> i.e. if it sits in a nvdimm test suite, it's never going to be run
> by filesystem developers....

I do want to get it into xfstests eventually. I picked the nvdimm
infrastructure for expediency of getting the fix developed. Also, I
consider the collision in the non-dax case a solved problem since the
core mm will keep the page out of circulation indefinitely.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2017-12-24  0:57 ` [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper Dan Williams
  2018-01-02 21:07   ` Darrick J. Wong
@ 2018-01-02 23:00   ` Dave Chinner
  2018-01-03  2:21     ` Dan Williams
  2018-01-04  8:33     ` Christoph Hellwig
  1 sibling, 2 replies; 66+ messages in thread
From: Dave Chinner @ 2018-01-02 23:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Darrick J. Wong, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
> xfs_break_layouts() scans for active pNFS layouts, drops locks and
> rescans for those layouts to be broken. xfs_sync_dma performs
> xfs_break_layouts and also scans for active dax-dma pages, drops locks
> and rescans for those pages to go idle.
> 
> dax_flush_dma handles synchronizing against new page-busy events
> (get_user_pages). It invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the
> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
> callback that will fire when the page reference count reaches 1 (recall
> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
> locks so we do not deadlock the process that might be trying to elevate
> the page count of more pages before arranging for any of them to go idle
> as is typically the case of iov_iter_get_pages.
> 
> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
> (xfs_wait_dax_page) to handle evaluating the page reference count and
> dropping locks when waiting.

I don't see a problem with supporting this functionality, but I
see lots of problems with the code being presented. First of all,
I think the "sync dma" abstraction here is all wrong.

In the case of the filesystem, we don't care about whether DMA has
completed or not, and we *shouldn't have to care* about deep, dark
secrets of other subsystems.

If I read the current code, I see this in all the "truncate" paths:

	start op
	break layout leases
	change layout

and in the IO path:

	start IO
	break layout leases
	map IO
	issue IO

What this change does is make the truncate paths read:

	start op
	sync DMA
	change layout

but the IO path is unchanged. (This is not explained in comments or
commit messages).

And I look at that "sync DMA" step and wonder why the hell we need
to "sync DMA" because DMA has nothing to do with high level
filesystem code. It doesn't tell me anything obvious about why we
need to do this, nor does it tell me what we're actually
synchronising against.

What we care about in the filesystem code is whether there are
existing external references to file layout. If there's an external
reference, then it has to be broken before we can proceed and modify
the file layout. We don't care what owns that reference, just that
it has to broken before we continue.

AFAIC, these DMA references are just another external layout
reference that needs to be broken.  IOWs, this "sync DMA" complexity
needs to go inside xfs_break_layouts() as it is part of breaking the
external reference to the file layout - it does not replace the
layout breaking abstraction and so the implementation needs to
reflect that.
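
In rough pseudo-code terms, something like the following (a sketch of
the suggested structure only - the helper names and the exact retry
condition are placeholders, not code from this series):

int xfs_break_layouts(struct inode *inode, uint *iolock)
{
	int error;

	do {
		/* existing behaviour: recall pNFS layout leases */
		error = xfs_break_leased_layouts(inode, iolock);
		if (error)
			return error;
		/*
		 * new: break transient DAX DMA references; a positive
		 * return means locks were cycled and we must rescan
		 */
		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
	} while (error > 0);

	return error;
}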

> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
> + * recall all layouts. For DAX, wait for transient DMA to complete. All
> + * other DMA is handled by pinning page cache pages.
> + *
> + * iolock must be held as XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
> + * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
> + */
> +int xfs_sync_dma(
> +	struct inode		*inode,
> +	uint			*iolock)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error;
> +
> +	while (true) {
> +		error = xfs_break_layouts(inode, iolock);
> +		if (error)
> +			break;
> +
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock |= XFS_MMAPLOCK_EXCL;
> +
> +		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
> +		if (error <= 0)
> +			break;
> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock &= ~XFS_MMAPLOCK_EXCL;
> +	}

At this level the loop seems sub-optimal. If we don't drop the
IOLOCK, then we have no reason to call xfs_break_layouts() a second
time.  Hence in isolation this loop doesn' make sense. Yes, I
realise that dax_flush_dma() can result in all locks on the inode
being dropped, but that's hidden in another function whose calling
scope is not at all obvious from this code.

Also, xfs_wait_dax_page() assumes we have IOLOCK_EXCL held when it
is called. Nothing enforces the requirement that xfs_sync_dma() is
passed XFS_IOLOCK_EXCL, and so such assumptions cannot be made.
Even if it was, I really dislike the idea of a function that
/assumes/ lock state - that's a landmine that will bite us in the
rear end at some unexpected point in the future. If you need to
cycle held locks on an inode, you need to pass the held lock state
to the function.

Another gripe I have is that calling xfs_sync_dma() implies the mmap
lock is held exclusively on return. Hiding this locking inside
xfs_sync_dma removes the code documentation that large tracts of
code are protected against page faults by the XFS_MMAPLOCK_EXCL lock
call.  Instead of knowing at a glance that the truncate path
(xfs_vn_setattr) is protected against page faults, I have to
remember that xfs_sync_dma() now does this.

This goes back to my initial comments of "what the hell does "sync
dma" mean in a filesystem context?" - it certainly doesn't make me
think about inode locking. I don't like hidden/implicitly locking
like this, because it breaks both the least-surprise and the
self-documenting code principles....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2018-01-02 23:00   ` Dave Chinner
@ 2018-01-03  2:21     ` Dan Williams
  2018-01-03  7:51       ` Dave Chinner
  2018-01-04  8:33     ` Christoph Hellwig
  1 sibling, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-01-03  2:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Tue, Jan 2, 2018 at 3:00 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
>> xfs_break_layouts() scans for active pNFS layouts, drops locks and
>> rescans for those layouts to be broken. xfs_sync_dma performs
>> xfs_break_layouts and also scans for active dax-dma pages, drops locks
>> and rescans for those pages to go idle.
>>
>> dax_flush_dma handles synchronizing against new page-busy events
>> (get_user_pages). It invalidates all mappings to trigger the
>> get_user_pages slow path which will eventually block on the
>> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
>> callback that will fire when the page reference count reaches 1 (recall
>> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
>> locks so we do not deadlock the process that might be trying to elevate
>> the page count of more pages before arranging for any of them to go idle
>> as is typically the case of iov_iter_get_pages.
>>
>> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
>> (xfs_wait_dax_page) to handle evaluating the page reference count and
>> dropping locks when waiting.
>
> I don't see a problem with supporting this functionality, but I
> see lots of problems with the code being presented. First of all,
> I think the "sync dma" abstraction here is all wrong.
>
> In the case of the filesystem, we don't care about whether DMA has
> completed or not, and we *shouldn't have to care* about deep, dark
> secrets of other subsystems.
>
> If I read the current code, I see this in all the "truncate" paths:
>
>         start op
>         break layout leases
>         change layout
>
> and in the IO path:
>
>         start IO
>         break layout leases
>         map IO
>         issue IO
>
> What this change does is make the truncate paths read:
>
>         start op
>         sync DMA
>         change layout
>
> but the IO path is unchanged. (This is not explained in comments or
> commit messages).
>
> And I look at that "sync DMA" step and wonder why the hell we need
> to "sync DMA" because DMA has nothing to do with high level
> filesystem code. It doesn't tell me anything obvious about why we
> need to do this, nor does it tell me what we're actually
> synchronising against.
>
> What we care about in the filesystem code is whether there are
> existing external references to file layout. If there's an external
> reference, then it has to be broken before we can proceed and modify
> the file layout. We don't care what owns that reference, just that
> it has to broken before we continue.
>
> AFAIC, these DMA references are just another external layout
> reference that needs to be broken.  IOWs, this "sync DMA" complexity
> needs to go inside xfs_break_layouts() as it is part of breaking the
> external reference to the file layout - it does not replace the
> layout breaking abstraction and so the implementation needs to
> reflect that.

These two sentences from the xfs_break_layouts() comment scared me
down this path of distinguishing dax-dma waiting from pNFS layout
lease break waiting:

---

"Additionally we call it during the write operation, where aren't
concerned about exposing unallocated blocks but just want to provide
basic synchronization between a local writer and pNFS clients.  mmap
writes would also benefit from this sort of synchronization, but due
to the tricky locking rules in the page fault path we don't bother."

---

I was not sure about holding XFS_MMAPLOCK_EXCL over
xfs_break_layouts() where it has historically not been held, and I was
worried about the potential deadlock of requiring all pages to be
unmapped and idle during a write. I.e. would we immediately deadlock
if userspace performed direct-I/O to a file with a source buffer that
was mapped from that same file?

In general though, I agree that xfs_break_layouts() should comprehend
both cases. I'll investigate if the deadlock is real and perhaps add a
flag to xfs_break_layouts to distinguish the IO path from the truncate
paths to at least make that detail internal to the layout breaking
mechanism.
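
Roughly, I'm imagining something like the following sketch, where the
BREAK_* flag names and the xfs_break_{leased,dax}_layouts() helpers are
hypothetical placeholders, not code from this series:

/*
 * Sketch only: the caller states which path it is on and the dax-dma
 * wait stays internal to the layout-breaking mechanism.
 */
enum xfs_break_reason {
	BREAK_WRITE,	/* IO path: recall pNFS layouts only */
	BREAK_UNMAP,	/* truncate/punch: also wait for busy dax pages */
};

int
xfs_break_layouts(
	struct inode		*inode,
	uint			*iolock,
	enum xfs_break_reason	reason)
{
	int			error;

	error = xfs_break_leased_layouts(inode, iolock);
	if (error || reason == BREAK_WRITE)
		return error;

	/* unmap the file and wait for dax pages to go idle */
	return xfs_break_dax_layouts(inode, iolock);
}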

>> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
>> + * recall all layouts. For DAX, wait for transient DMA to complete. All
>> + * other DMA is handled by pinning page cache pages.
>> + *
>> + * iolock must held XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
>> + * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
>> + */
>> +int xfs_sync_dma(
>> +     struct inode            *inode,
>> +     uint                    *iolock)
>> +{
>> +     struct xfs_inode        *ip = XFS_I(inode);
>> +     int                     error;
>> +
>> +     while (true) {
>> +             error = xfs_break_layouts(inode, iolock);
>> +             if (error)
>> +                     break;
>> +
>> +             xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>> +             *iolock |= XFS_MMAPLOCK_EXCL;
>> +
>> +             error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
>> +             if (error <= 0)
>> +                     break;
>> +             xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>> +             *iolock &= ~XFS_MMAPLOCK_EXCL;
>> +     }
>
> At this level the loop seems sub-optimal. If we don't drop the
> IOLOCK, then we have no reason to call xfs_break_layouts() a second
> time.  Hence in isolation this loop doesn't make sense. Yes, I
> realise that dax_flush_dma() can result in all locks on the inode
> being dropped, but that's hidden in another function whose calling
> scope is not at all obvious from this code.
>
> Also, xfs_wait_dax_page() assumes we have IOLOCK_EXCL held when it
> is called. Nothing enforces the requirement that xfs_sync_dma() is
> passed XFS_IOLOCK_EXCL, and so such assumptions cannot be made.
> Even if it was, I really dislike the idea of a function that
> /assumes/ lock state - that's a landmine that will bite us in the
> rear end at some unexpected point in the future. If you need to
> cycle held locks on an inode, you need to pass the held lock state
> to the function.

I agree, and I thought about this, but at the time the callback is
made the only way we could pass the lock context to
xfs_wait_dax_page() would be to temporarily store it in the 'struct
page', which seemed ugly at first glance.

However, we do have space: we could alias it with the other, unused
8 bytes of page->lru.

At least that cleans up the filesystem interface to not need to make
implicit locking assumptions... I'll go that route.
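
As a sketch (helper names made up here), with the page off any LRU while
we wait, one word of page->lru could carry the context pointer:

static void dax_set_wait_ctx(struct page *page, void *ctx)
{
	/* page is not on any lru while it is being waited on */
	page->lru.prev = ctx;
}

static void *dax_get_wait_ctx(struct page *page)
{
	return page->lru.prev;
}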

> Another gripe I have is that calling xfs_sync_dma() implies the mmap
> lock is held exclusively on return. Hiding this locking inside
> xfs_sync_dma removes the code documentation that large tracts of
> code are protected against page faults by the XFS_MMAPLOCK_EXCL lock
> call.  Instead of knowing at a glance that the truncate path
> (xfs_vn_setattr) is protected against page faults, I have to
> remember that xfs_sync_dma() now does this.
>
> This goes back to my initial comments of "what the hell does "sync
> dma" mean in a filesystem context?" - it certainly doesn't make me
> think about inode locking. I don't like hidden/implicitly locking
> like this, because it breaks both the least-surprise and the
> self-documenting code principles....

Thanks for this Dave, it's clear. I'll spin a new version with some
of the reworks proposed above, but holler if those don't address your
core concern.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2018-01-03  2:21     ` Dan Williams
@ 2018-01-03  7:51       ` Dave Chinner
  2018-01-04  8:34         ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Dave Chinner @ 2018-01-03  7:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Tue, Jan 02, 2018 at 06:21:13PM -0800, Dan Williams wrote:
> On Tue, Jan 2, 2018 at 3:00 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
> >> xfs_break_layouts() scans for active pNFS layouts, drops locks and
> >> rescans for those layouts to be broken. xfs_sync_dma performs
> >> xfs_break_layouts and also scans for active dax-dma pages, drops locks
> >> and rescans for those pages to go idle.
> >>
> >> dax_flush_dma handles synchronizing against new page-busy events
> >> (get_user_pages). It invalidates all mappings to trigger the
> >> get_user_pages slow path which will eventually block on the
> >> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
> >> callback that will fire when the page reference count reaches 1 (recall
> >> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
> >> locks so we do not deadlock the process that might be trying to elevate
> >> the page count of more pages before arranging for any of them to go idle
> >> as is typically the case of iov_iter_get_pages.
> >>
> >> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
> >> (xfs_wait_dax_page) to handle evaluating the page reference count and
> >> dropping locks when waiting.
> >
> > I don't see a problem with supporting this functionality, but I
> > see lots of problems with the code being presented. First of all,
> > I think the "sync dma" abstraction here is all wrong.
> >
> > In the case of the filesystem, we don't care about whether DMA has
> > completed or not, and we *shouldn't have to care* about deep, dark
> > secrets of other subsystems.
> >
> > If I read the current code, I see this in all the "truncate" paths:
> >
> >         start op
> >         break layout leases
> >         change layout
> >
> > and in the IO path:
> >
> >         start IO
> >         break layout leases
> >         map IO
> >         issue IO
> >
> > What this change does is make the truncate paths read:
> >
> >         start op
> >         sync DMA
> >         change layout
> >
> > but the IO path is unchanged. (This is not explained in comments or
> > commit messages).
> >
> > And I look at that "sync DMA" step and wonder why the hell we need
> > to "sync DMA" because DMA has nothing to do with high level
> > filesystem code. It doesn't tell me anything obvious about why we
> > need to do this, nor does it tell me what we're actually
> > synchronising against.
> >
> > What we care about in the filesystem code is whether there are
> > existing external references to file layout. If there's an external
> > reference, then it has to be broken before we can proceed and modify
> > the file layout. We don't care what owns that reference, just that
> > it has to be broken before we continue.
> >
> > AFAIC, these DMA references are just another external layout
> > reference that needs to be broken.  IOWs, this "sync DMA" complexity
> > needs to go inside xfs_break_layouts() as it is part of breaking the
> > external reference to the file layout - it does not replace the
> > layout breaking abstraction and so the implementation needs to
> > reflect that.
> 
> These two sentences from the xfs_break_layouts() comment scared me
> down this path of distinguishing dax-dma waiting from pNFS layout
> lease break waiting:
> 
> ---
> 
> "Additionally we call it during the write operation, where aren't
> concerned about exposing unallocated blocks but just want to provide
> basic synchronization between a local writer and pNFS clients.  mmap
> writes would also benefit from this sort of synchronization, but due
> to the tricky locking rules in the page fault path we don't bother."
> ---

The pnfs code went into 3.20 (4.0, IIRC), whilst the XFS_MMAPLOCK
code went into 4.1. So the pnfs code was written and tested by
Christoph a long time before I added the XFS_MMAPLOCK, despite them
landing only one release apart. We've never really gone back to look
at this because there hasn't been a need until now....

> I was not sure about holding XFS_MMAPLOCK_EXCL over
> xfs_break_layouts() where it has historically not been held, and I was
> worried about the potential deadlock of requiring all pages to be
> unmapped and idle during a write. I.e. would we immediately deadlock
> if userspace performed  direct-I/O to a file with a source buffer that
> was mapped from that same file?

Most likely.

> In general though, I agree that xfs_break_layouts() should comprehend
> both cases. I'll investigate if the deadlock is real and perhaps add a
> flag to xfs_break_layouts to distinguish the IO path from the truncate
> paths to at least make that detail internal to the layout breaking
> mechanism.

We can't hold the XFS_MMAPLOCK over the direct IO write submission
path. That will cause deadlocks as it will invert the
mmap_sem/XFS_MMAPLOCK order via get_user_pages_fast(). That's the
whole reason we have the IOLOCK and the MMAPLOCK - neither can be
taken in both the IO path and the page fault path because of
mmap_sem inversions, hence we need a lock per path for truncate
exclusion....

We can take the MMAPLOCK briefly during IO setup (e.g. where we are
breaking layouts) but we have to drop it before calling into the
iomap code where the mmap_sem may be taken.....
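
i.e. the write path would need to look something like this sketch
(names and the exact calls are illustrative; the point is where the
MMAPLOCK gets dropped):

static ssize_t
xfs_dio_write_sketch(struct kiocb *iocb, struct iov_iter *from)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	uint			iolock = XFS_IOLOCK_EXCL;
	ssize_t			ret;

	xfs_ilock(ip, iolock);

	/* MMAPLOCK only brackets the layout break during IO setup */
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	ret = xfs_break_layouts(VFS_I(ip), &iolock);
	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	if (ret)
		goto out;

	/* iomap can take mmap_sem via get_user_pages_fast() */
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, NULL);
out:
	xfs_iunlock(ip, iolock);
	return ret;
}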

> >> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
> >> + * recall all layouts. For DAX, wait for transient DMA to complete. All
> >> + * other DMA is handled by pinning page cache pages.
> >> + *
> >> + * iolock must held XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
> >> + * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
> >> + */
> >> +int xfs_sync_dma(
> >> +     struct inode            *inode,
> >> +     uint                    *iolock)
> >> +{
> >> +     struct xfs_inode        *ip = XFS_I(inode);
> >> +     int                     error;
> >> +
> >> +     while (true) {
> >> +             error = xfs_break_layouts(inode, iolock);
> >> +             if (error)
> >> +                     break;
> >> +
> >> +             xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> >> +             *iolock |= XFS_MMAPLOCK_EXCL;
> >> +
> >> +             error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
> >> +             if (error <= 0)
> >> +                     break;
> >> +             xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> >> +             *iolock &= ~XFS_MMAPLOCK_EXCL;
> >> +     }
> >
> > At this level the loop seems sub-optimal. If we don't drop the
> > IOLOCK, then we have no reason to call xfs_break_layouts() a second
> > time.  Hence in isolation this loop doesn't make sense. Yes, I
> > realise that dax_flush_dma() can result in all locks on the inode
> > being dropped, but that's hidden in another function whose calling
> > scope is not at all obvious from this code.
> >
> > Also, xfs_wait_dax_page() assumes we have IOLOCK_EXCL held when it
> > is called. Nothing enforces the requirement that xfs_sync_dma() is
> > passed XFS_IOLOCK_EXCL, and so such assumptions cannot be made.
> > Even if it was, I really dislike the idea of a function that
> > /assumes/ lock state - that's a landmine that will bite us in the
> > rear end at some unexpected point in the future. If you need to
> > cycle held locks on an inode, you need to pass the held lock state
> > to the function.
> 
> I agree, and I thought about this, but at the time the callback is
> made the only way we could pass the lock context to
> xfs_wait_dax_page() would be to temporarily store it in the 'struct
> page' which seemed ugly at first glance.

I haven't looked at how you are implementing that callback, but its
parameters are ... a bit strange. If we're waiting on a page, then
it should be passed the page, not an atomic_t....
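
i.e. something with a shape like this (purely illustrative names):

/* callback gets the busy page; the fs finds its own lock context */
typedef int (*dax_wait_page_fn)(struct page *page, void *ctx);

int dax_flush_dma(struct address_space *mapping, dax_wait_page_fn wait,
		void *ctx);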

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 02/18] ext4: auto disable dax instead of failing mount
  2017-12-24  0:56 ` [PATCH v4 02/18] ext4: auto disable dax instead of failing mount Dan Williams
@ 2018-01-03 14:20   ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 14:20 UTC (permalink / raw)
  To: Dan Williams; +Cc: jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat 23-12-17 16:56:11, Dan Williams wrote:
> Bring the ext4 filesystem in line with xfs that only warns and continues
> when the "-o dax" option is specified to mount and the backing device
> does not support dax. This is in preparation for removing dax support
> from devices that do not enable get_user_pages() operations on dax
> mappings. In other words 'gup' support is required and configurations
> that were using so called 'page-less' dax will be converted back to
> using the page cache.
> 
> Removing the broken 'page-less' dax support is a pre-requisite for
> removing the "EXPERIMENTAL" warning when mounting a filesystem in dax
> mode.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

OK, given the fact that we already silently disable DAX for inodes with
data journalling I agree this makes sense. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 7c46693a14d7..18873ea89e08 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3710,11 +3710,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		if (ext4_has_feature_inline_data(sb)) {
>  			ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
>  					" that may contain inline data");
> -			goto failed_mount;
> +			sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
>  		}
>  		err = bdev_dax_supported(sb, blocksize);
> -		if (err)
> -			goto failed_mount;
> +		if (err) {
> +			ext4_msg(sb, KERN_ERR,
> +				"DAX unsupported by block device. Turning off DAX.");
> +			sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
> +		}
>  	}
>  
>  	if (ext4_has_feature_encrypt(sb) && es->s_encryption_level) {
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 03/18] ext2: auto disable dax instead of failing mount
  2017-12-24  0:56 ` [PATCH v4 03/18] ext2: " Dan Williams
@ 2018-01-03 14:21   ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 14:21 UTC (permalink / raw)
  To: Dan Williams; +Cc: jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat 23-12-17 16:56:17, Dan Williams wrote:
> Bring the ext2 filesystem in line with xfs that only warns and continues
> when the "-o dax" option is specified to mount and the backing device
> does not support dax. This is in preparation for removing dax support
> from devices that do not enable get_user_pages() operations on dax
> mappings. In other words 'gup' support is required and configurations
> that were using so called 'page-less' dax will be converted back to
> using the page cache.
> 
> Removing the broken 'page-less' dax support is a pre-requisite for
> removing the "EXPERIMENTAL" warning when mounting a filesystem in dax
> mode.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext2/super.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 7646818ab266..38f9222606ee 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -959,8 +959,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
>  		err = bdev_dax_supported(sb, blocksize);
> -		if (err)
> -			goto failed_mount;
> +		if (err) {
> +			ext2_msg(sb, KERN_ERR,
> +				"DAX unsupported by block device. Turning off DAX.");
> +			sbi->s_mount_opt &= ~EXT2_MOUNT_DAX;
> +		}
>  	}
>  
>  	/* If the blocksize doesn't match, re-read the thing.. */
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax
  2017-12-24  0:56 ` [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax Dan Williams
@ 2018-01-03 15:27   ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 15:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, linux-nvdimm, hch, linux-xfs, linux-fsdevel,
	akpm, Kirill A. Shutemov

On Sat 23-12-17 16:56:27, Dan Williams wrote:
> VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
> the memory page it is dealing with is not typical memory from the linear
> map. The get_user_pages_fast() path, since it does not resolve the vma,
> is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
> use that as a VM_MIXEDMAP replacement in some locations. In the cases
> where there is no pte to consult we fallback to using vma_is_dax() to
> detect the VM_MIXEDMAP special case.
> 
> Now that we have explicit driver pfn_t-flag opt-in/opt-out for
> get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
> also means we no longer need to worry about safely manipulating vm_flags
> in a future where we support dynamically changing the dax mode of a
> file.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

...

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 751e97aa2210..eff3ec1e2574 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -96,7 +96,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>  		new_flags |= VM_DONTDUMP;
>  		break;
>  	case MADV_DODUMP:
> -		if (new_flags & VM_SPECIAL) {
> +		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
>  			error = -EINVAL;
>  			goto out;
>  		}

Why do you add the check here? I assume it's because VM_SPECIAL contains
VM_MIXEDMAP... But then why don't we allow dumping of DAX VMAs? Possibly
just keep the addition of the check in this patch and then add a separate
patch removing it with proper justification.

> diff --git a/mm/memory.c b/mm/memory.c
> index 48a13473b401..1efb005e8fab 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
...
> @@ -1228,7 +1232,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * efficient than faulting.
>  	 */
>  	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> -			!vma->anon_vma)
> +			!vma->anon_vma && !vma_is_dax(vma))
>  		return 0;
>  
>  	if (is_vm_hugetlb_page(vma))

Ditto here... Page fault will fill DAX vmas just fine so I don't see a
reason why fork would need to copy page tables by hand.

Also I suppose comments about VM_MIXEDMAP in do_wp_page() and
wp_pfn_shared() would use some updating.

I'm not sure, but I think the VM_SPECIAL checks in mm/hmm.c need treatment
as well?

If the replacement were really strict you should also add the check to
vma_merge() AFAICT. But as in some other cases, vma merging works just fine
for DAX vmas, so in the end vma_merge() should IMO be left to merge DAX
vmas without the extra check. But it would be good to have this change
recorded in the changelog of a separate patch removing this additional check.
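
For reference, vma_is_dax() is roughly just:

static inline bool vma_is_dax(struct vm_area_struct *vma)
{
	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
}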

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax
  2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
@ 2018-01-03 15:29   ` Jan Kara
  2018-01-04  8:16   ` Christoph Hellwig
  2018-01-08 11:58   ` Gerald Schaefer
  2 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 15:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens, hch,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, akpm, Gerald Schaefer

On Sat 23-12-17 16:56:22, Dan Williams wrote:
> If a dax buffer from a device that does not map pages is passed to
> read(2) or write(2) as a target for direct-I/O it triggers SIGBUS. If
> gdb attempts to examine the contents of a dax buffer from a device that
> does not map pages it triggers SIGBUS. If fork(2) is called on a process
> with a dax mapping from a device that does not map pages it triggers
> SIGBUS. 'struct page' is required otherwise several kernel code paths
> break in surprising ways. Disable filesystem-dax on devices that do not
> map pages.
> 
> In addition to needing pfn_to_page() to be valid we also require devmap
> pages.  We need this to detect dax pages in the get_user_pages_fast()
> path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
> drivers that have not supported get_user_pages() to date we allow them
> to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
> option which requires ->direct_access() to return pfn_t_special() pfns.
> This leaves DAX support in brd disabled and scheduled for removal.
> 
> Note that when the initial dax support was being merged a few years back
> there was concern that struct page was unsuitable for use with next
> generation persistent memory devices. The theoretical concern was that
> struct page access, being such a hotly used data structure in the
> kernel, would lead to media wear out. While that was a reasonable
> conservative starting position it has not held true in practice. We have
> long since committed to using devm_memremap_pages() to support higher
> order kernel functionality that needs get_user_pages() and
> pfn_to_page().
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  arch/powerpc/platforms/Kconfig |    1 +
>  drivers/dax/super.c            |   10 ++++++++++
>  drivers/s390/block/Kconfig     |    1 +
>  fs/Kconfig                     |    7 +++++++
>  4 files changed, 19 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
> index 5a96a2763e4a..2ce89b42a9f4 100644
> --- a/arch/powerpc/platforms/Kconfig
> +++ b/arch/powerpc/platforms/Kconfig
> @@ -297,6 +297,7 @@ config AXON_RAM
>  	tristate "Axon DDR2 memory device driver"
>  	depends on PPC_IBM_CELL_BLADE && BLOCK
>  	select DAX
> +	select FS_DAX_LIMITED
>  	default m
>  	help
>  	  It registers one block device per Axon's DDR2 memory bank found
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 3ec804672601..473af694ad1c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -15,6 +15,7 @@
>  #include <linux/mount.h>
>  #include <linux/magic.h>
>  #include <linux/genhd.h>
> +#include <linux/pfn_t.h>
>  #include <linux/cdev.h>
>  #include <linux/hash.h>
>  #include <linux/slab.h>
> @@ -123,6 +124,15 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
>  		return len < 0 ? len : -EIO;
>  	}
>  
> +	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
> +			|| pfn_t_devmap(pfn))
> +		/* pass */;
> +	else {
> +		pr_debug("VFS (%s): error: dax support not enabled\n",
> +				sb->s_id);
> +		return -EOPNOTSUPP;
> +	}
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(__bdev_dax_supported);
> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
> index 31f014b57bfc..594ae5fc8e9d 100644
> --- a/drivers/s390/block/Kconfig
> +++ b/drivers/s390/block/Kconfig
> @@ -15,6 +15,7 @@ config BLK_DEV_XPRAM
>  config DCSSBLK
>  	def_tristate m
>  	select DAX
> +	select FS_DAX_LIMITED
>  	prompt "DCSSBLK support"
>  	depends on S390 && BLOCK
>  	help
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7aee6d699fd6..b40128bf6d1a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -58,6 +58,13 @@ config FS_DAX_PMD
>  	depends on ZONE_DEVICE
>  	depends on TRANSPARENT_HUGEPAGE
>  
> +# Selected by DAX drivers that do not expect filesystem DAX to support
> +# get_user_pages() of DAX mappings. I.e. "limited" indicates no support
> +# for fork() of processes with MAP_SHARED mappings or support for
> +# direct-I/O to a DAX mapping.
> +config FS_DAX_LIMITED
> +	bool
> +
>  endif # BLOCK
>  
>  # Posix ACL utility routines
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 07/18] dax: store pfns in the radix
  2017-12-24  0:56 ` [PATCH v4 07/18] dax: store pfns in the radix Dan Williams
  2017-12-27  0:17   ` Ross Zwisler
@ 2018-01-03 15:39   ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 15:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

On Sat 23-12-17 16:56:38, Dan Williams wrote:
> In preparation for examining the busy state of dax pages in the truncate
> path, switch from sectors to pfns in the radix.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me after comments are fixed as Ross asked. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  drivers/dax/super.c |   15 ++++++++--
>  fs/dax.c            |   75 ++++++++++++++++++---------------------------------
>  2 files changed, 39 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 473af694ad1c..516124ae1ccf 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
>  		return len < 0 ? len : -EIO;
>  	}
>  
> -	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
> -			|| pfn_t_devmap(pfn))
> +	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
> +		/*
> +		 * An arch that has enabled the pmem api should also
> +		 * have its drivers support pfn_t_devmap()
> +		 *
> +		 * This is a developer warning and should not trigger in
> +		 * production. dax_flush() will crash since it depends
> +		 * on being able to do (page_address(pfn_to_page())).
> +		 */
> +		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
> +	} else if (pfn_t_devmap(pfn)) {
>  		/* pass */;
> -	else {
> +	} else {
>  		pr_debug("VFS (%s): error: dax support not enabled\n",
>  				sb->s_id);
>  		return -EOPNOTSUPP;
> diff --git a/fs/dax.c b/fs/dax.c
> index 78b72c48374e..54071cd27e8c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -72,16 +72,15 @@ fs_initcall(init_dax_wait_table);
>  #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
>  #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
>  
> -static unsigned long dax_radix_sector(void *entry)
> +static unsigned long dax_radix_pfn(void *entry)
>  {
>  	return (unsigned long)entry >> RADIX_DAX_SHIFT;
>  }
>  
> -static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
> +static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
>  {
>  	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
> -			((unsigned long)sector << RADIX_DAX_SHIFT) |
> -			RADIX_DAX_ENTRY_LOCK);
> +			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
>  }
>  
>  static unsigned int dax_radix_order(void *entry)
> @@ -525,12 +524,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>   */
>  static void *dax_insert_mapping_entry(struct address_space *mapping,
>  				      struct vm_fault *vmf,
> -				      void *entry, sector_t sector,
> +				      void *entry, pfn_t pfn_t,
>  				      unsigned long flags, bool dirty)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	void *new_entry;
> +	unsigned long pfn = pfn_t_to_pfn(pfn_t);
>  	pgoff_t index = vmf->pgoff;
> +	void *new_entry;
>  
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> @@ -547,7 +547,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	new_entry = dax_radix_locked_entry(sector, flags);
> +	new_entry = dax_radix_locked_entry(pfn, flags);
>  
>  	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>  		/*
> @@ -659,17 +659,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
>  	i_mmap_unlock_read(mapping);
>  }
>  
> -static int dax_writeback_one(struct block_device *bdev,
> -		struct dax_device *dax_dev, struct address_space *mapping,
> -		pgoff_t index, void *entry)
> +static int dax_writeback_one(struct dax_device *dax_dev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	void *entry2, **slot, *kaddr;
> -	long ret = 0, id;
> -	sector_t sector;
> -	pgoff_t pgoff;
> +	void *entry2, **slot;
> +	unsigned long pfn;
> +	long ret = 0;
>  	size_t size;
> -	pfn_t pfn;
>  
>  	/*
>  	 * A page got tagged dirty in DAX mapping? Something is seriously
> @@ -688,7 +685,7 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * compare sectors as we must not bail out due to difference in lockbit
>  	 * or entry type.
>  	 */
> -	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
> +	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
>  		goto put_unlocked;
>  	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
>  				dax_is_zero_entry(entry))) {
> @@ -718,29 +715,11 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
>  	 * worry about partial PMD writebacks.
>  	 */
> -	sector = dax_radix_sector(entry);
> +	pfn = dax_radix_pfn(entry);
>  	size = PAGE_SIZE << dax_radix_order(entry);
>  
> -	id = dax_read_lock();
> -	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
> -	if (ret)
> -		goto dax_unlock;
> -
> -	/*
> -	 * dax_direct_access() may sleep, so cannot hold tree_lock over
> -	 * its invocation.
> -	 */
> -	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
> -	if (ret < 0)
> -		goto dax_unlock;
> -
> -	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
> -		ret = -EIO;
> -		goto dax_unlock;
> -	}
> -
> -	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
> -	dax_flush(dax_dev, kaddr, size);
> +	dax_mapping_entry_mkclean(mapping, index, pfn);
> +	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
>  	/*
>  	 * After we have flushed the cache, we can clear the dirty tag. There
>  	 * cannot be new dirty data in the pfn after the flush has completed as
> @@ -751,8 +730,6 @@ static int dax_writeback_one(struct block_device *bdev,
>  	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
> - dax_unlock:
> -	dax_read_unlock(id);
>  	put_locked_mapping_entry(mapping, index);
>  	return ret;
>  
> @@ -810,8 +787,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  				break;
>  			}
>  
> -			ret = dax_writeback_one(bdev, dax_dev, mapping,
> -					indices[i], pvec.pages[i]);
> +			ret = dax_writeback_one(dax_dev, mapping, indices[i],
> +					pvec.pages[i]);
>  			if (ret < 0) {
>  				mapping_set_error(mapping, ret);
>  				goto out;
> @@ -879,6 +856,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
>  	int ret = VM_FAULT_NOPAGE;
>  	struct page *zero_page;
>  	void *entry2;
> +	pfn_t pfn;
>  
>  	zero_page = ZERO_PAGE(0);
>  	if (unlikely(!zero_page)) {
> @@ -886,14 +864,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
>  		goto out;
>  	}
>  
> -	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
> +	pfn = page_to_pfn_t(zero_page);
> +	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
>  			RADIX_DAX_ZERO_PAGE, false);
>  	if (IS_ERR(entry2)) {
>  		ret = VM_FAULT_SIGBUS;
>  		goto out;
>  	}
>  
> -	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
> +	vm_insert_mixed(vmf->vma, vaddr, pfn);
>  out:
>  	trace_dax_load_hole(inode, vmf, ret);
>  	return ret;
> @@ -1200,8 +1179,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  		if (error < 0)
>  			goto error_finish_iomap;
>  
> -		entry = dax_insert_mapping_entry(mapping, vmf, entry,
> -						 dax_iomap_sector(&iomap, pos),
> +		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
>  						 0, write && !sync);
>  		if (IS_ERR(entry)) {
>  			error = PTR_ERR(entry);
> @@ -1286,13 +1264,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
>  	void *ret = NULL;
>  	spinlock_t *ptl;
>  	pmd_t pmd_entry;
> +	pfn_t pfn;
>  
>  	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
>  
>  	if (unlikely(!zero_page))
>  		goto fallback;
>  
> -	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
> +	pfn = page_to_pfn_t(zero_page);
> +	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
>  			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
>  	if (IS_ERR(ret))
>  		goto fallback;
> @@ -1415,8 +1395,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  		if (error < 0)
>  			goto finish_iomap;
>  
> -		entry = dax_insert_mapping_entry(mapping, vmf, entry,
> -						dax_iomap_sector(&iomap, pos),
> +		entry = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
>  						RADIX_DAX_PMD, write && !sync);
>  		if (IS_ERR(entry))
>  			goto finish_iomap;
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2018-01-02 21:51     ` Dan Williams
@ 2018-01-03 15:46       ` Jan Kara
  2018-01-03 20:37         ` Jeff Moyer
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2018-01-03 15:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Tue 02-01-18 13:51:49, Dan Williams wrote:
> On Tue, Jan 2, 2018 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
> >> In support of testing truncate colliding with dma add a mechanism that
> >> delays the completion of block I/O requests by a programmable number of
> >> seconds. This allows a truncate operation to be issued while page
> >> references are held for direct-I/O.
> >>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >
> > Why not put this in the generic bio layer code and then write a
> > generic fstest to exercise this truncate vs direct IO completion
> > race condition on all types of storage and filesystems?
> >
> > i.e. if it sits in a nvdimm test suite, it's never going to be run
> > by filesystem developers....
> 
> I do want to get it into xfstests eventually. I picked the nvdimm
> infrastructure for expediency of getting the fix developed. Also, I
> consider the collision in the non-dax case a solved problem since the
> core mm will keep the page out of circulation indefinitely.

Yes, but there are different races that could happen even for regular page
cache pages. So I also think it would be worthwhile to have this inside the
block layer possibly as part of the generic fault-injection framework which
is already there for fail_make_request. That already supports various
filtering, frequency, and other options that could be useful.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2018-01-02 20:21     ` Dan Williams
@ 2018-01-03 16:05       ` Jan Kara
  2018-01-04  8:27         ` Christoph Hellwig
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2018-01-03 16:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Matthew Wilcox, Matthew Wilcox,
	Christoph Hellwig, linux-xfs, linux-fsdevel, Andrew Morton

On Tue 02-01-18 12:21:28, Dan Williams wrote:
> >> +EXPORT_SYMBOL(dax_set_page_dirty);
> >> +EXPORT_SYMBOL(dax_direct_IO);
> >> +EXPORT_SYMBOL(dax_writepage);
> >> +EXPORT_SYMBOL(dax_readpage);
> >> +EXPORT_SYMBOL(dax_readpages);
> >> +EXPORT_SYMBOL(dax_write_begin);
> >> +EXPORT_SYMBOL(dax_write_end);
> >> +EXPORT_SYMBOL(dax_invalidatepage);
> >
> > Exporting all these symbols to modules isn't exactly free.  Are you sure it
> > doesn't make more sense to put tests for dax in the existing aops?
> >
> 
> I'd rather have just one global fs_dax_aops instance that all
> filesystems could reference, but ->writepages() is fundamentally an
> address_space_operation. Until we can rework that I'd prefer the
> overhead of the extra exports over sprinkling more IS_DAX checks
> around.

Just for the record, I agree with what Dave said about this patch. Generic
address_space_operations are not how aops are commonly defined by
filesystems. Just create one structure for each fs as Dave suggested.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS
  2018-01-02 21:40     ` Dan Williams
@ 2018-01-03 16:09       ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-03 16:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Tue 02-01-18 13:40:32, Dan Williams wrote:
> On Tue, Jan 2, 2018 at 1:15 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Sat, Dec 23, 2017 at 04:57:04PM -0800, Dan Williams wrote:
> >> In preparation for the dax implementation to start associating dax pages
> >> to inodes via page->mapping, we need to provide a 'struct
> >> address_space_operations' instance for dax. Otherwise, direct-I/O
> >> triggers incorrect page cache assumptions and warnings.
> >>
> >> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> >> Cc: linux-xfs@vger.kernel.org
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  fs/xfs/xfs_aops.c |    2 ++
> >>  fs/xfs/xfs_aops.h |    1 +
> >>  fs/xfs/xfs_iops.c |    5 ++++-
> >>  3 files changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> >> index 21e2d70884e1..361915d53cef 100644
> >> --- a/fs/xfs/xfs_aops.c
> >> +++ b/fs/xfs/xfs_aops.c
> >> @@ -1492,3 +1492,5 @@ const struct address_space_operations xfs_address_space_operations = {
> >>       .is_partially_uptodate  = block_is_partially_uptodate,
> >>       .error_remove_page      = generic_error_remove_page,
> >>  };
> >> +
> >> +DEFINE_FSDAX_AOPS(xfs_dax_address_space_operations, xfs_vm_writepages);
> >
> > Hmm, if we ever re-enable changing the DAX flag on the fly, will
> > mapping->a_ops have to change dynamically too?
> >
> > How sure are we that we'll never have to set anything in the dax aops
> > other than ->writepages?
> >
> > (I also kinda wonder why not just make the callers savvy vs. a bunch of
> > dummy aops, but maybe that's been tried in a previous iteration?)
> 
> Matthew had similar feedback. I pushed back that I think more IS_DAX
> sprinkling increases the long term maintenance burden, but now that
> you've independently asked for the same thing I'm not opposed to
> changing my mind.
> 
> Either way this need to switch the address_space_operations, or
> synchronize against in-flight address_space_operations is going to
> complicate the "dynamic toggle the dax mode" feature.

ext4 already dynamically switches aops for journalled data vs
non-journalled data vs nodelalloc case. That being said I'm not quite sure
this is bug-free ;)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism
  2018-01-03 15:46       ` Jan Kara
@ 2018-01-03 20:37         ` Jeff Moyer
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Moyer @ 2018-01-03 20:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Dave Chinner, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Andrew Morton

Jan Kara <jack@suse.cz> writes:

> On Tue 02-01-18 13:51:49, Dan Williams wrote:
>> On Tue, Jan 2, 2018 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sat, Dec 23, 2017 at 04:56:43PM -0800, Dan Williams wrote:
>> >> In support of testing truncate colliding with dma add a mechanism that
>> >> delays the completion of block I/O requests by a programmable number of
>> >> seconds. This allows a truncate operation to be issued while page
>> >> references are held for direct-I/O.
>> >>
>> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> >
>> > Why not put this in the generic bio layer code and then write a
>> > generic fstest to exercise this truncate vs direct IO completion
>> > race condition on all types of storage and filesystems?
>> >
>> > i.e. if it sits in a nvdimm test suite, it's never going to be run
>> > by filesystem developers....
>> 
>> I do want to get it into xfstests eventually. I picked the nvdimm
>> infrastructure for expediency of getting the fix developed. Also, I
>> consider the collision in the non-dax case a solved problem since the
>> core mm will keep the page out of circulation indefinitely.
>
> Yes, but there are different races that could happen even for regular page
> cache pages. So I also think it would be worthwhile to have this inside the
> block layer possibly as part of the generic fault-injection framework which
> is already there for fail_make_request. That already supports various
> filtering, frequency, and other options that could be useful.

Or consider extending the dm-delay target (which delays the queuing of
bios) to support delaying the completions.  I'm not sure I'm a fan of
sticking all sorts of debug code into the generic I/O submission path.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 01/18] mm, dax: introduce pfn_t_special()
  2017-12-24  0:56 ` [PATCH v4 01/18] mm, dax: introduce pfn_t_special() Dan Williams
@ 2018-01-04  8:16   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens, hch,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:56:06PM -0800, Dan Williams wrote:
> In support of removing the VM_MIXEDMAP indication from DAX VMAs,
> introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
> should be used for DAX ptes. This also helps identify drivers like
> dccssblk that only want to use DAX in a read-only fashion without
> get_user_pages() support.
> 
> Ideally we could delete axonram and dcssblk DAX support, but if we need
> to keep it better make it explicit that axonram and dcssblk only support
> a sub-set of DAX due to missing _PAGE_DEVMAP support.

I'd suggest to just delete it for now.  If the powerpc and S/390
people have an interest in it they should resurrect it properly and
fully.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax
  2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
  2018-01-03 15:29   ` Jan Kara
@ 2018-01-04  8:16   ` Christoph Hellwig
  2018-01-08 11:58   ` Gerald Schaefer
  2 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens, hch,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, akpm, Gerald Schaefer

Fine with me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch
  2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
                   ` (17 preceding siblings ...)
  2017-12-24  0:57 ` [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper Dan Williams
@ 2018-01-04  8:17 ` Christoph Hellwig
  18 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, Peter Zijlstra, Benjamin Herrenschmidt,
	Dave Hansen, Heiko Carstens, Andreas Dilger, hch, Matthew Wilcox,
	Michael Ellerman, Ingo Molnar, linux-ext4, Dave Chinner,
	Martin Schwidefsky, Jérôme Glisse, Alexander Viro,
	Gerald Schaefer, Theodore Ts'o, linux-nvdimm, linux-xfs,
	Jan Kara, linux-fsdevel, Paul Mackerras, akpm, Darrick J. Wong,
	Kirill A. Shutemov

> So far this solution only targets xfs since it already implements
> xfs_break_layouts in all the locations that would need this
> synchronization. It applies on top of the vmem_altmap / dev_pagemap
> reworks from Christoph.

Those got a rebase since your posting of this series, btw.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  2017-12-24  0:56 ` [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
@ 2018-01-04  8:20   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, linux-nvdimm, hch, linux-xfs,
	Jérôme Glisse, linux-fsdevel, akpm

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  2017-12-24  0:56 ` [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
@ 2018-01-04  8:25   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, jack, linux-nvdimm, hch, linux-xfs,
	Jérôme Glisse, linux-fsdevel, akpm

This looks fine except for a few nitpicks below:

>  	}
>  
> +	dev_pagemap_enable_ops();
>  	pgmap->type = MEMORY_DEVICE_FS_DAX;
>  	pgmap->page_free = generic_dax_pagefree;
>  	pgmap->data = owner;
> @@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
>  	pgmap->type = MEMORY_DEVICE_HOST;
>  	pgmap->page_free = NULL;
>  	pgmap->data = NULL;
> +	dev_pagemap_disable_ops();

should these be get/put instead of enable/disable given that they
implement a refcount?
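
E.g. (hypothetical names, same refcount semantics as the posted
enable/disable pair):

void dev_pagemap_get_ops(void)
{
	if (atomic_inc_return(&devmap_enable) == 1)
		static_branch_enable(&devmap_managed_key);
}

void dev_pagemap_put_ops(void)
{
	if (atomic_dec_and_test(&devmap_enable))
		static_branch_disable(&devmap_managed_key);
}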

> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +void __put_devmap_managed_page(struct page *page);
> +DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> +static inline bool put_devmap_managed_page(struct page *page)
>  {
> +	if (!static_branch_unlikely(&devmap_managed_key))
> +		return false;
> +	if (!is_zone_device_page(page))
> +		return false;
> +	switch (page->pgmap->type) {
> +	case MEMORY_DEVICE_PRIVATE:
> +	case MEMORY_DEVICE_PUBLIC:
> +	case MEMORY_DEVICE_FS_DAX:
> +		__put_devmap_managed_page(page);
> +		return true;
> +	default:
> +		break;
> +	}
> +	return false;

Should the switch be moved into the out-of-line version to keep
the inline instructions to a minimum?
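
E.g. something like this sketch, with __put_devmap_managed_page()
changed to return bool and carry the pgmap->type switch:

static inline bool put_devmap_managed_page(struct page *page)
{
	if (!static_branch_unlikely(&devmap_managed_key))
		return false;
	if (!is_zone_device_page(page))
		return false;
	return __put_devmap_managed_page(page);
}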

> + * pages go idle.
> + */
> +void dev_pagemap_enable_ops(void)
> +{
> +	if (atomic_inc_return(&devmap_enable) == 1)
> +		static_branch_enable(&devmap_managed_key);
> +}
> +EXPORT_SYMBOL(dev_pagemap_enable_ops);

_GPL?

> +EXPORT_SYMBOL(dev_pagemap_disable_ops);

Same.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS
  2018-01-03 16:05       ` Jan Kara
@ 2018-01-04  8:27         ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Matthew Wilcox, Matthew Wilcox, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Wed, Jan 03, 2018 at 05:05:03PM +0100, Jan Kara wrote:
> Just for record I agree with what Dave said about this patch. Generic
> address_space_operations are not how aops are commonly defined by
> filesystems. Just create one structure for each fs as Dave suggested.

Agreed.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS
  2017-12-24  0:57 ` [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS Dan Williams
  2018-01-02 21:15   ` Darrick J. Wong
@ 2018-01-04  8:28   ` Christoph Hellwig
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Darrick J. Wong, linux-nvdimm, hch, linux-xfs, linux-fsdevel, akpm

On Sat, Dec 23, 2017 at 04:57:04PM -0800, Dan Williams wrote:
> In preparation for the dax implementation to start associating dax pages
> to inodes via page->mapping, we need to provide a 'struct
> address_space_operations' instance for dax. Otherwise, direct-I/O
> triggers incorrect page cache assumptions and warnings.

As mentioned in the previous patch please opencode the ops.

Also now that there are different address space ops please implement
a purely DAX-specific xfs_dax_writepages instead of handling both in
one helper.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 13/18] ext4: use DEFINE_FSDAX_AOPS
  2017-12-24  0:57 ` [PATCH v4 13/18] ext4: " Dan Williams
@ 2018-01-04  8:29   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, hch, linux-xfs, Andreas Dilger,
	linux-fsdevel, Theodore Ts'o, akpm, linux-ext4

Same comment as for XFS applies here as well.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 14/18] ext2: use DEFINE_FSDAX_AOPS
  2017-12-24  0:57 ` [PATCH v4 14/18] ext2: " Dan Williams
@ 2018-01-04  8:29   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, hch, linux-xfs, Jan Kara, linux-fsdevel, akpm

Same comment as for XFS applies here as well.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate
  2017-12-24  0:57 ` [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate Dan Williams
@ 2018-01-04  8:30   ` Christoph Hellwig
  2018-01-04  9:39   ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one
  2017-12-24  0:57 ` [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
@ 2018-01-04  8:30   ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Peter Zijlstra, hch, linux-xfs, Ingo Molnar,
	linux-fsdevel, akpm

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2017-12-24  0:57 ` [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions Dan Williams
@ 2018-01-04  8:31   ` Christoph Hellwig
  2018-01-04 11:12   ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Matthew Wilcox, Dave Hansen, Dave Chinner, hch, linux-xfs,
	linux-nvdimm, Alexander Viro, linux-fsdevel, akpm,
	Darrick J. Wong

Not pretty, but probably the best we can do for now..

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2018-01-02 23:00   ` Dave Chinner
  2018-01-03  2:21     ` Dan Williams
@ 2018-01-04  8:33     ` Christoph Hellwig
  1 sibling, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: jack, linux-nvdimm, Darrick J. Wong, hch, linux-xfs, linux-fsdevel, akpm

On Wed, Jan 03, 2018 at 10:00:27AM +1100, Dave Chinner wrote:
> AFAIC, these DMA references are just another external layout
> reference that needs to be broken.  IOWs, this "sync DMA" complexity
> needs to go inside xfs_break_layouts() as it is part of breaking the
> external reference to the file layout - it does not replace the
> layout breaking abstraction and so the implementation needs to
> reflect that.

Agreed.
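
Roughly, that would mean a shape like the following (just a sketch;
xfs_break_pnfs_layouts() and dax_wait_dma_idle() are placeholder names
standing in for the existing lease-break code and a dax-side wait, not
anything from the patch):

/* sketch: make the DMA wait part of the layout-break abstraction */
int xfs_break_layouts(struct inode *inode, uint *iolock)
{
    int error;

    /* existing behaviour: break pNFS leases first */
    error = xfs_break_pnfs_layouts(inode, iolock);
    if (error)
        return error;

    /* new: also wait for DAX pages pinned via get_user_pages() */
    if (IS_DAX(inode))
        error = dax_wait_dma_idle(inode->i_mapping);

    return error;
}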

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper
  2018-01-03  7:51       ` Dave Chinner
@ 2018-01-04  8:34         ` Christoph Hellwig
  0 siblings, 0 replies; 66+ messages in thread
From: Christoph Hellwig @ 2018-01-04  8:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Christoph Hellwig,
	linux-xfs, linux-fsdevel, Andrew Morton

On Wed, Jan 03, 2018 at 06:51:12PM +1100, Dave Chinner wrote:
> > "Additionally we call it during the write operation, where aren't
> > concerned about exposing unallocated blocks but just want to provide
> > basic synchronization between a local writer and pNFS clients.  mmap
> > writes would also benefit from this sort of synchronization, but due
> > to the tricky locking rules in the page fault path we don't bother."
> > ---
> 
> The pnfs code went into 3.20 (4.0, IIRC), whilst the XFS_MMAPLOCK
> code went into 4.1. So the pnfs code was written and tested by
> Christoph a long time before I added the XFS_MMAPLOCK, despite them
> landing only one release apart. We've never really gone back to look
> at this because there hasn't been a need until now....

I suspect we should drop the MMAPLOCK as well, but I'd need to re-read
and re-test the code.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate
  2017-12-24  0:57 ` [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate Dan Williams
  2018-01-04  8:30   ` Christoph Hellwig
@ 2018-01-04  9:39   ` Jan Kara
  1 sibling, 0 replies; 66+ messages in thread
From: Jan Kara @ 2018-01-04  9:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Matthew Wilcox, hch, linux-xfs, linux-fsdevel, akpm

On Sat 23-12-17 16:57:20, Dan Williams wrote:
> Catch cases where truncate encounters pages that are still under active
> dma. This warning is a canary for potential data corruption as truncated
> blocks could be allocated to a new file while the device is still
> performing i/o.
> 
> Here is an example of a collision that this implementation catches:
> 
>  WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
>  [..]
>  Call Trace:
>   __dax_invalidate_mapping_entry+0x6c/0xf0
>   dax_delete_mapping_entry+0xf/0x20
>   truncate_exceptional_pvec_entries.part.12+0x1af/0x200
>   truncate_inode_pages_range+0x268/0x970
>   ? tlb_gather_mmu+0x10/0x20
>   ? up_write+0x1c/0x40
>   ? unmap_mapping_range+0x73/0x140
>   xfs_free_file_space+0x1b6/0x5b0 [xfs]
>   ? xfs_file_fallocate+0x7f/0x320 [xfs]
>   ? down_write_nested+0x40/0x70
>   ? xfs_ilock+0x21d/0x2f0 [xfs]
>   xfs_file_fallocate+0x162/0x320 [xfs]
>   ? rcu_read_lock_sched_held+0x3f/0x70
>   ? rcu_sync_lockdep_assert+0x2a/0x50
>   ? __sb_start_write+0xd0/0x1b0
>   ? vfs_fallocate+0x20c/0x270
>   vfs_fallocate+0x154/0x270
>   SyS_fallocate+0x43/0x80
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2017-12-24  0:57 ` [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions Dan Williams
  2018-01-04  8:31   ` Christoph Hellwig
@ 2018-01-04 11:12   ` Jan Kara
  2018-01-07 21:58     ` Dan Williams
  1 sibling, 1 reply; 66+ messages in thread
From: Jan Kara @ 2018-01-04 11:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, Matthew Wilcox, Dave Hansen, Dave Chinner, hch, linux-xfs,
	linux-nvdimm, Alexander Viro, linux-fsdevel, akpm,
	Darrick J. Wong

On Sat 23-12-17 16:57:31, Dan Williams wrote:
> +static struct page *dma_busy_page(void *entry)
> +{
> +	unsigned long pfn, end_pfn;
> +
> +	for_each_entry_pfn(entry, pfn, end_pfn) {
> +		struct page *page = pfn_to_page(pfn);
> +
> +		if (page_ref_count(page) > 1)
> +			return page;
> +	}
> +	return NULL;
> +}
> +
>  /*
>   * Find radix tree entry at given index. If it points to an exceptional entry,
>   * return it with the radix tree entry locked. If the radix tree doesn't
> @@ -557,6 +570,87 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>  	return entry;
>  }
>  
> +int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action)

I don't quite like the 'dma' terminology when this is all about page
references in fact. How about renaming like dma_busy_page() ->
devmap_page_referenced() instead and dax_flush_dma() -> dax_wait_pages_unused()
or something like that?

> +{
> +	pgoff_t	indices[PAGEVEC_SIZE];
> +	struct pagevec pvec;
> +	pgoff_t	index, end;
> +	unsigned i;
> +
> +	/* in the limited case get_user_pages for dax is disabled */
> +	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> +		return 0;
> +
> +	if (!dax_mapping(mapping))
> +		return 0;
> +
> +	if (mapping->nrexceptional == 0)
> +		return 0;
> +
> +retry:
> +	pagevec_init(&pvec);
> +	index = 0;
> +	end = -1;
> +	unmap_mapping_range(mapping, 0, 0, 1);

unmap_mapping_range() would IMHO be more logical in the callers. Maybe
a cleaner API would be like providing a function
dax_find_referenced_page(mapping) which either returns NULL or a page that
has elevated refcount. Filesystem can then drop locks it needs to and call
wait_on_atomic_one() (possibly hidden in a DAX helper). When wait finishes,
filesystem can do the retry. That way the whole lock, unlock, wait, retry
logic is clearly visible in fs code, there's no need of 'action' function
or propagation of locking state etc.
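
Something along these lines, where dax_find_referenced_page() and
dax_wait_page_idle() are placeholder names for the helpers described
above (the latter being a thin wrapper around wait_on_atomic_one()):

static int fs_wait_dax_dma(struct inode *inode)
{
    struct page *page;

    for (;;) {
        /* new faults are blocked by the fs mmap lock held here */
        unmap_mapping_range(inode->i_mapping, 0, 0, 1);

        page = dax_find_referenced_page(inode->i_mapping);
        if (!page)
            return 0;

        /* the filesystem would drop its locks here, then: */
        dax_wait_page_idle(page);
        /* ...retake them and rescan */
    }
}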

> +	/*
> +	 * Flush dax_dma_lock() sections to ensure all possible page
> +	 * references have been taken, or will block on the fs
> +	 * 'mmap_lock'.
> +	 */
> +	synchronize_rcu();

Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
for the pfn? We could indicate that by some bit in struct page or something
like that.

> +	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
> +				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +				indices)) {
> +		int rc = 0;
> +
> +		for (i = 0; i < pagevec_count(&pvec); i++) {
> +			struct page *pvec_ent = pvec.pages[i];
> +			struct page *page = NULL;
> +			void *entry;
> +
> +			index = indices[i];
> +			if (index >= end)
> +				break;
> +
> +			if (!radix_tree_exceptional_entry(pvec_ent))
> +				continue;

This would be a bug so I'm not sure we need to handle that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-01-04 11:12   ` Jan Kara
@ 2018-01-07 21:58     ` Dan Williams
  2018-01-08 13:50       ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-01-07 21:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-nvdimm, Dave Hansen, Dave Chinner,
	linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
> On Sat 23-12-17 16:57:31, Dan Williams wrote:
>> +static struct page *dma_busy_page(void *entry)
>> +{
>> +     unsigned long pfn, end_pfn;
>> +
>> +     for_each_entry_pfn(entry, pfn, end_pfn) {
>> +             struct page *page = pfn_to_page(pfn);
>> +
>> +             if (page_ref_count(page) > 1)
>> +                     return page;
>> +     }
>> +     return NULL;
>> +}
>> +
>>  /*
>>   * Find radix tree entry at given index. If it points to an exceptional entry,
>>   * return it with the radix tree entry locked. If the radix tree doesn't
>> @@ -557,6 +570,87 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>>       return entry;
>>  }
>>
>> +int dax_flush_dma(struct address_space *mapping, wait_atomic_t_action_f action)
>
> I don't quite like the 'dma' terminology when this is all about page
> references in fact. How about renaming like dma_busy_page() ->
> devmap_page_referenced() instead and dax_flush_dma() -> dax_wait_pages_unused()
> or something like that?

Sure, but this is moot given your better proposal below.

>
>> +{
>> +     pgoff_t indices[PAGEVEC_SIZE];
>> +     struct pagevec pvec;
>> +     pgoff_t index, end;
>> +     unsigned i;
>> +
>> +     /* in the limited case get_user_pages for dax is disabled */
>> +     if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> +             return 0;
>> +
>> +     if (!dax_mapping(mapping))
>> +             return 0;
>> +
>> +     if (mapping->nrexceptional == 0)
>> +             return 0;
>> +
>> +retry:
>> +     pagevec_init(&pvec);
>> +     index = 0;
>> +     end = -1;
>> +     unmap_mapping_range(mapping, 0, 0, 1);
>
> unmap_mapping_range() would IMHO be more logical in the callers. Maybe
> a cleaner API would be like providing a function
> dax_find_referenced_page(mapping) which either returns NULL or a page that
> has elevated refcount. Filesystem can then drop locks it needs to and call
> wait_on_atomic_one() (possibly hidden in a DAX helper). When wait finishes,
> filesystem can do the retry. That way the whole lock, unlock, wait, retry
> logic is clearly visible in fs code, there's no need of 'action' function
> or propagation of locking state etc.

Yes, sounds better; I'll go this way.

>
>> +     /*
>> +      * Flush dax_dma_lock() sections to ensure all possible page
>> +      * references have been taken, or will block on the fs
>> +      * 'mmap_lock'.
>> +      */
>> +     synchronize_rcu();
>
> Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
> Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
> for the pfn? We could indicate that by some bit in struct page or something
> like that.

We would need a lockless way to take a reference conditionally if the
page is not subject to truncation.

I recall the raid5 code did something similar where it split a
reference count into 2 fields. I.e. take page->_refcount and use the
upper bits as a truncation count. Something like:

do {
    old = atomic_read(&page->_refcount);
    if (old & trunc_mask) /* upper bits of _refcount */
        return false;
    new = old + 1;
} while (atomic_cmpxchg(&page->_refcount, old, new) != old);
return true; /* we incremented the _refcount while the truncation
count was zero */

...the only concern is teaching the put_page() path to consider that
'trunc_mask' when determining that the page is idle.
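
As a sketch of that put side (DAX_TRUNC_SHIFT/DAX_TRUNC_MASK and the use
of wake_up_atomic_one() here are illustrative, not a worked-out design):

/* upper bits of _refcount double as a truncation count */
#define DAX_TRUNC_SHIFT 24
#define DAX_TRUNC_MASK  (~0U << DAX_TRUNC_SHIFT)

static void dax_put_page(struct page *page)
{
    /* mask off any pending truncation count when testing for idle */
    if ((page_ref_dec_return(page) & ~DAX_TRUNC_MASK) == 1)
        wake_up_atomic_one(&page->_refcount);
}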

Other ideas?

>> +     while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
>> +                             min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> +                             indices)) {
>> +             int rc = 0;
>> +
>> +             for (i = 0; i < pagevec_count(&pvec); i++) {
>> +                     struct page *pvec_ent = pvec.pages[i];
>> +                     struct page *page = NULL;
>> +                     void *entry;
>> +
>> +                     index = indices[i];
>> +                     if (index >= end)
>> +                             break;
>> +
>> +                     if (!radix_tree_exceptional_entry(pvec_ent))
>> +                             continue;
>
> This would be a bug so I'm not sure we need to handle that.

Sure, I can kill that check.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax
  2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
  2018-01-03 15:29   ` Jan Kara
  2018-01-04  8:16   ` Christoph Hellwig
@ 2018-01-08 11:58   ` Gerald Schaefer
  2 siblings, 0 replies; 66+ messages in thread
From: Gerald Schaefer @ 2018-01-08 11:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: jack, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens, hch,
	linux-xfs, Martin Schwidefsky, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, akpm

On Sat, 23 Dec 2017 16:56:22 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> If a dax buffer from a device that does not map pages is passed to
> read(2) or write(2) as a target for direct-I/O it triggers SIGBUS. If
> gdb attempts to examine the contents of a dax buffer from a device that
> does not map pages it triggers SIGBUS. If fork(2) is called on a process
> with a dax mapping from a device that does not map pages it triggers
> SIGBUS. 'struct page' is required otherwise several kernel code paths
> break in surprising ways. Disable filesystem-dax on devices that do not
> map pages.
> 
> In addition to needing pfn_to_page() to be valid we also require devmap
> pages.  We need this to detect dax pages in the get_user_pages_fast()
> path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
> drivers that have not supported get_user_pages() to date we allow them
> to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
> option which requires ->direct_access() to return pfn_t_special() pfns.
> This leaves DAX support in brd disabled and scheduled for removal.
> 
> Note that when the initial dax support was being merged a few years back
> there was concern that struct page was unsuitable for use with next
> generation persistent memory devices. The theoretical concern was that
> struct page access, being such a hotly used data structure in the
> kernel, would lead to media wear out. While that was a reasonable
> conservative starting position it has not held true in practice. We have
> long since committed to using devm_memremap_pages() to support higher
> order kernel functionality that needs get_user_pages() and
> pfn_to_page().
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/powerpc/platforms/Kconfig |    1 +
>  drivers/dax/super.c            |   10 ++++++++++
>  drivers/s390/block/Kconfig     |    1 +
>  fs/Kconfig                     |    7 +++++++
>  4 files changed, 19 insertions(+)
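
In code terms the policy boils down to a pfn capability check at dax
setup time; a condensed sketch (not the literal patch hunk, and eliding
the dax_read_lock() the real code would hold):

static bool dax_pfn_supported(struct dax_device *dax_dev)
{
    void *kaddr;
    pfn_t pfn;
    long len;

    len = dax_direct_access(dax_dev, 0, 1, &kaddr, &pfn);
    if (len < 1)
        return false;

    /* limited dax: page-less pfns allowed, but no get_user_pages() */
    if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
        return true;

    /* full fs-dax requires devmap-backed 'struct page' pfns */
    return pfn_t_devmap(pfn);
}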

dcssblk seems to work fine; I did not see any SIGBUS while "executing
in place" from dcssblk with the current upstream kernel, maybe because
we only use dcssblk with fs dax in read-only mode.

Anyway, the dcssblk change is fine with me. I will look into adding
struct pages for dcssblk memory later, to make it work again with
this change, but for now I do not know of anyone needing this in the
upstream kernel.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-01-07 21:58     ` Dan Williams
@ 2018-01-08 13:50       ` Jan Kara
  2018-03-08 17:02         ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2018-01-08 13:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner, linux-xfs,
	linux-nvdimm, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Sun 07-01-18 13:58:42, Dan Williams wrote:
> On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
> > On Sat 23-12-17 16:57:31, Dan Williams wrote:
> >
> >> +     /*
> >> +      * Flush dax_dma_lock() sections to ensure all possible page
> >> +      * references have been taken, or will block on the fs
> >> +      * 'mmap_lock'.
> >> +      */
> >> +     synchronize_rcu();
> >
> > Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
> > Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
> > for the pfn? We could indicate that by some bit in struct page or something
> > like that.
> 
> We would need a lockless way to take a reference conditionally if the
> page is not subject to truncation.
> 
> I recall the raid5 code did something similar where it split a
> reference count into 2 fields. I.e. take page->_refcount and use the
> upper bits as a truncation count. Something like:
> 
> do {
>     old = atomic_read(&page->_refcount);
>     if (old & trunc_mask) /* upper bits of _refcount */
>         return false;
>     new = cnt + 1;
> } while (atomic_cmpxchg(&page->_refcount, old, new) != old);
> return true; /* we incremented the _refcount while the truncation
> count was zero */
> 
> ...the only concern is teaching the put_page() path to consider that
> 'trunc_mask' when determining that the page is idle.
> 
> Other ideas?

What I rather thought about was an update to GUP paths (like
follow_page_pte()):

	if (flags & FOLL_GET) {
		get_page(page);
		if (pte_devmap(pte)) {
			/*
			 * Pairs with the barrier in the truncate path.
			 * Could be possibly _after_atomic version of the
			 * barrier.
			 */
			smp_mb();
			if (PageTruncateInProgress(page)) {
				put_page(page);
				..bail...
			}
		}
	}

and in the truncate path:

	down_write(inode->i_mmap_sem);
	walk all pages in the mapping and mark them PageTruncateInProgress().
	unmap_mapping_range(...);
	/*
	 * Pairs with the barrier in GUP path. In fact not necessary since
	 * unmap_mapping_range() provides us with the barrier already.
	 */
	smp_mb();
	/*
	 * By now we are either guaranteed to see grabbed page reference or
	 * GUP is guaranteed to see PageTruncateInProgress().
	 */
	while ((page = dax_find_referenced_page(mapping))) {
		...
	}

The barriers need some verification, I've opted for the conservative option
but I guess you get the idea.


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-01-08 13:50       ` Jan Kara
@ 2018-03-08 17:02         ` Dan Williams
  2018-03-09 12:56           ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-03-08 17:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-nvdimm, Dave Hansen, Dave Chinner,
	linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Mon, Jan 8, 2018 at 5:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Sun 07-01-18 13:58:42, Dan Williams wrote:
>> On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Sat 23-12-17 16:57:31, Dan Williams wrote:
>> >
>> >> +     /*
>> >> +      * Flush dax_dma_lock() sections to ensure all possible page
>> >> +      * references have been taken, or will block on the fs
>> >> +      * 'mmap_lock'.
>> >> +      */
>> >> +     synchronize_rcu();
>> >
>> > Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
>> > Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
>> > for the pfn? We could indicate that by some bit in struct page or something
>> > like that.
>>
>> We would need a lockless way to take a reference conditionally if the
>> page is not subject to truncation.
>>
>> I recall the raid5 code did something similar where it split a
>> reference count into 2 fields. I.e. take page->_refcount and use the
>> upper bits as a truncation count. Something like:
>>
>> do {
>>     old = atomic_read(&page->_refcount);
>>     if (old & trunc_mask) /* upper bits of _refcount */
>>         return false;
>>     new = cnt + 1;
>> } while (atomic_cmpxchg(&page->_refcount, old, new) != old);
>> return true; /* we incremented the _refcount while the truncation
>> count was zero */
>>
>> ...the only concern is teaching the put_page() path to consider that
>> 'trunc_mask' when determining that the page is idle.
>>
>> Other ideas?
>
> What I rather thought about was an update to GUP paths (like
> follow_page_pte()):
>
>         if (flags & FOLL_GET) {
>                 get_page(page);
>                 if (pte_devmap(pte)) {
>                         /*
>                          * Pairs with the barrier in the truncate path.
>                          * Could be possibly _after_atomic version of the
>                          * barrier.
>                          */
>                         smp_mb();
>                         if (PageTruncateInProgress(page)) {
>                                 put_page(page);
>                                 ..bail...
>                         }
>                 }
>         }
>
> and in the truncate path:
>
>         down_write(inode->i_mmap_sem);
>         walk all pages in the mapping and mark them PageTruncateInProgress().
>         unmap_mapping_range(...);
>         /*
>          * Pairs with the barrier in GUP path. In fact not necessary since
>          * unmap_mapping_range() provides us with the barrier already.
>          */
>         smp_mb();
>         /*
>          * By now we are either guaranteed to see grabbed page reference or
>          * GUP is guaranteed to see PageTruncateInProgress().
>          */
>         while ((page = dax_find_referenced_page(mapping))) {
>                 ...
>         }
>
> The barriers need some verification, I've opted for the conservative option
> but I guess you get the idea.

[ Reviving this thread for the next rev of this patch set for 4.17
consideration ]

I don't think this barrier scheme can work in the presence of
get_user_pages_fast(). The get_user_pages_fast() path can race
unmap_mapping_range() to take out an elevated reference count on a
page.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-03-08 17:02         ` Dan Williams
@ 2018-03-09 12:56           ` Jan Kara
  2018-03-09 16:15             ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2018-03-09 12:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner, linux-xfs,
	linux-nvdimm, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Thu 08-03-18 09:02:30, Dan Williams wrote:
> On Mon, Jan 8, 2018 at 5:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 07-01-18 13:58:42, Dan Williams wrote:
> >> On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Sat 23-12-17 16:57:31, Dan Williams wrote:
> >> >
> >> >> +     /*
> >> >> +      * Flush dax_dma_lock() sections to ensure all possible page
> >> >> +      * references have been taken, or will block on the fs
> >> >> +      * 'mmap_lock'.
> >> >> +      */
> >> >> +     synchronize_rcu();
> >> >
> >> > Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
> >> > Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
> >> > for the pfn? We could indicate that by some bit in struct page or something
> >> > like that.
> >>
> >> We would need a lockless way to take a reference conditionally if the
> >> page is not subject to truncation.
> >>
> >> I recall the raid5 code did something similar where it split a
> >> reference count into 2 fields. I.e. take page->_refcount and use the
> >> upper bits as a truncation count. Something like:
> >>
> >> do {
> >>     old = atomic_read(&page->_refcount);
> >>     if (old & trunc_mask) /* upper bits of _refcount */
> >>         return false;
> >>     new = cnt + 1;
> >> } while (atomic_cmpxchg(&page->_refcount, old, new) != old);
> >> return true; /* we incremented the _refcount while the truncation
> >> count was zero */
> >>
> >> ...the only concern is teaching the put_page() path to consider that
> >> 'trunc_mask' when determining that the page is idle.
> >>
> >> Other ideas?
> >
> > What I rather thought about was an update to GUP paths (like
> > follow_page_pte()):
> >
> >         if (flags & FOLL_GET) {
> >                 get_page(page);
> >                 if (pte_devmap(pte)) {
> >                         /*
> >                          * Pairs with the barrier in the truncate path.
> >                          * Could be possibly _after_atomic version of the
> >                          * barrier.
> >                          */
> >                         smp_mb();
> >                         if (PageTruncateInProgress(page)) {
> >                                 put_page(page);
> >                                 ..bail...
> >                         }
> >                 }
> >         }
> >
> > and in the truncate path:
> >
> >         down_write(inode->i_mmap_sem);
> >         walk all pages in the mapping and mark them PageTruncateInProgress().
> >         unmap_mapping_range(...);
> >         /*
> >          * Pairs with the barrier in GUP path. In fact not necessary since
> >          * unmap_mapping_range() provides us with the barrier already.
> >          */
> >         smp_mb();
> >         /*
> >          * By now we are either guaranteed to see grabbed page reference or
> >          * GUP is guaranteed to see PageTruncateInProgress().
> >          */
> >         while ((page = dax_find_referenced_page(mapping))) {
> >                 ...
> >         }
> >
> > The barriers need some verification, I've opted for the conservative option
> > but I guess you get the idea.
> 
> [ Reviving this thread for the next rev of this patch set for 4.17
> consideration ]
> 
> I don't think this barrier scheme can work in the presence of
> get_user_pages_fast(). The get_user_pages_fast() path can race
> unmap_mapping_range() to take out an elevated reference count on a
> page.

Why the scheme cannot work? Sure you'd need to patch also gup_pte_range()
and a similar thing for PMDs to recheck PageTruncateInProgress() after
grabbing the page reference. But in principle I don't see anything
fundamentally different between gup_fast() and plain gup().
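
I.e. a recheck helper shared by both gup paths could look roughly like
this (PageTruncateInProgress() being the hypothetical flag from the
scheme above, not an existing page flag):

static bool dax_gup_recheck(struct page *page, pte_t pte)
{
    if (!pte_devmap(pte))
        return true;

    /* pairs with the barrier on the truncate side */
    smp_mb();
    if (PageTruncateInProgress(page)) {
        /* caller drops the reference and falls back to slow gup */
        return false;
    }
    return true;
}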

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-03-09 12:56           ` Jan Kara
@ 2018-03-09 16:15             ` Dan Williams
  2018-03-09 17:26               ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-03-09 16:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-nvdimm, Dave Hansen, Dave Chinner,
	linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Fri, Mar 9, 2018 at 4:56 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 08-03-18 09:02:30, Dan Williams wrote:
>> On Mon, Jan 8, 2018 at 5:50 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Sun 07-01-18 13:58:42, Dan Williams wrote:
>> >> On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Sat 23-12-17 16:57:31, Dan Williams wrote:
>> >> >
>> >> >> +     /*
>> >> >> +      * Flush dax_dma_lock() sections to ensure all possible page
>> >> >> +      * references have been taken, or will block on the fs
>> >> >> +      * 'mmap_lock'.
>> >> >> +      */
>> >> >> +     synchronize_rcu();
>> >> >
>> >> > Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
>> >> > Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
>> >> > for the pfn? We could indicate that by some bit in struct page or something
>> >> > like that.
>> >>
>> >> We would need a lockless way to take a reference conditionally if the
>> >> page is not subject to truncation.
>> >>
>> >> I recall the raid5 code did something similar where it split a
>> >> reference count into 2 fields. I.e. take page->_refcount and use the
>> >> upper bits as a truncation count. Something like:
>> >>
>> >> do {
>> >>     old = atomic_read(&page->_refcount);
>> >>     if (old & trunc_mask) /* upper bits of _refcount */
>> >>         return false;
>> >>     new = cnt + 1;
>> >> } while (atomic_cmpxchg(&page->_refcount, old, new) != old);
>> >> return true; /* we incremented the _refcount while the truncation
>> >> count was zero */
>> >>
>> >> ...the only concern is teaching the put_page() path to consider that
>> >> 'trunc_mask' when determining that the page is idle.
>> >>
>> >> Other ideas?
>> >
>> > What I rather thought about was an update to GUP paths (like
>> > follow_page_pte()):
>> >
>> >         if (flags & FOLL_GET) {
>> >                 get_page(page);
>> >                 if (pte_devmap(pte)) {
>> >                         /*
>> >                          * Pairs with the barrier in the truncate path.
>> >                          * Could be possibly _after_atomic version of the
>> >                          * barrier.
>> >                          */
>> >                         smp_mb();
>> >                         if (PageTruncateInProgress(page)) {
>> >                                 put_page(page);
>> >                                 ..bail...
>> >                         }
>> >                 }
>> >         }
>> >
>> > and in the truncate path:
>> >
>> >         down_write(inode->i_mmap_sem);
>> >         walk all pages in the mapping and mark them PageTruncateInProgress().
>> >         unmap_mapping_range(...);
>> >         /*
>> >          * Pairs with the barrier in GUP path. In fact not necessary since
>> >          * unmap_mapping_range() provides us with the barrier already.
>> >          */
>> >         smp_mb();
>> >         /*
>> >          * By now we are either guaranteed to see grabbed page reference or
>> >          * GUP is guaranteed to see PageTruncateInProgress().
>> >          */
>> >         while ((page = dax_find_referenced_page(mapping))) {
>> >                 ...
>> >         }
>> >
>> > The barriers need some verification, I've opted for the conservative option
>> > but I guess you get the idea.
>>
>> [ Reviving this thread for the next rev of this patch set for 4.17
>> consideration ]
>>
>> I don't think this barrier scheme can work in the presence of
>> get_user_pages_fast(). The get_user_pages_fast() path can race
>> unmap_mapping_range() to take out an elevated reference count on a
>> page.
>
> Why the scheme cannot work? Sure you'd need to patch also gup_pte_range()
> and a similar thing for PMDs to recheck PageTruncateInProgress() after
> grabbing the page reference. But in principle I don't see anything
> fundamentally different between gup_fast() and plain gup().

Ah, yes I didn't grok the abort on PageTruncateInProgress() until I
read this again (and again), I'll try that.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions
  2018-03-09 16:15             ` Dan Williams
@ 2018-03-09 17:26               ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2018-03-09 17:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-nvdimm, Dave Hansen, Dave Chinner,
	linux-xfs, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Andrew Morton, Christoph Hellwig

On Fri, Mar 9, 2018 at 8:15 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Mar 9, 2018 at 4:56 AM, Jan Kara <jack@suse.cz> wrote:
>> On Thu 08-03-18 09:02:30, Dan Williams wrote:
>>> On Mon, Jan 8, 2018 at 5:50 AM, Jan Kara <jack@suse.cz> wrote:
>>> > On Sun 07-01-18 13:58:42, Dan Williams wrote:
>>> >> On Thu, Jan 4, 2018 at 3:12 AM, Jan Kara <jack@suse.cz> wrote:
>>> >> > On Sat 23-12-17 16:57:31, Dan Williams wrote:
>>> >> >
>>> >> >> +     /*
>>> >> >> +      * Flush dax_dma_lock() sections to ensure all possible page
>>> >> >> +      * references have been taken, or will block on the fs
>>> >> >> +      * 'mmap_lock'.
>>> >> >> +      */
>>> >> >> +     synchronize_rcu();
>>> >> >
>>> >> > Frankly, I don't like synchronize_rcu() in a relatively hot path like this.
>>> >> > Cannot we just abuse get_dev_pagemap() to fail if truncation is in progress
>>> >> > for the pfn? We could indicate that by some bit in struct page or something
>>> >> > like that.
>>> >>
>>> >> We would need a lockless way to take a reference conditionally if the
>>> >> page is not subject to truncation.
>>> >>
>>> >> I recall the raid5 code did something similar where it split a
>>> >> reference count into 2 fields. I.e. take page->_refcount and use the
>>> >> upper bits as a truncation count. Something like:
>>> >>
>>> >> do {
>>> >>     old = atomic_read(&page->_refcount);
>>> >>     if (old & trunc_mask) /* upper bits of _refcount */
>>> >>         return false;
>>> >>     new = cnt + 1;
>>> >> } while (atomic_cmpxchg(&page->_refcount, old, new) != old);
>>> >> return true; /* we incremented the _refcount while the truncation
>>> >> count was zero */
>>> >>
>>> >> ...the only concern is teaching the put_page() path to consider that
>>> >> 'trunc_mask' when determining that the page is idle.
>>> >>
>>> >> Other ideas?
>>> >
>>> > What I rather thought about was an update to GUP paths (like
>>> > follow_page_pte()):
>>> >
>>> >         if (flags & FOLL_GET) {
>>> >                 get_page(page);
>>> >                 if (pte_devmap(pte)) {
>>> >                         /*
>>> >                          * Pairs with the barrier in the truncate path.
>>> >                          * Could be possibly _after_atomic version of the
>>> >                          * barrier.
>>> >                          */
>>> >                         smp_mb();
>>> >                         if (PageTruncateInProgress(page)) {
>>> >                                 put_page(page);
>>> >                                 ..bail...
>>> >                         }
>>> >                 }
>>> >         }
>>> >
>>> > and in the truncate path:
>>> >
>>> >         down_write(inode->i_mmap_sem);
>>> >         walk all pages in the mapping and mark them PageTruncateInProgress().
>>> >         unmap_mapping_range(...);
>>> >         /*
>>> >          * Pairs with the barrier in GUP path. In fact not necessary since
>>> >          * unmap_mapping_range() provides us with the barrier already.
>>> >          */
>>> >         smp_mb();
>>> >         /*
>>> >          * By now we are either guaranteed to see grabbed page reference or
>>> >          * GUP is guaranteed to see PageTruncateInProgress().
>>> >          */
>>> >         while ((page = dax_find_referenced_page(mapping))) {
>>> >                 ...
>>> >         }
>>> >
>>> > The barriers need some verification, I've opted for the conservative option
>>> > but I guess you get the idea.
>>>
>>> [ Reviving this thread for the next rev of this patch set for 4.17
>>> consideration ]
>>>
>>> I don't think this barrier scheme can work in the presence of
>>> get_user_pages_fast(). The get_user_pages_fast() path can race
>>> unmap_mapping_range() to take out an elevated reference count on a
>>> page.
>>
>> Why the scheme cannot work? Sure you'd need to patch also gup_pte_range()
>> and a similar thing for PMDs to recheck PageTruncateInProgress() after
>> grabbing the page reference. But in principle I don't see anything
>> fundamentally different between gup_fast() and plain gup().
>
> Ah, yes I didn't grok the abort on PageTruncateInProgress() until I
> read this again (and again), I'll try that.

Ok, so the problem is that PageTruncateInProgress() for a given page
is hard to detect without trapping deeper into the filesystem, at
least in the XFS case. The usage of xfs_break_layouts() happens well
before we know that a given file offset is going to be truncated. By
the time we're at a point in the call stack where we are committed to
truncating a given page it is then awkward to drop locks and wait on
the next page collision.

In order to support an early 'break' of dax layouts before touching
the extent map we can't rely on being able to positively determine the
pages that collide with a given truncate/hole-punch range. Instead the
approach I've taken drains all pinned / referenced pages for the inode
before attempting an operation that *might* lead to an extent unmap
event. This mirrors the pNFS lease case where all leases are broken
regardless of whether they actually collide with an extent that is
under active access from a remote client.
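
In rough pseudo-code the drain-everything approach looks like the
following, with dax_layout_busy_page() and dax_wait_page_idle() as
placeholder names rather than settled interfaces:

static int xfs_break_dax_layouts(struct inode *inode)
{
    struct page *page;
    int error;

    /* caller holds the locks that stop new faults from re-pinning pages */
    while ((page = dax_layout_busy_page(inode->i_mapping))) {
        error = dax_wait_page_idle(page);
        if (error)
            return error;
    }
    return 0;
}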

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-03-09 17:20 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-24  0:56 [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Dan Williams
2017-12-24  0:56 ` [PATCH v4 01/18] mm, dax: introduce pfn_t_special() Dan Williams
2018-01-04  8:16   ` Christoph Hellwig
2017-12-24  0:56 ` [PATCH v4 02/18] ext4: auto disable dax instead of failing mount Dan Williams
2018-01-03 14:20   ` Jan Kara
2017-12-24  0:56 ` [PATCH v4 03/18] ext2: " Dan Williams
2018-01-03 14:21   ` Jan Kara
2017-12-24  0:56 ` [PATCH v4 04/18] dax: require 'struct page' by default for filesystem dax Dan Williams
2018-01-03 15:29   ` Jan Kara
2018-01-04  8:16   ` Christoph Hellwig
2018-01-08 11:58   ` Gerald Schaefer
2017-12-24  0:56 ` [PATCH v4 05/18] dax: stop using VM_MIXEDMAP for dax Dan Williams
2018-01-03 15:27   ` Jan Kara
2017-12-24  0:56 ` [PATCH v4 06/18] dax: stop using VM_HUGEPAGE " Dan Williams
2017-12-24  0:56 ` [PATCH v4 07/18] dax: store pfns in the radix Dan Williams
2017-12-27  0:17   ` Ross Zwisler
2018-01-02 20:15     ` Dan Williams
2018-01-03 15:39   ` Jan Kara
2017-12-24  0:56 ` [PATCH v4 08/18] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
2017-12-27 18:08   ` Ross Zwisler
2018-01-02 20:35     ` Dan Williams
2018-01-02 21:44   ` Dave Chinner
2018-01-02 21:51     ` Dan Williams
2018-01-03 15:46       ` Jan Kara
2018-01-03 20:37         ` Jeff Moyer
2017-12-24  0:56 ` [PATCH v4 09/18] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks Dan Williams
2018-01-04  8:20   ` Christoph Hellwig
2017-12-24  0:56 ` [PATCH v4 10/18] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-01-04  8:25   ` Christoph Hellwig
2017-12-24  0:56 ` [PATCH v4 11/18] fs, dax: introduce DEFINE_FSDAX_AOPS Dan Williams
2017-12-27  5:29   ` Matthew Wilcox
2018-01-02 20:21     ` Dan Williams
2018-01-03 16:05       ` Jan Kara
2018-01-04  8:27         ` Christoph Hellwig
2018-01-02 21:41   ` Dave Chinner
2017-12-24  0:57 ` [PATCH v4 12/18] xfs: use DEFINE_FSDAX_AOPS Dan Williams
2018-01-02 21:15   ` Darrick J. Wong
2018-01-02 21:40     ` Dan Williams
2018-01-03 16:09       ` Jan Kara
2018-01-04  8:28   ` Christoph Hellwig
2017-12-24  0:57 ` [PATCH v4 13/18] ext4: " Dan Williams
2018-01-04  8:29   ` Christoph Hellwig
2017-12-24  0:57 ` [PATCH v4 14/18] ext2: " Dan Williams
2018-01-04  8:29   ` Christoph Hellwig
2017-12-24  0:57 ` [PATCH v4 15/18] mm, fs, dax: use page->mapping to warn if dma collides with truncate Dan Williams
2018-01-04  8:30   ` Christoph Hellwig
2018-01-04  9:39   ` Jan Kara
2017-12-24  0:57 ` [PATCH v4 16/18] wait_bit: introduce {wait_on,wake_up}_atomic_one Dan Williams
2018-01-04  8:30   ` Christoph Hellwig
2017-12-24  0:57 ` [PATCH v4 17/18] mm, fs, dax: dax_flush_dma, handle dma vs block-map-change collisions Dan Williams
2018-01-04  8:31   ` Christoph Hellwig
2018-01-04 11:12   ` Jan Kara
2018-01-07 21:58     ` Dan Williams
2018-01-08 13:50       ` Jan Kara
2018-03-08 17:02         ` Dan Williams
2018-03-09 12:56           ` Jan Kara
2018-03-09 16:15             ` Dan Williams
2018-03-09 17:26               ` Dan Williams
2017-12-24  0:57 ` [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper Dan Williams
2018-01-02 21:07   ` Darrick J. Wong
2018-01-02 23:00   ` Dave Chinner
2018-01-03  2:21     ` Dan Williams
2018-01-03  7:51       ` Dave Chinner
2018-01-04  8:34         ` Christoph Hellwig
2018-01-04  8:33     ` Christoph Hellwig
2018-01-04  8:17 ` [PATCH v4 00/18] dax: fix dma vs truncate/hole-punch Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).