* [PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions
@ 2017-10-31 23:21 ` Dan Williams
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jan Kara, Peter Zijlstra, Benjamin Herrenschmidt,
	Heiko Carstens, linux-mm, Paul Mackerras, Sean Hefty, hch,
	Matthew Wilcox, linux-rdma, Michael Ellerman, Jeff Moyer,
	Jason Gunthorpe, Doug Ledford, Ingo Molnar, Ross Zwisler,
	Hal Rosenstock, linux-media, linux-fsdevel,
	Jérôme Glisse, Mauro Carvalho Chehab, Gerald Schaefer,
	Jens Axboe, linux-kernel, stable, linux-xfs, Martin Schwidefsky,
	akpm, Kirill A. Shutemov

This is hopefully the uncontroversial lead-in set of changes that lay
the groundwork for solving the dax-dma vs truncate problem. The overview
of the changes is:

1/ Disable DAX when we do not have struct page entries backing dax
   mappings, but otherwise allow limited DAX support for axonram and
   dcssblk. Is anyone actually using the DAX capability of axonram or
   dcssblk?

2/ Disable code paths that establish potentially long-lived DMA
   access to a filesystem-dax memory mapping, i.e. RDMA and V4L2. In the
   4.16 timeframe the plan is to introduce a "register memory for DMA
   with a lease" mechanism whereby userspace can establish such mappings
   but is also responsible for tearing down the mapping when the kernel
   needs to invalidate it due to truncate or hole-punch.

3/ Add a wakeup mechanism for waiting for DAX pages to be released
   from DMA access.

This overall effort started when Christoph noted during the review of
the MAP_DIRECT proposal:

    get_user_pages on DAX doesn't give the same guarantees as on
    pagecache or anonymous memory, and that is the problem we need to
    fix. In fact I'm pretty sure if we try hard enough (and we might
    have to try very hard) we can see the same problem with plain direct
    I/O and without any RDMA involved, e.g. do a larger direct I/O write
    to memory that is mmap()ed from a DAX file, then truncate the DAX
    file and reallocate the blocks, and we might corrupt that new file.
    We'll probably need a special setup where there is little other
    chance but to reallocate those used blocks.

    So what we need to do first is to fix get_user_pages vs unmapping
    DAX mmap()ed blocks, be that from a hole punch, truncate, COW
    operation, etc.

Included in the changes is an nfit_test mechanism to trivially trigger
this collision by delaying the put_page() that the block layer performs
after completing direct-I/O to a filesystem-DAX page.
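
To picture the collision concretely, below is a rough userspace sketch of
the sequence Christoph describes. The paths, sizes, and single-threaded
layout are illustrative only; in practice the truncate has to run
concurrently with the in-flight I/O, which is exactly the window the
bio_delay mechanism widens:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (16UL << 20)

int main(void)
{
	/* hypothetical paths: a file on a fsdax mount, and a scratch file */
	int dax_fd = open("/mnt/dax/victim", O_RDWR);
	int blk_fd = open("/data/scratch", O_WRONLY | O_DIRECT);

	/* DAX: file blocks are mapped straight into userspace */
	char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED, dax_fd, 0);

	/*
	 * Direct-I/O with the DAX mapping as the buffer: the block layer
	 * get_user_pages()s the DAX pages and holds references until bio
	 * completion (artificially delayed by 'bio_delay').
	 */
	write(blk_fd, buf, SZ);

	/*
	 * Meanwhile, from another task: truncate the DAX file. The
	 * filesystem may reallocate the freed blocks to another file
	 * while the DMA above is still targeting them.
	 */
	ftruncate(dax_fd, 0);
	return 0;
}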

Given the ongoing coordination of this set across multiple sub-systems
and the dax core, my proposal is to manage this as a branch in the nvdimm
tree with acks from mm, rdma, v4l2, ext4, and xfs.

---

Dan Williams (15):
      dax: quiet bdev_dax_supported()
      mm, dax: introduce pfn_t_special()
      dax: require 'struct page' by default for filesystem dax
      brd: remove dax support
      dax: stop using VM_MIXEDMAP for dax
      dax: stop using VM_HUGEPAGE for dax
      dax: stop requiring a live device for dax_flush()
      dax: store pfns in the radix
      tools/testing/nvdimm: add 'bio_delay' mechanism
      IB/core: disable memory registration of filesystem-dax vmas
      [media] v4l2: disable filesystem-dax mapping support
      mm, dax: enable filesystems to trigger page-idle callbacks
      mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
      dax: associate mappings with inodes, and warn if dma collides with truncate
      wait_bit: introduce {wait_on,wake_up}_devmap_idle


 arch/powerpc/platforms/Kconfig            |    1 
 arch/powerpc/sysdev/axonram.c             |    3 -
 drivers/block/Kconfig                     |   12 ---
 drivers/block/brd.c                       |   65 --------------
 drivers/dax/device.c                      |    1 
 drivers/dax/super.c                       |  113 +++++++++++++++++++++----
 drivers/infiniband/core/umem.c            |   49 ++++++++---
 drivers/media/v4l2-core/videobuf-dma-sg.c |   39 ++++++++-
 drivers/nvdimm/pmem.c                     |   13 +++
 drivers/s390/block/Kconfig                |    1 
 drivers/s390/block/dcssblk.c              |    4 +
 fs/Kconfig                                |    8 ++
 fs/dax.c                                  |  131 +++++++++++++++++++----------
 fs/ext2/file.c                            |    1 
 fs/ext2/super.c                           |    6 +
 fs/ext4/file.c                            |    1 
 fs/ext4/super.c                           |    6 +
 fs/xfs/xfs_file.c                         |    2 
 fs/xfs/xfs_super.c                        |   20 ++--
 include/linux/dax.h                       |   17 ++--
 include/linux/memremap.h                  |   24 +++++
 include/linux/mm.h                        |   47 ++++++----
 include/linux/mm_types.h                  |   20 +++-
 include/linux/pfn_t.h                     |   13 +++
 include/linux/vma.h                       |   33 +++++++
 include/linux/wait_bit.h                  |   10 ++
 kernel/memremap.c                         |   36 ++++++--
 kernel/sched/wait_bit.c                   |   64 ++++++++++++--
 mm/Kconfig                                |    5 +
 mm/hmm.c                                  |   13 ---
 mm/huge_memory.c                          |    8 +-
 mm/ksm.c                                  |    3 +
 mm/madvise.c                              |    2 
 mm/memory.c                               |   22 ++++-
 mm/migrate.c                              |    3 -
 mm/mlock.c                                |    5 +
 mm/mmap.c                                 |    8 +-
 mm/swap.c                                 |    3 -
 tools/testing/nvdimm/Kbuild               |    1 
 tools/testing/nvdimm/test/iomap.c         |   62 ++++++++++++++
 tools/testing/nvdimm/test/nfit.c          |   34 ++++++++
 tools/testing/nvdimm/test/nfit_test.h     |    1 
 42 files changed, 650 insertions(+), 260 deletions(-)
 create mode 100644 include/linux/vma.h


* [PATCH 01/15] dax: quiet bdev_dax_supported()
@ 2017-10-31 23:21   ` Dan Williams
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel, linux-xfs, linux-mm, linux-fsdevel, akpm, hch

Before we add another failure reason, quiet the existing log messages.
Leave it to the caller to decide if bdev_dax_supported() failures are
errors worth emitting to the log.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 557b93703532..b0cc8117eebe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -92,21 +92,21 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	long len;
 
 	if (blocksize != PAGE_SIZE) {
-		pr_err("VFS (%s): error: unsupported blocksize for dax\n",
+		pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
 				sb->s_id);
 		return -EINVAL;
 	}
 
 	err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
 	if (err) {
-		pr_err("VFS (%s): error: unaligned partition for dax\n",
+		pr_debug("VFS (%s): error: unaligned partition for dax\n",
 				sb->s_id);
 		return err;
 	}
 
 	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
 	if (!dax_dev) {
-		pr_err("VFS (%s): error: device does not support dax\n",
+		pr_debug("VFS (%s): error: device does not support dax\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
 	}
@@ -118,7 +118,7 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	put_dax(dax_dev);
 
 	if (len < 1) {
-		pr_err("VFS (%s): error: dax access failed (%ld)",
+		pr_debug("VFS (%s): error: dax access failed (%ld)\n",
 				sb->s_id, len);
 		return len < 0 ? len : -EIO;
 	}
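
As a purely illustrative aside (not part of this patch), a filesystem
caller that still wants the failure to be user visible can now own the
message itself, along the lines of the sketch below; the helper name and
warning text are hypothetical:

static void example_dax_mount_check(struct super_block *sb)
{
	int error = bdev_dax_supported(sb, PAGE_SIZE);

	if (error) {
		/* the caller, not bdev_dax_supported(), decides to warn */
		pr_warn("%s: DAX unsupported by block device, disabling DAX\n",
				sb->s_id);
		/* ...clear the per-mount dax option here... */
	}
}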


* [PATCH 02/15] mm, dax: introduce pfn_t_special()
@ 2017-10-31 23:21   ` Dan Williams
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Benjamin Herrenschmidt, Heiko Carstens, linux-kernel, linux-xfs,
	linux-mm, Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	linux-fsdevel, akpm, hch

In support of removing the VM_MIXEDMAP indication from DAX VMAs,
introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
should be used for DAX ptes. This also helps identify drivers like
dcssblk that only want to use DAX in a read-only fashion without
get_user_pages() support.

Ideally we could delete axonram and dcssblk DAX support, but if we need
to keep it, it is better to make it explicit that axonram and dcssblk only
support a subset of DAX due to missing _PAGE_DEVMAP support.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    2 +-
 drivers/s390/block/dcssblk.c  |    3 ++-
 include/linux/pfn_t.h         |   13 +++++++++++++
 mm/memory.c                   |   16 +++++++++++++++-
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index c60e84e4558d..aaf540efb92c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -151,7 +151,7 @@ __axon_ram_direct_access(struct axon_ram_bank *bank, pgoff_t pgoff, long nr_page
 	resource_size_t offset = pgoff * PAGE_SIZE;
 
 	*kaddr = (void *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV|PFN_SPECIAL);
 	return (bank->size - offset) / PAGE_SIZE;
 }
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 7abb240847c0..87756e28c29b 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -915,7 +915,8 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
 
 	dev_sz = dev_info->end - dev_info->start + 1;
 	*kaddr = (void *) dev_info->start + offset;
-	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset), PFN_DEV);
+	*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset),
+			PFN_DEV|PFN_SPECIAL);
 
 	return (dev_sz - offset) / PAGE_SIZE;
 }
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index a49b3259cad7..2a16386725b2 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -14,8 +14,10 @@
 #define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
+#define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))
 
 #define PFN_FLAGS_TRACE \
+	{ PFN_SPECIAL,	"SPECIAL" }, \
 	{ PFN_SG_CHAIN,	"SG_CHAIN" }, \
 	{ PFN_SG_LAST,	"SG_LAST" }, \
 	{ PFN_DEV,	"DEV" }, \
@@ -119,4 +121,15 @@ pud_t pud_mkdevmap(pud_t pud);
 #endif
 #endif /* __HAVE_ARCH_PTE_DEVMAP */
 
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static inline bool pfn_t_special(pfn_t pfn)
+{
+	return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
+}
+#else
+static inline bool pfn_t_special(pfn_t pfn)
+{
+	return false;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
 #endif /* _LINUX_PFN_T_H_ */
diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..e764dc5d8a87 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1896,12 +1896,26 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+{
+	/* these checks mirror the abort conditions in vm_normal_page */
+	if (vma->vm_flags & VM_MIXEDMAP)
+		return true;
+	if (pfn_t_devmap(pfn))
+		return true;
+	if (pfn_t_special(pfn))
+		return true;
+	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+		return true;
+	return false;
+}
+
 static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
-	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+	BUG_ON(!vm_mixed_ok(vma, pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
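
For illustration only (not part of the patch), a ->direct_access()
implementation for one of these limited-DAX devices now advertises its
pfns as below, and __vm_insert_mixed() accepts them via vm_mixed_ok()
even once the vma no longer carries VM_MIXEDMAP. The device structure
and field names here are hypothetical; axonram and dcssblk above are the
real users:

static long example_dax_direct_access(struct dax_device *dax_dev,
		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
{
	struct example_dev *dev = dax_get_private(dax_dev);
	resource_size_t offset = (resource_size_t)pgoff * PAGE_SIZE;

	*kaddr = dev->virt_base + offset;
	/* no devmap/'struct page' backing, so mark the pfn PFN_SPECIAL */
	*pfn = phys_to_pfn_t(dev->phys_base + offset, PFN_DEV | PFN_SPECIAL);
	return (dev->size - offset) / PAGE_SIZE;
}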


* [PATCH 03/15] dax: require 'struct page' by default for filesystem dax
@ 2017-10-31 23:21   ` Dan Williams
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Benjamin Herrenschmidt, Heiko Carstens, linux-kernel,
	hch, linux-xfs, linux-mm, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, linux-fsdevel, akpm, Gerald Schaefer

If a dax buffer from a device that does not map pages is passed to
read(2) or write(2) as a target for direct-I/O, it triggers SIGBUS. If
gdb attempts to examine the contents of a dax buffer from a device that
does not map pages, it triggers SIGBUS. If fork(2) is called on a process
with a dax mapping from a device that does not map pages, it triggers
SIGBUS. 'struct page' is required, otherwise several kernel code paths
break in surprising ways. Disable filesystem-dax on devices that do not
map pages.

In addition to needing pfn_to_page() to be valid we also require devmap
pages.  We need this to detect dax pages in the get_user_pages_fast()
path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
drivers that have not supported get_user_pages() to date we allow them
to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
option which requires ->direct_access() to return pfn_t_special() pfns.
This leaves DAX support in brd disabled and scheduled for removal.

Note that when the initial dax support was being merged a few years back
there was concern that struct page was unsuitable for use with next
generation persistent memory devices. The theoretical concern was that
struct page access, being such a hotly used data structure in the
kernel, would lead to media wear out. While that was a reasonable
conservative starting position it has not held true in practice. We have
long since committed to using devm_memremap_pages() to support higher
order kernel functionality that needs get_user_pages() and
pfn_to_page().

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/platforms/Kconfig |    1 +
 arch/powerpc/sysdev/axonram.c  |    1 +
 drivers/dax/super.c            |   10 ++++++++++
 drivers/s390/block/Kconfig     |    1 +
 drivers/s390/block/dcssblk.c   |    1 +
 fs/Kconfig                     |    7 +++++++
 6 files changed, 21 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 4fd64d3f5c44..031313968f9a 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -296,6 +296,7 @@ config AXON_RAM
 	tristate "Axon DDR2 memory device driver"
 	depends on PPC_IBM_CELL_BLADE && BLOCK
 	select DAX
+	select FS_DAX_LIMITED
 	default m
 	help
 	  It registers one block device per Axon's DDR2 memory bank found
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index aaf540efb92c..c1abd443836f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -172,6 +172,7 @@ static size_t axon_ram_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 
 static const struct dax_operations axon_ram_dax_ops = {
 	.direct_access = axon_ram_dax_direct_access,
+
 	.copy_from_iter = axon_ram_copy_from_iter,
 };
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index b0cc8117eebe..66bcdf42c413 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/genhd.h>
+#include <linux/pfn_t.h>
 #include <linux/cdev.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
@@ -123,6 +124,15 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
+	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
+			|| pfn_t_devmap(pfn))
+		/* pass */;
+	else {
+		pr_debug("VFS (%s): error: dax support not enabled\n",
+				sb->s_id);
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__bdev_dax_supported);
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index 31f014b57bfc..594ae5fc8e9d 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -15,6 +15,7 @@ config BLK_DEV_XPRAM
 config DCSSBLK
 	def_tristate m
 	select DAX
+	select FS_DAX_LIMITED
 	prompt "DCSSBLK support"
 	depends on S390 && BLOCK
 	help
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 87756e28c29b..dbe07ab71e32 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -52,6 +52,7 @@ static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
 
 static const struct dax_operations dcssblk_dax_ops = {
 	.direct_access = dcssblk_dax_direct_access,
+
 	.copy_from_iter = dcssblk_dax_copy_from_iter,
 };
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..b40128bf6d1a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,13 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+# Selected by DAX drivers that do not expect filesystem DAX to support
+# get_user_pages() of DAX mappings. I.e. "limited" indicates no support
+# for fork() of processes with MAP_SHARED mappings or support for
+# direct-I/O to a DAX mapping.
+config FS_DAX_LIMITED
+	bool
+
 endif # BLOCK
 
 # Posix ACL utility routines
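
As a hedged illustration of what "limited" means in practice (the paths
and sizes below are made up), these are exactly the operations that, per
the commit message above, cannot work against a dax device that does not
map pages:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int dax_fd = open("/mnt/limited-dax/file", O_RDWR);
	int out_fd = open("/tmp/out", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	char *buf = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, dax_fd, 0);

	/* direct-I/O with a dax buffer: needs get_user_pages() */
	write(out_fd, buf, 1UL << 20);

	/* fork() of a process with a MAP_SHARED dax mapping */
	fork();

	/* without 'struct page' backing, these paths end in SIGBUS */
	return 0;
}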


* [PATCH 04/15] brd: remove dax support
@ 2017-10-31 23:21   ` Dan Williams
From: Dan Williams @ 2017-10-31 23:21 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jens Axboe, Matthew Wilcox, linux-kernel, hch, linux-xfs,
	linux-mm, linux-fsdevel, akpm

DAX support in brd is awkward because its backing page frames are
distinct from the ones provided by pmem, dcssblk, or axonram. We need
pfn_t_devmap() entries to fully support DAX, and the limited DAX support
for pfn_t_special() page frames is not interesting for brd when pmem is
already a superset of brd.  Lastly, brd is the only dax-capable driver
that may sleep in its ->direct_access() implementation, so it imposes a
global burden with no net gain of kernel functionality.

For all these reasons, remove DAX support.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/block/Kconfig |   12 ---------
 drivers/block/brd.c   |   65 -------------------------------------------------
 2 files changed, 77 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 2dfe99b328f8..da8bf0268ade 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -297,7 +297,6 @@ config BLK_DEV_SX8
 
 config BLK_DEV_RAM
 	tristate "RAM block device support"
-	select DAX if BLK_DEV_RAM_DAX
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
@@ -333,17 +332,6 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_RAM_DAX
-	bool "Support Direct Access (DAX) to RAM block devices"
-	depends on BLK_DEV_RAM && FS_DAX
-	default n
-	help
-	  Support filesystems using DAX to access RAM block devices.  This
-	  avoids double-buffering data in the page cache before copying it
-	  to the block device.  Answering Y will slightly enlarge the kernel,
-	  and will prevent RAM block device backing store memory from being
-	  allocated from highmem (only a problem for highmem systems).
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media (DEPRECATED)"
 	depends on !UML
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 2d7178f7754e..b2391bbd7e5a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -20,11 +20,6 @@
 #include <linux/radix-tree.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-#include <linux/pfn_t.h>
-#include <linux/dax.h>
-#include <linux/uio.h>
-#endif
 
 #include <linux/uaccess.h>
 
@@ -44,9 +39,6 @@ struct brd_device {
 
 	struct request_queue	*brd_queue;
 	struct gendisk		*brd_disk;
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-	struct dax_device	*dax_dev;
-#endif
 	struct list_head	brd_list;
 
 	/*
@@ -112,9 +104,6 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_RAM_DAX
-	gfp_flags |= __GFP_HIGHMEM;
-#endif
 	page = alloc_page(gfp_flags);
 	if (!page)
 		return NULL;
@@ -334,43 +323,6 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 	return err;
 }
 
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
-		long nr_pages, void **kaddr, pfn_t *pfn)
-{
-	struct page *page;
-
-	if (!brd)
-		return -ENODEV;
-	page = brd_insert_page(brd, (sector_t)pgoff << PAGE_SECTORS_SHIFT);
-	if (!page)
-		return -ENOSPC;
-	*kaddr = page_address(page);
-	*pfn = page_to_pfn_t(page);
-
-	return 1;
-}
-
-static long brd_dax_direct_access(struct dax_device *dax_dev,
-		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
-{
-	struct brd_device *brd = dax_get_private(dax_dev);
-
-	return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
-}
-
-static size_t brd_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t bytes, struct iov_iter *i)
-{
-	return copy_from_iter(addr, bytes, i);
-}
-
-static const struct dax_operations brd_dax_ops = {
-	.direct_access = brd_dax_direct_access,
-	.copy_from_iter = brd_dax_copy_from_iter,
-};
-#endif
-
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
@@ -450,21 +402,8 @@ static struct brd_device *brd_alloc(int i)
 	sprintf(disk->disk_name, "ram%d", i);
 	set_capacity(disk, rd_size * 2);
 
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
-	brd->dax_dev = alloc_dax(brd, disk->disk_name, &brd_dax_ops);
-	if (!brd->dax_dev)
-		goto out_free_inode;
-#endif
-
-
 	return brd;
 
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-out_free_inode:
-	kill_dax(brd->dax_dev);
-	put_dax(brd->dax_dev);
-#endif
 out_free_queue:
 	blk_cleanup_queue(brd->brd_queue);
 out_free_dev:
@@ -504,10 +443,6 @@ static struct brd_device *brd_init_one(int i, bool *new)
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-#ifdef CONFIG_BLK_DEV_RAM_DAX
-	kill_dax(brd->dax_dev);
-	put_dax(brd->dax_dev);
-#endif
 	del_gendisk(brd->brd_disk);
 	brd_free(brd);
 }

* [PATCH 05/15] dax: stop using VM_MIXEDMAP for dax
  2017-10-31 23:21 ` Dan Williams
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Jan Kara, akpm, linux-kernel, linux-xfs, linux-mm,
	linux-fsdevel, hch, Kirill A. Shutemov

VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
the page they are dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
already uses {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use
that as the VM_MIXEDMAP replacement in those locations. In the cases
where there is no pte to consult, we fall back to vma_is_dax() to detect
the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX, we can stop setting VM_MIXEDMAP. This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.
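
For illustration, a minimal sketch of the two stand-in checks described
above: pte_devmap() where a page table entry is at hand, and
vma_is_dax() where only the vma is available. The helper name
dax_pte_or_vma_is_special() is hypothetical and not part of this patch
(assumes the usual <linux/mm.h> and <linux/dax.h> declarations):

	/* hypothetical consolidation of the checks described above */
	static bool dax_pte_or_vma_is_special(struct vm_area_struct *vma,
			pte_t pte)
	{
		/* with a pte in hand, the devmap bit identifies dax memory */
		if (pte_devmap(pte))
			return true;
		/* with no pte to consult, fall back to the vma itself */
		return vma_is_dax(vma);
	}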

Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    2 +-
 fs/ext2/file.c       |    1 -
 fs/ext4/file.c       |    2 +-
 fs/xfs/xfs_file.c    |    2 +-
 include/linux/mm.h   |    1 +
 include/linux/vma.h  |   33 +++++++++++++++++++++++++++++++++
 mm/huge_memory.c     |    8 ++++----
 mm/ksm.c             |    3 +++
 mm/madvise.c         |    2 +-
 mm/memory.c          |    8 ++++++--
 mm/migrate.c         |    3 ++-
 mm/mlock.c           |    5 +++--
 mm/mmap.c            |    8 ++++----
 13 files changed, 60 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/vma.h

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..ed79d006026e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..70657e8550ed 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -125,7 +125,6 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1da660ac3bc..0cc9d205bd96 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..c419c6fdb769 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1134,7 +1134,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..8c1e3ac77285 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2338,6 +2338,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
 int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/include/linux/vma.h b/include/linux/vma.h
new file mode 100644
index 000000000000..135ad5262cd1
--- /dev/null
+++ b/include/linux/vma.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __VMA_H__
+#define __VMA_H__
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/hugetlb_inline.h>
+
+/*
+ * There are several vma types that have special handling in the
+ * get_user_pages() path and other core mm paths that must not assume
+ * normal pages. vma_is_special() consolidates checks for VM_SPECIAL,
+ * hugetlb and dax vmas, but note that there are 'special' vmas and
+ * special circumstances beyond these types. In other words this helper
+ * is not exhaustive.
+ */
+static inline bool vma_is_special(struct vm_area_struct *vma)
+{
+	return vma && (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)
+			|| vma_is_dax(vma));
+}
+#endif /* __VMA_H__ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 269b5df58543..3cabd682da1c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,11 +765,11 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| vm_mixed_ok(vma, pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
@@ -824,11 +824,11 @@ int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| vm_mixed_ok(vma, pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/ksm.c b/mm/ksm.c
index 6cb60f46cce5..72f196a36503 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2361,6 +2361,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 				 VM_HUGETLB | VM_MIXEDMAP))
 			return 0;		/* just ignore the advice */
 
+		if (vma_is_dax(vma))
+			return 0;
+
 #ifdef VM_SAO
 		if (*vm_flags & VM_SAO)
 			return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 25bade36e9ca..50513a7a11f6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -95,7 +95,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (new_flags & VM_SPECIAL) {
+		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
 			error = -EINVAL;
 			goto out;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index e764dc5d8a87..150c9194db27 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -830,6 +830,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
+		if (pte_devmap(pte))
+			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
 
@@ -917,6 +919,8 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
+	if (pmd_devmap(pmd))
+		return NULL;
 	if (is_zero_pfn(pfn))
 		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn))
@@ -1227,7 +1231,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * efficient than faulting.
 	 */
 	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-			!vma->anon_vma)
+			!vma->anon_vma && !vma_is_dax(vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(vma))
@@ -1896,7 +1900,7 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 {
 	/* these checks mirror the abort conditions in vm_normal_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
diff --git a/mm/migrate.c b/mm/migrate.c
index 6954c1435833..13f8748e7cba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -45,6 +45,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/vma.h>
 
 #include <asm/tlbflush.h>
 
@@ -2927,7 +2928,7 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 	/* Sanity check the arguments */
 	start &= PAGE_MASK;
 	end &= PAGE_MASK;
-	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+	if (!vma || vma_is_special(vma))
 		return -EINVAL;
 	if (start < vma->vm_start || start >= vma->vm_end)
 		return -EINVAL;
diff --git a/mm/mlock.c b/mm/mlock.c
index dfc6f1912176..4e20915ddfef 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -22,6 +22,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/vma.h>
 
 #include "internal.h"
 
@@ -519,8 +520,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int lock = !!(newflags & VM_LOCKED);
 	vm_flags_t old_flags = vma->vm_flags;
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || vma_is_special(vma)
+			|| vma == get_gate_vma(current->mm))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..c28996f74320 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -45,6 +45,7 @@
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
 #include <linux/oom.h>
+#include <linux/vma.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1722,11 +1723,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
-		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
-			mm->locked_vm += (len >> PAGE_SHIFT);
-		else
+		if (vma_is_special(vma) || vma == get_gate_vma(current->mm))
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+		else
+			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
 
 	if (file)

* [PATCH 06/15] dax: stop using VM_HUGEPAGE for dax
  2017-10-31 23:21 ` Dan Williams
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, linux-kernel, hch, linux-xfs, linux-mm, linux-fsdevel, akpm

This flag is deprecated in favor of the vma_is_dax() check in
transparent_hugepage_enabled(), added in commit baabda261424 ("mm:
always enable thp for dax mappings").
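
For reference, a simplified sketch of why the flag is now redundant:
the thp eligibility check consults vma_is_dax() directly, so a dax vma
qualifies regardless of VM_HUGEPAGE. This approximates the logic added
by baabda261424 with a hypothetical helper name; the real check also
honors the global enabled/madvise sysfs setting:

	/* simplified: is this vma eligible for transparent huge pages? */
	static bool thp_enabled_for_vma(struct vm_area_struct *vma)
	{
		if (vma_is_dax(vma))
			return true;	/* dax mappings always qualify */
		if (vma->vm_flags & VM_NOHUGEPAGE)
			return false;
		return !!(vma->vm_flags & VM_HUGEPAGE);
	}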

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    1 -
 fs/ext4/file.c       |    1 -
 fs/xfs/xfs_file.c    |    2 --
 3 files changed, 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index ed79d006026e..74a35eb5e6d3 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,6 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0cc9d205bd96..a54e1b4c49f9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,6 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c419c6fdb769..c6780743f8ec 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1133,8 +1133,6 @@ xfs_file_mmap(
 {
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
-	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 

* [PATCH 07/15] dax: stop requiring a live device for dax_flush()
  2017-10-31 23:21 ` Dan Williams
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel, linux-xfs, linux-mm, linux-fsdevel, akpm, hch

Now that dax_flush() is no longer a driver callback (commit c3ca015fab6d
"dax: remove the pmem_dax_ops->flush abstraction"), stop requiring the
dax_read_lock() to be held and the device to be alive.  This is in
preparation for switching filesystem-dax to store pfns instead of
sectors in the radix.
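
A sketch of the calling convention this change permits; kaddr is
assumed to have been obtained earlier via dax_direct_access():

	/* before: dax_flush() had to run with the device alive */
	int id = dax_read_lock();

	if (dax_alive(dax_dev))
		dax_flush(dax_dev, kaddr, size);
	dax_read_unlock(id);

	/* after: only the DAXDEV_WRITE_CACHE flag is consulted */
	dax_flush(dax_dev, kaddr, size);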

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 66bcdf42c413..abfd4e92d669 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -283,9 +283,6 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-	if (unlikely(!dax_alive(dax_dev)))
-		return;
-
 	if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
 		return;
 

* [PATCH 08/15] dax: store pfns in the radix
  2017-10-31 23:21 ` Dan Williams
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Matthew Wilcox, linux-kernel, hch, linux-xfs, linux-mm,
	linux-fsdevel, akpm

In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
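
In brief, the radix slot still holds an exceptional entry, but its
upper bits now carry a pfn rather than a sector. A simplified round
trip of the two helpers touched below, with the flag and shift names
as used in fs/dax.c:

	/* pack: pfn into a locked, exceptional radix entry */
	entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);

	/* unpack: recover the pfn for writeback and busy-page checks */
	pfn = (unsigned long)entry >> RADIX_DAX_SHIFT;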

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |   15 ++++++++--
 fs/dax.c            |   75 +++++++++++++++++++--------------------------------
 2 files changed, 40 insertions(+), 50 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index abfd4e92d669..3ccb064d200d 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
-	if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-			|| pfn_t_devmap(pfn))
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+		/*
+		 * An arch that has enabled the pmem api should also
+		 * have its drivers support pfn_t_devmap()
+		 *
+		 * This is a developer warning and should not trigger in
+		 * production. dax_flush() will crash since it depends
+		 * on being able to do (page_address(pfn_to_page())).
+		 */
+		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+	} else if (pfn_t_devmap(pfn)) {
 		/* pass */;
-	else {
+	} else {
 		pr_debug("VFS (%s): error: dax support not enabled\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..ac6497dcfebd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -72,16 +72,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -525,12 +524,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -547,7 +547,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -653,17 +653,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -682,7 +679,7 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * compare sectors as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -712,29 +709,11 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 	 * worry about partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -745,8 +724,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -804,8 +781,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -843,7 +820,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 	}
 	dax_read_unlock(id);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 
@@ -852,6 +829,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
 	else
 		return vm_insert_mixed(vma, vaddr, pfn);
+	return rc;
 }
 
 /*
@@ -869,6 +847,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -876,14 +855,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1250,8 +1230,7 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 		goto unlock_fallback;
 	dax_read_unlock(id);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
-			RADIX_DAX_PMD);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn, RADIX_DAX_PMD);
 	if (IS_ERR(ret))
 		goto fallback;
 
@@ -1276,13 +1255,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE);
 	if (IS_ERR(ret))
 		goto fallback;

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 09/15] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-10-31 23:21 ` Dan Williams
  (?)
@ 2017-10-31 23:22   ` Dan Williams
  -1 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-kernel, linux-xfs, linux-mm, linux-fsdevel, akpm, hch

In support of testing truncate colliding with DMA, add a mechanism that
delays the completion of block I/O requests by a programmable number of
seconds. This allows a truncate operation to be issued while page
references are still held for direct-I/O.
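
A rough user-space reproducer sketch (not part of the patch; the sysfs
path, device node, and mount point are assumptions for an nfit_test
setup): a direct-I/O read from a scratch block device lands in a buffer
mmap()ed from a filesystem-dax file, so the dax page stays pinned until
the delayed bio completes, and the parent's truncate collides with that
reference.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int sysfs = open("/sys/bus/platform/drivers/nfit_test/bio_delay",
			O_WRONLY);
	int blk = open("/dev/pmem1", O_RDONLY | O_DIRECT); /* scratch dev */
	int dax = open("/mnt/dax/victim", O_RDWR);         /* fs-dax file */
	void *buf;

	if (sysfs < 0 || blk < 0 || dax < 0)
		return 1;

	ftruncate(dax, 4096);
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, dax, 0);
	if (buf == MAP_FAILED)
		return 1;

	write(sysfs, "30", 2);		/* hold bio completions for 30s */

	if (fork() == 0) {
		/* pins the dax page via get_user_pages(), then blocks
		 * until the delayed bio completes and drops the page
		 */
		read(blk, buf, 4096);
		_exit(0);
	}

	sleep(1);
	ftruncate(dax, 0);		/* collides with the pinned page */
	wait(NULL);
	return 0;
}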

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 tools/testing/nvdimm/Kbuild           |    1 +
 tools/testing/nvdimm/test/iomap.c     |   62 +++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit.c      |   34 ++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |    1 +
 4 files changed, 98 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..5946cf3afe74 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,6 +15,7 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=bio_endio
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1f5d7182ca9c 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/workqueue.h>
 #include <linux/memremap.h>
 #include <linux/rculist.h>
 #include <linux/export.h>
@@ -18,6 +19,7 @@
 #include <linux/types.h>
 #include <linux/pfn_t.h>
 #include <linux/acpi.h>
+#include <linux/bio.h>
 #include <linux/io.h>
 #include <linux/mm.h>
 #include "nfit_test.h"
@@ -388,4 +390,64 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+static DEFINE_SPINLOCK(bio_lock);
+static struct bio *biolist;
+int bio_do_queue;
+
+static void run_bio(struct work_struct *work)
+{
+	struct delayed_work *dw = container_of(work, typeof(*dw), work);
+	struct bio *bio, *next;
+
+	pr_info("%s\n", __func__);
+	spin_lock(&bio_lock);
+	bio_do_queue = 0;
+	bio = biolist;
+	biolist = NULL;
+	spin_unlock(&bio_lock);
+
+	while (bio) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		bio_endio(bio);
+		bio = next;
+	}
+	kfree(dw);
+}
+
+void nfit_test_inject_bio_delay(int sec)
+{
+	struct delayed_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);
+
+	spin_lock(&bio_lock);
+	if (!bio_do_queue) {
+		pr_info("%s: %d seconds\n", __func__, sec);
+		INIT_DELAYED_WORK(dw, run_bio);
+		bio_do_queue = 1;
+		schedule_delayed_work(dw, sec * HZ);
+		dw = NULL;
+	}
+	spin_unlock(&bio_lock);
+}
+EXPORT_SYMBOL_GPL(nfit_test_inject_bio_delay);
+
+void __wrap_bio_endio(struct bio *bio)
+{
+	int did_q = 0;
+
+	spin_lock(&bio_lock);
+	if (bio_do_queue) {
+		bio->bi_next = biolist;
+		biolist = bio;
+		did_q = 1;
+	}
+	spin_unlock(&bio_lock);
+
+	if (did_q)
+		return;
+
+	bio_endio(bio);
+}
+EXPORT_SYMBOL_GPL(__wrap_bio_endio);
+
 MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index bef419d4266d..2c871c8b4a56 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -656,6 +656,39 @@ static const struct attribute_group *nfit_test_dimm_attribute_groups[] = {
 	NULL,
 };
 
+static ssize_t bio_delay_show(struct device_driver *drv, char *buf)
+{
+	return sprintf(buf, "0\n");
+}
+
+static ssize_t bio_delay_store(struct device_driver *drv, const char *buf,
+		size_t count)
+{
+	unsigned long delay;
+	int rc = kstrtoul(buf, 0, &delay);
+
+	if (rc < 0)
+		return rc;
+
+	nfit_test_inject_bio_delay(delay);
+	return count;
+}
+DRIVER_ATTR_RW(bio_delay);
+
+static struct attribute *nfit_test_driver_attributes[] = {
+	&driver_attr_bio_delay.attr,
+	NULL,
+};
+
+static struct attribute_group nfit_test_driver_attribute_group = {
+	.attrs = nfit_test_driver_attributes,
+};
+
+static const struct attribute_group *nfit_test_driver_attribute_groups[] = {
+	&nfit_test_driver_attribute_group,
+	NULL,
+};
+
 static int nfit_test0_alloc(struct nfit_test *t)
 {
 	size_t nfit_size = sizeof(struct acpi_nfit_system_address) * NUM_SPA
@@ -1905,6 +1938,7 @@ static struct platform_driver nfit_test_driver = {
 	.remove = nfit_test_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
+		.groups = nfit_test_driver_attribute_groups,
 	},
 	.id_table = nfit_test_id,
 };
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
index d3d63dd5ed38..0d818d2adaf7 100644
--- a/tools/testing/nvdimm/test/nfit_test.h
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -46,4 +46,5 @@ void nfit_test_setup(nfit_test_lookup_fn lookup,
 		nfit_test_evaluate_dsm_fn evaluate);
 void nfit_test_teardown(void);
 struct nfit_test_resource *get_nfit_res(resource_size_t resource);
+void nfit_test_inject_bio_delay(int sec);
 #endif

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 10/15] IB/core: disable memory registration of filesystem-dax vmas
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jason Gunthorpe, Doug Ledford, linux-rdma, linux-kernel, stable,
	hch, linux-xfs, linux-mm, Hal Rosenstock, linux-fsdevel,
	Sean Hefty, akpm

Until there is a solution to the dma-to-dax vs truncate problem, it is
not safe to allow RDMA to create long-standing memory registrations
against filesystem-dax vmas. Device-dax vmas do not have this problem and
are explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented, and is limited to non-ODP (On Demand
Paging) capable RDMA devices.
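
For illustration, on a non-ODP device the user-visible effect is that
registering a buffer mmap()ed from a filesystem-dax file now fails,
while device-dax mappings are still accepted. This is a sketch, not
part of the patch; the file path and length are assumptions and error
handling is abbreviated:

#include <fcntl.h>
#include <infiniband/verbs.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	size_t len = 2 << 20;
	int fd = open("/mnt/dax/buffer", O_RDWR);  /* file on a -o dax mount */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	struct ibv_mr *mr;

	mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		/* ib_umem_get() refused the filesystem-dax vma */
		perror("ibv_reg_mr");
	return 0;
}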

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: <linux-rdma@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |   49 +++++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..c30d286c1f24 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -147,19 +147,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to protect against
+	 * registering filesystem-dax memory. Otherwise we can tolerate
+	 * a failure to allocate the vma_list and just assume that all
+	 * vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_FS_DAX))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +201,34 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			struct vm_area_struct *vma;
+			struct inode *inode;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			/* device-dax is safe for rdma... */
+			inode = file_inode(vma->vm_file);
+			if (inode->i_mode == S_IFCHR)
+				continue;
+
+			/* ...filesystem-dax is not. */
+			ret = -EOPNOTSUPP;
+			goto out;
 		}
 
 		/* preparing for next loop */
@@ -242,6 +263,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support
  2017-10-31 23:21 ` Dan Williams
  (?)
@ 2017-10-31 23:22   ` Dan Williams
  -1 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, linux-kernel, stable, linux-xfs, linux-mm,
	linux-fsdevel, akpm, Mauro Carvalho Chehab, hch, linux-media

V4L2 memory registrations are incompatible with filesystem-dax, which
needs the ability to revoke DMA access to a mapping at will, or
otherwise allow the kernel to wait for the completion of DMA. The
filesystem-dax implementation breaks the traditional solution of
truncating active file-backed mappings, since there is no page-cache
page we can orphan to sustain ongoing DMA.

If v4l2 wants to support long-lived DMA mappings it needs to arrange to
hold a file lease, or use some other mechanism, so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.

Reported-by: Jan Kara <jack@suse.cz>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-media@vger.kernel.org
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c |   39 ++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 0b5c43f7e020..37a4ae61b2c0 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -155,8 +155,9 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 			int direction, unsigned long data, unsigned long size)
 {
 	unsigned long first, last;
-	int err, rw = 0;
+	int err, rw = 0, i, nr_pages;
 	unsigned int flags = FOLL_FORCE;
+	struct vm_area_struct **vmas = NULL;
 
 	dma->direction = direction;
 	switch (dma->direction) {
@@ -179,6 +180,16 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	if (NULL == dma->pages)
 		return -ENOMEM;
 
+	if (IS_ENABLED(CONFIG_FS_DAX)) {
+		vmas = kmalloc(dma->nr_pages * sizeof(struct vm_area_struct *),
+				GFP_KERNEL);
+		if (NULL == vmas) {
+			kfree(dma->pages);
+			dma->pages = NULL;
+			return -ENOMEM;
+		}
+	}
+
 	if (rw == READ)
 		flags |= FOLL_WRITE;
 
@@ -186,7 +197,31 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 		data, size, dma->nr_pages);
 
 	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
-			     flags, dma->pages, NULL);
+			     flags, dma->pages, vmas);
+	nr_pages = err;
+
+	for (i = 0; vmas && i < nr_pages; i++) {
+		struct vm_area_struct *vma = vmas[i];
+		struct inode *inode;
+
+		if (!vma_is_dax(vma))
+			continue;
+
+		/* device-dax is safe for long-lived v4l2 mappings... */
+		inode = file_inode(vma->vm_file);
+		if (inode->i_mode == S_IFCHR)
+			continue;
+
+		/* ...filesystem-dax is not. */
+		err = -EOPNOTSUPP;
+		break;
+
+		/*
+		 * FIXME: add a 'with lease' mechanism for v4l2 to
+		 * obtain time-bounded access to filesystem-dax mappings
+		 */
+	}
+	kfree(vmas);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;

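The vma checks above are the same test the preceding IB/core patch
open-codes, so the pattern may be easier to read as a standalone helper.
This is only a sketch of the intent of the hunks above; the helper name is
illustrative and not introduced by this series:

#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/mm.h>

/*
 * True when a vma is backed by filesystem-dax: a DAX vma whose backing
 * inode is a regular file rather than a device-dax character device.
 * These are the mappings that cannot yet support long-lived DMA
 * registrations.
 */
static bool vma_is_filesystem_dax(struct vm_area_struct *vma)
{
	struct inode *inode;

	if (!vma_is_dax(vma))
		return false;
	/* vma_is_dax() implies vma->vm_file is valid */
	inode = file_inode(vma->vm_file);
	/* device-dax instances are character-device inodes */
	return inode->i_mode != S_IFCHR;
}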

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 12/15] mm, dax: enable filesystems to trigger page-idle callbacks
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, linux-kernel, linux-xfs, linux-mm,
	Jérôme Glisse, linux-fsdevel, akpm, hch

Towards solving the DAX-DMA vs truncate problem, arrange for
filesystems to set up a page-idle callback when they mount a
dax_device.

No functional change is expected, as this only registers a nop handler
for the ->page_free() event on device-mapped pages.

Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
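
To spell out the intended calling convention before the diff: a filesystem
claims the dax_device, and with it the pgmap ->page_free slot, when it
mounts, and releases it with the same owner cookie when it unmounts or when
the mount fails, mirroring the ext2/ext4/xfs hunks below. A minimal sketch
with hypothetical example_* names that are not part of this patch:

#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/slab.h>

struct example_sb_info {
	struct dax_device *s_daxdev;	/* claimed at mount, released at unmount */
};

static int example_fill_super(struct super_block *sb, void *data, int silent)
{
	struct example_sb_info *sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);

	if (!sbi)
		return -ENOMEM;

	/*
	 * Claim the dax_device and install the (for now nop) ->page_free()
	 * handler. A NULL return means the block device has no DAX support,
	 * or the pgmap is already owned, and the fs simply runs without DAX.
	 */
	sbi->s_daxdev = fs_dax_claim_bdev(sb->s_bdev, sb);
	sb->s_fs_info = sbi;

	/* ... the rest of the usual fill_super work would go here ... */
	return 0;
}

static void example_put_super(struct super_block *sb)
{
	struct example_sb_info *sbi = sb->s_fs_info;

	/* drop the claim with the same owner cookie used at mount time */
	fs_dax_release(sbi->s_daxdev, sb);
	kfree(sbi);
}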
---
 drivers/dax/super.c      |   80 ++++++++++++++++++++++++++++++++++++++++------
 drivers/nvdimm/pmem.c    |   13 +++++++
 fs/ext2/super.c          |    6 ++-
 fs/ext4/super.c          |    6 ++-
 fs/xfs/xfs_super.c       |   20 ++++++------
 include/linux/dax.h      |   17 +++++-----
 include/linux/memremap.h |    6 +++
 kernel/memremap.c        |    1 +
 8 files changed, 113 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 3ccb064d200d..193e0cd8d90c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-	if (!blk_queue_dax(bdev->bd_queue))
-		return NULL;
-	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,65 @@ struct dax_device {
 	const char *host;
 	void *private;
 	unsigned long flags;
+	struct dev_pagemap *pgmap;
 	const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+
+	if (!blk_queue_dax(bdev->bd_queue))
+		return NULL;
+	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+	if (!dax_dev->pgmap)
+		return dax_dev;
+	pgmap = dax_dev->pgmap;
+
+	mutex_lock(&devmap_lock);
+	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+			|| pgmap->page_fault
+			|| pgmap->type != MEMORY_DEVICE_HOST) {
+		put_dax(dax_dev);
+		mutex_unlock(&devmap_lock);
+		return NULL;
+	}
+
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+	pgmap->data = owner;
+	mutex_unlock(&devmap_lock);
+
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+	struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+	put_dax(dax_dev);
+	if (!pgmap)
+		return;
+	if (!pgmap->data)
+		return;
+
+	mutex_lock(&devmap_lock);
+	WARN_ON(pgmap->data != owner);
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
+	mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -502,10 +549,23 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+	struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+	if (dax_dev)
+		dax_dev->pgmap = pgmap;
+	return dax_dev;
+}
+EXPORT_SYMBOL_GPL(alloc_dax_devmap);
+
 void put_dax(struct dax_device *dax_dev)
 {
 	if (!dax_dev)
 		return;
+	put_dev_pagemap(dax_dev->pgmap);
+	dax_dev->pgmap = NULL;
 	iput(&dax_dev->inode);
 }
 EXPORT_SYMBOL_GPL(put_dax);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 39dfd7affa31..9f36a2193eb6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -300,6 +300,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct vmem_altmap __altmap, *altmap = NULL;
 	int nid = dev_to_node(dev), fua, wbc;
 	struct resource *res = &nsio->res;
+	struct dev_pagemap *pgmap = NULL;
 	struct nd_pfn *nd_pfn = NULL;
 	struct dax_device *dax_dev;
 	struct nd_pfn_sb *pfn_sb;
@@ -358,14 +359,24 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pfn_flags |= PFN_MAP;
 		res = &pfn_res; /* for badblocks populate */
 		res->start += pmem->data_offset;
+		pgmap = get_dev_pagemap(PHYS_PFN(virt_to_phys(addr)), NULL);
+		if (!pgmap)
+			return -ENOMEM;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
 				&q->q_usage_counter, NULL);
 		pmem->pfn_flags |= PFN_MAP;
+		pgmap = get_dev_pagemap(PHYS_PFN(virt_to_phys(addr)), NULL);
+		if (!pgmap)
+			return -ENOMEM;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 
+
+	/* we still hold a reference until the driver is unloaded */
+	put_dev_pagemap(pgmap);
+
 	/*
 	 * At release time the queue must be frozen before
 	 * devm_memremap_pages is unwound
@@ -402,7 +413,7 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	dax_dev = alloc_dax_devmap(pmem, disk->disk_name, &pmem_dax_ops, pgmap);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 1458706bd2ec..884784acd9fd 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -171,7 +171,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -814,7 +814,7 @@ static unsigned long descriptor_loc(struct super_block *sb,
 
 static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	struct buffer_head * bh;
 	struct ext2_sb_info * sbi;
 	struct ext2_super_block * es;
@@ -1202,7 +1202,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
 failed:
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return ret;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b104096fce9e..f27355f41616 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -950,7 +950,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_dax_release(sbi->s_daxdev, sb);
 	kfree(sbi);
 }
 
@@ -3397,7 +3397,7 @@ static void ext4_set_resv_clusters(struct super_block *sb)
 
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 {
-	struct dax_device *dax_dev = fs_dax_get_by_bdev(sb->s_bdev);
+	struct dax_device *dax_dev = fs_dax_claim_bdev(sb->s_bdev, sb);
 	char *orig_data = kstrdup(data, GFP_KERNEL);
 	struct buffer_head *bh;
 	struct ext4_super_block *es = NULL;
@@ -4400,7 +4400,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 out_free_base:
 	kfree(sbi);
 	kfree(orig_data);
-	fs_put_dax(dax_dev);
+	fs_dax_release(dax_dev, sb);
 	return err ? err : ret;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 584cf2d573ba..fdc4785a5c85 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -722,7 +722,7 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
 	if (mp->m_rtdev_targp) {
 		struct block_device *rtdev = mp->m_rtdev_targp->bt_bdev;
@@ -730,10 +730,10 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp, mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
-		fs_put_dax(dax_rtdev);
+		fs_dax_release(dax_rtdev, mp);
 	}
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 }
 
 /*
@@ -751,9 +751,9 @@ xfs_open_devices(
 	struct xfs_mount	*mp)
 {
 	struct block_device	*ddev = mp->m_super->s_bdev;
-	struct dax_device	*dax_ddev = fs_dax_get_by_bdev(ddev);
-	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
+	struct dax_device	*dax_ddev = fs_dax_claim_bdev(ddev, mp);
 	struct block_device	*logdev = NULL, *rtdev = NULL;
+	struct dax_device	*dax_logdev = NULL, *dax_rtdev = NULL;
 	int			error;
 
 	/*
@@ -763,7 +763,7 @@ xfs_open_devices(
 		error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
 		if (error)
 			goto out;
-		dax_logdev = fs_dax_get_by_bdev(logdev);
+		dax_logdev = fs_dax_claim_bdev(logdev, mp);
 	}
 
 	if (mp->m_rtname) {
@@ -777,7 +777,7 @@ xfs_open_devices(
 			error = -EINVAL;
 			goto out_close_rtdev;
 		}
-		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		dax_rtdev = fs_dax_claim_bdev(rtdev, mp);
 	}
 
 	/*
@@ -811,14 +811,14 @@ xfs_open_devices(
 	xfs_free_buftarg(mp, mp->m_ddev_targp);
  out_close_rtdev:
 	xfs_blkdev_put(rtdev);
-	fs_put_dax(dax_rtdev);
+	fs_dax_release(dax_rtdev, mp);
  out_close_logdev:
 	if (logdev && logdev != ddev) {
 		xfs_blkdev_put(logdev);
-		fs_put_dax(dax_logdev);
+		fs_dax_release(dax_logdev, mp);
 	}
  out:
-	fs_put_dax(dax_ddev);
+	fs_dax_release(dax_ddev, mp);
 	return error;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 122197124b9d..ea21ebfd1889 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -50,12 +50,8 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return dax_get_by_host(host);
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
-
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner);
+void fs_dax_release(struct dax_device *dax_dev, void *owner);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -67,13 +63,14 @@ static inline struct dax_device *fs_dax_get_by_host(const char *host)
 	return NULL;
 }
 
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline struct dax_device *fs_dax_claim_bdev(struct block_device *bdev,
+		void *owner)
 {
+	return NULL;
 }
 
-static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
+static inline void fs_dax_release(struct dax_device *dax_dev, void *owner)
 {
-	return NULL;
 }
 #endif
 
@@ -81,6 +78,8 @@ int dax_read_lock(void);
 void dax_read_unlock(int id);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops);
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+		const struct dax_operations *ops, struct dev_pagemap *pgmap);
 bool dax_alive(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79f8ba7c3894..39d2de3f744b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -64,11 +64,17 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * MEMORY_DEVICE_HOST memory that is being managed by a filesystem. The
+ * filesystem needs page idle callbacks to coordinate direct-I/O + DMA
+ * (get_user_pages) vs truncate.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 403ab9cdb949..bf61cfa89c7d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -323,6 +323,7 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 	page_map = radix_tree_lookup(&pgmap_radix, PHYS_PFN(phys));
 	return page_map ? &page_map->pgmap : NULL;
 }
+EXPORT_SYMBOL_GPL(find_dev_pagemap);
 
 /**
  * devm_memremap_pages - remap and provide memmap backing for the given resource


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 13/15] mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
  2017-10-31 23:21 ` Dan Williams
  (?)
  (?)
@ 2017-10-31 23:22   ` Dan Williams
  -1 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, linux-kernel, linux-xfs, linux-mm,
	Jérôme Glisse, linux-fsdevel, akpm, hch

Combine what are now three use cases of page-idle callbacks for ZONE_DEVICE
memory (device-private, device-public, and filesystem-DAX pages) under a
single selectable symbol, CONFIG_DEVMAP_MANAGED_PAGES.
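
To make the consolidation concrete, the release path that the three users
now share looks roughly as follows. This is a condensed sketch abridged
from the include/linux/mm.h and kernel/memremap.c hunks below, not
additional code:

    /* gated by a static key so non-devmap configurations pay ~nothing */
    static inline bool put_devmap_managed_page(struct page *page)
    {
            if (!static_branch_unlikely(&devmap_managed_key))
                    return false;
            if (!is_zone_device_page(page))
                    return false;
            switch (page->pgmap->type) {
            case MEMORY_DEVICE_PRIVATE:
            case MEMORY_DEVICE_PUBLIC:
            case MEMORY_DEVICE_FS_DAX:
                    /*
                     * invokes the pgmap's page_free() callback once the
                     * refcount drops to 1, i.e. the page has gone idle
                     */
                    __put_devmap_managed_page(page);
                    return true;
            default:
                    return false;
            }
    }

put_page() checks this first, so device-private, device-public and fs-dax
pages all receive their page-idle callback through the same gate.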

Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    2 ++
 fs/Kconfig               |    1 +
 include/linux/memremap.h |   18 +++++++++++++++---
 include/linux/mm.h       |   46 ++++++++++++++++++++++++----------------------
 kernel/memremap.c        |   25 +++++++++++++++++++++----
 mm/Kconfig               |    5 +++++
 mm/hmm.c                 |   13 ++-----------
 mm/swap.c                |    3 ++-
 8 files changed, 72 insertions(+), 41 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 193e0cd8d90c..4ac359e14777 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -190,6 +190,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
 		return NULL;
 	}
 
+	devmap_managed_pages_enable();
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
 	pgmap->page_free = generic_dax_pagefree;
 	pgmap->data = owner;
@@ -214,6 +215,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
 	pgmap->type = MEMORY_DEVICE_HOST;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
+	devmap_managed_pages_disable();
 	mutex_unlock(&devmap_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index b40128bf6d1a..cd4ee17ecdd8 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEVMAP_MANAGED_PAGES
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 39d2de3f744b..a6716f5335e7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,5 @@
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -138,6 +137,9 @@ struct dev_pagemap {
 	enum memory_type type;
 };
 
+void devmap_managed_pages_enable(void);
+void devmap_managed_pages_disable(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
@@ -164,7 +166,7 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
+#ifdef CONFIG_DEVMAP_MANAGED_PAGES
 static inline bool is_device_private_page(const struct page *page)
 {
 	return is_zone_device_page(page) &&
@@ -176,7 +178,17 @@ static inline bool is_device_public_page(const struct page *page)
 	return is_zone_device_page(page) &&
 		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+#else /* CONFIG_DEVMAP_MANAGED_PAGES */
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEVMAP_MANAGED_PAGES */
 
 /**
  * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8c1e3ac77285..2d6cf2583e10 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -800,27 +800,32 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
-{
-}
-#define IS_HMM_ENABLED 0
-static inline bool is_device_private_page(const struct page *page)
+#ifdef CONFIG_DEVMAP_MANAGED_PAGES
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
 	return false;
 }
-static inline bool is_device_public_page(const struct page *page)
+#else /* CONFIG_DEVMAP_MANAGED_PAGES */
+static inline bool put_devmap_managed_page(struct page *page)
 {
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
+#endif /* CONFIG_DEVMAP_MANAGED_PAGES */
 
 static inline void get_page(struct page *page)
 {
@@ -838,16 +843,13 @@ static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index bf61cfa89c7d..8a4ebfe9db4e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -503,9 +503,26 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_DEVMAP_MANAGED_PAGES
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL(devmap_managed_key);
+static atomic_t devmap_enable;
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+void devmap_managed_pages_enable(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL(devmap_managed_pages_enable);
+
+void devmap_managed_pages_disable(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL(devmap_managed_pages_disable);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -525,5 +542,5 @@ void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL(__put_devmap_managed_page);
+#endif /* CONFIG_DEVMAP_MANAGED_PAGES */
diff --git a/mm/Kconfig b/mm/Kconfig
index 9c4bdddd80c2..8ee95197dc9f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -705,6 +705,9 @@ config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEVMAP_MANAGED_PAGES
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -725,6 +728,7 @@ config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEVMAP_MANAGED_PAGES
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -735,6 +739,7 @@ config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEVMAP_MANAGED_PAGES
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index a88a847bccba..53c8f1dd821d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -998,7 +989,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	devmap_managed_pages_enable();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1092,7 +1083,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	devmap_managed_pages_enable();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index a77d68f2c1b6..09c71044b565 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -772,7 +773,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}
 


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-10-31 23:21 ` Dan Williams
  (?)
@ 2017-10-31 23:22   ` Dan Williams
  -1 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Matthew Wilcox, linux-kernel, hch, linux-xfs, linux-mm,
	linux-fsdevel, akpm

Catch cases where truncate encounters pages that are still under active
DMA. This warning is a canary for potential data corruption, as the
truncated blocks could be allocated to a new file while the device is
still performing I/O to the old pages.
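
The truncate-side check itself is small; abridged from the fs/dax.c hunk
below, the walk over a radix entry's pages reduces to:

    /* abridged from dax_disassociate_entry() in the hunk below */
    for_each_entry_pfn(entry, pfn, end_pfn) {
            struct page *page = pfn_to_page(pfn);

            /*
             * an elevated refcount at truncate time means someone, e.g. a
             * get_user_pages() caller doing DMA, still holds the page
             */
            WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
            /* a page must only ever be associated with one inode at a time */
            WARN_ON_ONCE(page->inode && page->inode != inode);
            page->inode = NULL;
    }

The new page->inode association is what lets the MEMORY_DEVICE_FS_DAX
page-idle callbacks find the owning inode again (see the mm_types.h hunk).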

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   56 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   20 ++++++++++++----
 kernel/memremap.c        |   10 ++++----
 3 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac6497dcfebd..fd5d385988d1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -297,6 +297,55 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct inode *inode)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->inode);
+		page->inode = inode;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->inode && page->inode != inode);
+		page->inode = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -403,6 +452,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping->host, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -452,6 +502,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping->host, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -529,6 +580,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	unsigned long pfn = pfn_t_to_pfn(pfn_t);
+	struct inode *inode = mapping->host;
 	pgoff_t index = vmf->pgoff;
 	void *new_entry;
 
@@ -548,6 +600,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, inode, false);
+		dax_associate_entry(new_entry, inode);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46f4ecf5479a..dd976851e8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,11 +118,21 @@ struct page {
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
-		struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
-					    * lru or handled by a slab
-					    * allocator, this points to the
-					    * hosting device page map.
-					    */
+		struct {
+			/*
+			 * ZONE_DEVICE pages are never on an lru or handled by
+			 * a slab allocator, this points to the hosting device
+			 * page map.
+			 */
+			struct dev_pagemap *pgmap;
+			/*
+			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
+			 * callbacks. Note that we don't use ->mapping since
+			 * that has hard coded page-cache assumptions in
+			 * several paths.
+			 */
+			struct inode *inode;
+		};
 		struct {		/* slub per cpu partial pages */
 			struct page *next;	/* Next partial slab */
 #ifdef CONFIG_64BIT
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 8a4ebfe9db4e..f9a2929fc310 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -441,13 +441,13 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct page *page = pfn_to_page(pfn);
 
 		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
-		 * freed or placed on a driver-private list.  Seed the
-		 * storage with LIST_POISON* values.
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+		 * and ->inode (for the MEMORY_DEVICE_FS_DAX case) association.
+		 * It is a bug if a ZONE_DEVICE page is ever freed or placed on
+		 * a driver-private list.
 		 */
-		list_del(&page->lru);
 		page->pgmap = pgmap;
+		page->inode = NULL;
 		percpu_ref_get(ref);
 		if (!(++i % 1024))
 			cond_resched();


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Matthew Wilcox, linux-kernel, linux-xfs, linux-mm,
	Jeff Moyer, Ross Zwisler, linux-fsdevel, akpm, hch

Catch cases where truncate encounters pages that are still under active
dma. This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
perform i/o.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   56 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   20 ++++++++++++----
 kernel/memremap.c        |   10 ++++----
 3 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ac6497dcfebd..fd5d385988d1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -297,6 +297,55 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct inode *inode)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->inode);
+		page->inode = inode;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->inode && page->inode != inode);
+		page->inode = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -403,6 +452,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping->host, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -452,6 +502,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping->host, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -529,6 +580,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	unsigned long pfn = pfn_t_to_pfn(pfn_t);
+	struct inode *inode = mapping->host;
 	pgoff_t index = vmf->pgoff;
 	void *new_entry;
 
@@ -548,6 +600,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, inode, false);
+		dax_associate_entry(new_entry, inode);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46f4ecf5479a..dd976851e8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,11 +118,21 @@ struct page {
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
-		struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
-					    * lru or handled by a slab
-					    * allocator, this points to the
-					    * hosting device page map.
-					    */
+		struct {
+			/*
+			 * ZONE_DEVICE pages are never on an lru or handled by
+			 * a slab allocator, this points to the hosting device
+			 * page map.
+			 */
+			struct dev_pagemap *pgmap;
+			/*
+			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
+			 * callbacks. Note that we don't use ->mapping since
+			 * that has hard coded page-cache assumptions in
+			 * several paths.
+			 */
+			struct inode *inode;
+		};
 		struct {		/* slub per cpu partial pages */
 			struct page *next;	/* Next partial slab */
 #ifdef CONFIG_64BIT
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 8a4ebfe9db4e..f9a2929fc310 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -441,13 +441,13 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct page *page = pfn_to_page(pfn);
 
 		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
-		 * freed or placed on a driver-private list.  Seed the
-		 * storage with LIST_POISON* values.
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+		 * and ->inode (for the MEMORY_DEVICE_FS_DAX case) association.
+		 * It is a bug if a ZONE_DEVICE page is ever freed or placed on
+		 * a driver-private list.
 		 */
-		list_del(&page->lru);
 		page->pgmap = pgmap;
+		page->inode = NULL;
 		percpu_ref_get(ref);
 		if (!(++i % 1024))
 			cond_resched();
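
As an aside, the helpers added above lend themselves to a truncate-side
busy check. A rough sketch, not part of the posted patch (the
dax_entry_busy() name is made up), of what scanning a radix entry for
pages still under DMA would look like:

static bool dax_entry_busy(void *entry)
{
	unsigned long pfn, end_pfn;

	/* a count above 1 means get_user_pages() / DMA still holds the page */
	for_each_entry_pfn(entry, pfn, end_pfn)
		if (page_ref_count(pfn_to_page(pfn)) > 1)
			return true;
	return false;
}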

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 15/15] wait_bit: introduce {wait_on,wake_up}_devmap_idle
@ 2017-10-31 23:22   ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-10-31 23:22 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Peter Zijlstra, linux-kernel, linux-xfs, linux-mm, Ingo Molnar,
	linux-fsdevel, akpm, hch

Add hashed waitqueue infrastructure to wait for ZONE_DEVICE pages to
drop their reference counts and be considered idle for DMA. This
facility will be used for filesystem callbacks / wakeups when DMA to a
DAX mapped range of a file ends.

For now, this implementation introduces no functional change beyond
waking waitqueues that do not yet have any waiters present.
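
As a usage sketch (hypothetical caller, not part of this patch), a
filesystem waiting for a DAX page to become idle before letting a
truncate proceed would end up doing something like the following, with
page->_refcount == 1 as the "idle" target:

static int devmap_idle_action(atomic_t *count)
{
	schedule();
	return 0;
}

static void dax_wait_page_idle(struct page *page)
{
	wait_on_devmap_idle(&page->_refcount, devmap_idle_action,
			TASK_UNINTERRUPTIBLE);
}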

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c      |    1 +
 include/linux/wait_bit.h |   10 +++++++
 kernel/sched/wait_bit.c  |   64 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
 					 unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [PATCH 01/15] dax: quiet bdev_dax_supported()
@ 2017-11-02 20:11     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-02 20:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, linux-xfs, linux-mm, Jeff Moyer,
	linux-fsdevel, akpm, hch

On Tue, Oct 31, 2017 at 04:21:40PM -0700, Dan Williams wrote:
> Before we add another failure reason, quiet the existing log messages.
> Leave it to the caller to decide if bdev_dax_supported() failures are
> errors worth emitting to the log.
> 
> Reported-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/15] brd: remove dax support
@ 2017-11-02 20:12     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-02 20:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jens Axboe, akpm, Matthew Wilcox, linux-kernel,
	linux-xfs, linux-mm, linux-fsdevel, Ross Zwisler, hch

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 07/15] dax: stop requiring a live device for dax_flush()
@ 2017-11-02 20:12     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-02 20:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-kernel, linux-xfs, linux-mm, linux-fsdevel,
	akpm, hch

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas
@ 2017-11-02 20:13     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-02 20:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Sean Hefty, linux-xfs, akpm, linux-rdma,
	linux-kernel, Jeff Moyer, stable, hch, Jason Gunthorpe, linux-mm,
	Doug Ledford, linux-fsdevel, Ross Zwisler, Hal Rosenstock

Any chance we could add a new get_user_pages_longterm or similar
helper instead of open-coding this in the various callers?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas
@ 2017-11-02 21:06       ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-11-02 21:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, Sean Hefty, linux-xfs, Andrew Morton, linux-rdma,
	linux-kernel, Jeff Moyer, stable, Jason Gunthorpe, Linux MM,
	Doug Ledford, linux-fsdevel, Ross Zwisler, Hal Rosenstock

On Thu, Nov 2, 2017 at 1:13 PM, Christoph Hellwig <hch@lst.de> wrote:
> Any chance we could add a new get_user_pages_longerm or similar
> helper instead of opencoding this in the various callers?

Sounds like a great idea to me, I'll take a look...
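
For what it's worth, one possible shape for such a helper, as a rough
sketch (the name, the vma_is_fsdax() test, and the error code are
assumptions, not the patch that eventually lands; it also assumes the
caller passes a vmas array):

long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas)
{
	long rc, i;

	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
	if (rc <= 0)
		return rc;

	for (i = 0; i < rc; i++) {
		/* refuse long-lived pins of filesystem-dax mappings */
		if (vmas[i] && vma_is_fsdax(vmas[i])) {
			while (rc > 0)
				put_page(pages[--rc]);
			return -EOPNOTSUPP;
		}
	}
	return rc;
}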

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 02/15] mm, dax: introduce pfn_t_special()
@ 2017-11-03  2:32     ` Michael Ellerman
  0 siblings, 0 replies; 92+ messages in thread
From: Michael Ellerman @ 2017-11-03  2:32 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Benjamin Herrenschmidt, Heiko Carstens, linux-kernel, linux-xfs,
	linux-mm, Paul Mackerras, Martin Schwidefsky, linux-fsdevel,
	akpm, hch, Arnd Bergmann

Dan Williams <dan.j.williams@intel.com> writes:

> In support of removing the VM_MIXEDMAP indication from DAX VMAs,
> introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
> should be used for DAX ptes. This also helps identify drivers like
> dccssblk that only want to use DAX in a read-only fashion without
> get_user_pages() support.
>
> Ideally we could delete axonram and dcssblk DAX support, but if we need
> to keep it better make it explicit that axonram and dcssblk only support
> a sub-set of DAX due to missing _PAGE_DEVMAP support.
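
For reference, a pfn_t_special() helper along the lines described above
might look roughly like this (a sketch; the PFN_SPECIAL flag and its
exact definition are assumptions, not the posted patch):

#define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))

static inline bool pfn_t_special(pfn_t pfn)
{
#ifdef __HAVE_ARCH_PTE_SPECIAL
	/* only meaningful where the architecture supports pte_special() */
	return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
#else
	return false;
#endif
}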

I sent a patch to remove axonram (sorry meant to Cc you):

  http://patchwork.ozlabs.org/patch/833588/

Will see if there's any feedback.

cheers

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/15] brd: remove dax support
@ 2017-11-04 16:31     ` Jens Axboe
  0 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2017-11-04 16:31 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: akpm, Matthew Wilcox, linux-kernel, linux-xfs, linux-mm,
	linux-fsdevel, Ross Zwisler, hch

On 10/31/2017 05:21 PM, Dan Williams wrote:
> DAX support in brd is awkward because its backing page frames are
> distinct from the ones provided by pmem, dcssblk, or axonram. We need
> pfn_t_devmap() entries to fully support DAX, and the limited DAX support
> for pfn_t_special() page frames is not interesting for brd when pmem is
> already a superset of brd.  Lastly, brd is the only dax capable driver
> that may sleep in its ->direct_access() implementation. So it causes a
> global burden with no net gain of kernel functionality.
> 
> For all these reasons, remove DAX support.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 12/15] mm, dax: enable filesystems to trigger page-idle callbacks
@ 2017-11-10  9:04     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-10  9:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, linux-kernel, linux-xfs, linux-mm,
	Jérôme Glisse, linux-fsdevel, akpm, hch

> +DEFINE_MUTEX(devmap_lock);

static?

> +#if IS_ENABLED(CONFIG_FS_DAX)
> +static void generic_dax_pagefree(struct page *page, void *data)
> +{
> +}
> +
> +struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
> +{
> +	struct dax_device *dax_dev;
> +	struct dev_pagemap *pgmap;
> +
> +	if (!blk_queue_dax(bdev->bd_queue))
> +		return NULL;
> +	dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
> +	if (!dax_dev->pgmap)
> +		return dax_dev;
> +	pgmap = dax_dev->pgmap;

> +	mutex_lock(&devmap_lock);
> +	if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
> +			|| pgmap->page_fault
> +			|| pgmap->type != MEMORY_DEVICE_HOST) {
> +		put_dax(dax_dev);
> +		mutex_unlock(&devmap_lock);
> +		return NULL;
> +	}
> +
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = generic_dax_pagefree;
> +	pgmap->data = owner;
> +	mutex_unlock(&devmap_lock);

All this deep magic will need some explanation.  So far I don't understand
it at all, but maybe the later patches will help..
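
For context, the intended caller appears to be the filesystem at mount
time; a rough usage sketch (hypothetical, the function name and the use
of s_fs_info are made up):

static int example_claim_dax(struct super_block *sb,
		struct block_device *bdev)
{
	struct dax_device *dax_dev;

	/* become the owner of the pagemap's page-idle callbacks */
	dax_dev = fs_dax_claim_bdev(bdev, sb);
	if (!dax_dev)
		return -EOPNOTSUPP;

	sb->s_fs_info = dax_dev;	/* illustrative storage of the handle */
	return 0;
}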

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 13/15] mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES
@ 2017-11-10  9:06     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-10  9:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Michal Hocko, linux-kernel, linux-xfs, linux-mm,
	Jérôme Glisse, linux-fsdevel, akpm, hch

On Tue, Oct 31, 2017 at 04:22:46PM -0700, Dan Williams wrote:
> Combine the now three use cases of page-idle callbacks for ZONE_DEVICE
> memory into a common selectable symbol.

Very sparse changelog.  I understand the Kconfig bit, but it also seems to
introduce new static key functionality that isn't explained at all.
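
For context, the static-key functionality in question appears to gate
the put_page() path for device pages; roughly the following, as a
sketch reconstructed from the series (details may differ):

DEFINE_STATIC_KEY_FALSE(devmap_managed_key);

static inline bool put_devmap_managed_page(struct page *page)
{
	/* fast path: no devmap user has registered a page-idle callback */
	if (!static_branch_unlikely(&devmap_managed_key))
		return false;
	if (!is_zone_device_page(page))
		return false;
	/*
	 * "Idle" for ZONE_DEVICE means the refcount falling back to 1;
	 * fire the pgmap's ->page_free() callback at that point.
	 */
	if (page_ref_dec_return(page) == 1)
		page->pgmap->page_free(page, page->pgmap->data);
	return true;
}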

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
@ 2017-11-10  9:08     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-10  9:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Matthew Wilcox, linux-kernel, linux-xfs,
	linux-mm, Jeff Moyer, Ross Zwisler, linux-fsdevel, akpm, hch

> +		struct {
> +			/*
> +			 * ZONE_DEVICE pages are never on an lru or handled by
> +			 * a slab allocator, this points to the hosting device
> +			 * page map.
> +			 */
> +			struct dev_pagemap *pgmap;
> +			/*
> +			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> +			 * callbacks. Note that we don't use ->mapping since
> +			 * that has hard coded page-cache assumptions in
> +			 * several paths.
> +			 */

What assumptions?  I'd much rather fix those up than having two fields
that have the same functionality.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 15/15] wait_bit: introduce {wait_on,wake_up}_devmap_idle
@ 2017-11-10  9:09     ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2017-11-10  9:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Peter Zijlstra, linux-kernel, linux-xfs, linux-mm,
	Ingo Molnar, linux-fsdevel, akpm, hch

On Tue, Oct 31, 2017 at 04:22:56PM -0700, Dan Williams wrote:
> Add hashed waitqueue infrastructure to wait for ZONE_DEVICE pages to
> drop their reference counts and be considered idle for DMA. This
> facility will be used for filesystem callbacks / wakeups when DMA to a
> DAX mapped range of a file ends.
> 
> For now, this implementation does not have functional behavior change
> outside of waking waitqueues that do not have any waiters present.

Hmm.  What is the point of the patch then?

You also probably want to split this into one well documented patch
that changes the bit wait infrastructure and another one using it.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-11-10  9:08     ` Christoph Hellwig
@ 2017-12-20  1:11       ` Dan Williams
  -1 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-12-20  1:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, Jan Kara, Matthew Wilcox, linux-kernel, linux-xfs,
	Linux MM, Jeff Moyer, Ross Zwisler, linux-fsdevel, Andrew Morton

On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +             struct {
>> +                     /*
>> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> +                      * a slab allocator, this points to the hosting device
>> +                      * page map.
>> +                      */
>> +                     struct dev_pagemap *pgmap;
>> +                     /*
>> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> +                      * callbacks. Note that we don't use ->mapping since
>> +                      * that has hard coded page-cache assumptions in
>> +                      * several paths.
>> +                      */
>
> What assumptions?  I'd much rather fix those up than having two fields
> that have the same functionality.

[ Reviving this old thread where you asked why I introduce page->inode
instead of reusing page->mapping ]

For example, xfs_vm_set_page_dirty() assumes that page->mapping being
non-NULL indicates a typical page cache page, this is a false
assumption for DAX. My guess at a fix for this is to add
pagecache_page() checks to locations like this, but I worry about how
to find them all. Where pagecache_page() is:

bool pagecache_page(struct page *page)
{
        if (!page->mapping)
                return false;
        if (!IS_DAX(page->mapping->host))
                return false;
        return true;
}

Otherwise we go off the rails:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-20  1:11       ` Dan Williams
@ 2017-12-20 14:38         ` Jan Kara
  0 siblings, 0 replies; 92+ messages in thread
From: Jan Kara @ 2017-12-20 14:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, Jan Kara, Matthew Wilcox,
	linux-kernel, linux-xfs, Linux MM, Jeff Moyer, Ross Zwisler,
	linux-fsdevel, Andrew Morton

On Tue 19-12-17 17:11:38, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +             struct {
> >> +                     /*
> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> +                      * a slab allocator, this points to the hosting device
> >> +                      * page map.
> >> +                      */
> >> +                     struct dev_pagemap *pgmap;
> >> +                     /*
> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +                      * callbacks. Note that we don't use ->mapping since
> >> +                      * that has hard coded page-cache assumptions in
> >> +                      * several paths.
> >> +                      */
> >
> > What assumptions?  I'd much rather fix those up than having two fields
> > that have the same functionality.
> 
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
> 
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX. My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
> 
> bool pagecache_page(struct page *page)
> {
>         if (!page->mapping)
>                 return false;
>         if (!IS_DAX(page->mapping->host))
>                 return false;
>         return true;
> }
> 
> Otherwise we go off the rails:
> 
>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]

But this just shows that mapping->a_ops are wrong for this mapping, doesn't
it? ->set_page_dirty handler for DAX mapping should just properly handle
DAX pages... (and only those)

>  [..]
>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
> 4.15.0-rc2+ #984
>  [..]
>  Call Trace:
>   set_page_dirty_lock+0x40/0x60
>   bio_set_pages_dirty+0x37/0x50
>   iomap_dio_actor+0x2b7/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   iomap_apply+0xa4/0x110
>   iomap_dio_rw+0x29e/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>   __vfs_read+0xf9/0x170
>   vfs_read+0xa6/0x150
>   SyS_pread64+0x93/0xb0
>   entry_SYSCALL_64_fastpath+0x1f/0x96

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-20  1:11       ` Dan Williams
@ 2017-12-20 22:14         ` Dave Chinner
  0 siblings, 0 replies; 92+ messages in thread
From: Dave Chinner @ 2017-12-20 22:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Andrew Morton, Matthew Wilcox, linux-nvdimm,
	linux-kernel, linux-xfs, Linux MM, linux-fsdevel,
	Christoph Hellwig

On Tue, Dec 19, 2017 at 05:11:38PM -0800, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +             struct {
> >> +                     /*
> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> +                      * a slab allocator, this points to the hosting device
> >> +                      * page map.
> >> +                      */
> >> +                     struct dev_pagemap *pgmap;
> >> +                     /*
> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +                      * callbacks. Note that we don't use ->mapping since
> >> +                      * that has hard coded page-cache assumptions in
> >> +                      * several paths.
> >> +                      */
> >
> > What assumptions?  I'd much rather fix those up than having two fields
> > that have the same functionality.
> 
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
> 
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX.

That means every single filesystem has an incorrect assumption for
DAX pages. xfs_vm_set_page_dirty() is derived directly from
__set_page_dirty_buffers(), which is the default function that
set_page_dirty() calls to do its work. Indeed, ext4 also calls
__set_page_dirty_buffers(), so whatever problem XFS has here with
DAX and racing truncates is going to manifest in ext4 as well.
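
(Roughly, that dispatch looks like the following - a simplified sketch of that
era's mm/page-writeback.c with details elided, not the exact code:)

int set_page_dirty(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (likely(mapping)) {
		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;

		if (!spd)
			spd = __set_page_dirty_buffers;	/* the buffer_head default */
		return (*spd)(page);
	}
	/* no mapping: just record the dirty state on the page itself */
	if (!PageDirty(page))
		return !TestSetPageDirty(page);
	return 0;
}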

> My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
> 
> bool pagecache_page(struct page *page)
> {
>         if (!page->mapping)
>                 return false;
>         if (!IS_DAX(page->mapping->host))
>                 return false;
>         return true;
> }

This is likely to be a problem in lots more places if we have to
treat "has page been truncated away" race checks on dax mappings
differently to page cache mappings. This smells of a whack-a-mole
style bandaid to me....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-20 14:38         ` Jan Kara
@ 2017-12-20 22:41           ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-12-20 22:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Matthew Wilcox, linux-nvdimm, linux-kernel,
	linux-xfs, Linux MM, linux-fsdevel, Christoph Hellwig

On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> +             struct {
>> >> +                     /*
>> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> >> +                      * a slab allocator, this points to the hosting device
>> >> +                      * page map.
>> >> +                      */
>> >> +                     struct dev_pagemap *pgmap;
>> >> +                     /*
>> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> +                      * callbacks. Note that we don't use ->mapping since
>> >> +                      * that has hard coded page-cache assumptions in
>> >> +                      * several paths.
>> >> +                      */
>> >
>> > What assumptions?  I'd much rather fix those up than having two fields
>> > that have the same functionality.
>>
>> [ Reviving this old thread where you asked why I introduce page->inode
>> instead of reusing page->mapping ]
>>
>> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> non-NULL indicates a typical page cache page, this is a false
>> assumption for DAX. My guess at a fix for this is to add
>> pagecache_page() checks to locations like this, but I worry about how
>> to find them all. Where pagecache_page() is:
>>
>> bool pagecache_page(struct page *page)
>> {
>>         if (!page->mapping)
>>                 return false;
>>         if (!IS_DAX(page->mapping->host))
>>                 return false;
>>         return true;
>> }
>>
>> Otherwise we go off the rails:
>>
>>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>
> But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> it? ->set_page_dirty handler for DAX mapping should just properly handle
> DAX pages... (and only those)

Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
enable all the address_space_operations to start firing. However,
instead of adding DAX specific address_space_operations it appears
->mapping should never be set for DAX pages, because DAX pages are
disconnected from the page-writeback machinery. In other words never
setting ->mapping bypasses all the possible broken assumptions and
code paths that take page-cache specific actions before calling an
address_space_operation.

>
>>  [..]
>>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
>> 4.15.0-rc2+ #984
>>  [..]
>>  Call Trace:
>>   set_page_dirty_lock+0x40/0x60
>>   bio_set_pages_dirty+0x37/0x50
>>   iomap_dio_actor+0x2b7/0x3b0
>>   ? iomap_dio_zero+0x110/0x110
>>   iomap_apply+0xa4/0x110
>>   iomap_dio_rw+0x29e/0x3b0
>>   ? iomap_dio_zero+0x110/0x110
>>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>>   __vfs_read+0xf9/0x170
>>   vfs_read+0xa6/0x150
>>   SyS_pread64+0x93/0xb0
>>   entry_SYSCALL_64_fastpath+0x1f/0x96
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-20 22:41           ` Dan Williams
@ 2017-12-21 12:14             ` Jan Kara
  0 siblings, 0 replies; 92+ messages in thread
From: Jan Kara @ 2017-12-21 12:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Christoph Hellwig, linux-nvdimm, Matthew Wilcox,
	linux-kernel, linux-xfs, Linux MM, Jeff Moyer, Ross Zwisler,
	linux-fsdevel, Andrew Morton

On Wed 20-12-17 14:41:14, Dan Williams wrote:
> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> +             struct {
> >> >> +                     /*
> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> >> +                      * a slab allocator, this points to the hosting device
> >> >> +                      * page map.
> >> >> +                      */
> >> >> +                     struct dev_pagemap *pgmap;
> >> >> +                     /*
> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> +                      * callbacks. Note that we don't use ->mapping since
> >> >> +                      * that has hard coded page-cache assumptions in
> >> >> +                      * several paths.
> >> >> +                      */
> >> >
> >> > What assumptions?  I'd much rather fix those up than having two fields
> >> > that have the same functionality.
> >>
> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> instead of reusing page->mapping ]
> >>
> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> non-NULL indicates a typical page cache page, this is a false
> >> assumption for DAX. My guess at a fix for this is to add
> >> pagecache_page() checks to locations like this, but I worry about how
> >> to find them all. Where pagecache_page() is:
> >>
> >> bool pagecache_page(struct page *page)
> >> {
> >>         if (!page->mapping)
> >>                 return false;
> >>         if (!IS_DAX(page->mapping->host))
> >>                 return false;
> >>         return true;
> >> }
> >>
> >> Otherwise we go off the rails:
> >>
> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >
> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> > DAX pages... (and only those)
> 
> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> enable all the address_space_operations to start firing. However,
> instead of adding DAX specific address_space_operations it appears
> ->mapping should never be set for DAX pages, because DAX pages are
> disconnected from the page-writeback machinery.

page->mapping is not only about page-writeback machinery. It is generally
about page <-> inode relation and that still exists for DAX pages. We even
reuse the mapping->page_tree to store DAX pages. Also requiring proper
address_space_operations for DAX inodes is IMO not a bad thing as such.

That being said I'm not 100% convinced we should really set page->mapping
for DAX pages. After all they are not page cache pages but rather a
physical storage for the data, don't ever get to LRU, etc. But if you need
page->inode relation somewhere, that is a good indication to me that it
might be just easier to set page->mapping and provide aops that do the
right thing (i.e. usually not much) for them.

BTW: the ->set_page_dirty() in particular actually *does* need to do
something for DAX pages - corresponding radix tree entries should be
marked dirty so that caches can get flushed when needed.
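
(A hedged sketch of what such a handler could do, assuming page->mapping and
page->index have been wired up for the DAX page; not an actual implementation:)

static int dax_set_page_dirty(struct page *page)
{
	struct address_space *mapping = page->mapping;

	/* tag the entry so dax_writeback_mapping_range() later flushes it */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, page->index,
			   PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);

	/* simplified: report the page as newly dirtied */
	return 1;
}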

> In other words never
> setting ->mapping bypasses all the possible broken assumptions and
> code paths that take page-cache specific actions before calling an
> address_space_operation.

If there are any assumptions left after aops are set properly, then we can
reconsider this but for now setting ->mapping and proper aops looks cleaner
to me...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-21 12:14             ` Jan Kara
@ 2017-12-21 17:31               ` Dan Williams
  0 siblings, 0 replies; 92+ messages in thread
From: Dan Williams @ 2017-12-21 17:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Matthew Wilcox, linux-kernel,
	linux-xfs, Linux MM, Jeff Moyer, Ross Zwisler, linux-fsdevel,
	Andrew Morton

On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 20-12-17 14:41:14, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> +             struct {
>> >> >> +                     /*
>> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> >> >> +                      * a slab allocator, this points to the hosting device
>> >> >> +                      * page map.
>> >> >> +                      */
>> >> >> +                     struct dev_pagemap *pgmap;
>> >> >> +                     /*
>> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> >> +                      * callbacks. Note that we don't use ->mapping since
>> >> >> +                      * that has hard coded page-cache assumptions in
>> >> >> +                      * several paths.
>> >> >> +                      */
>> >> >
>> >> > What assumptions?  I'd much rather fix those up than having two fields
>> >> > that have the same functionality.
>> >>
>> >> [ Reviving this old thread where you asked why I introduce page->inode
>> >> instead of reusing page->mapping ]
>> >>
>> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> >> non-NULL indicates a typical page cache page, this is a false
>> >> assumption for DAX. My guess at a fix for this is to add
>> >> pagecache_page() checks to locations like this, but I worry about how
>> >> to find them all. Where pagecache_page() is:
>> >>
>> >> bool pagecache_page(struct page *page)
>> >> {
>> >>         if (!page->mapping)
>> >>                 return false;
>> >>         if (!IS_DAX(page->mapping->host))
>> >>                 return false;
>> >>         return true;
>> >> }
>> >>
>> >> Otherwise we go off the rails:
>> >>
>> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>> >
>> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
>> > it? ->set_page_dirty handler for DAX mapping should just properly handle
>> > DAX pages... (and only those)
>>
>> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
>> enable all the address_space_operations to start firing. However,
>> instead of adding DAX specific address_space_operations it appears
>> ->mapping should never be set for DAX pages, because DAX pages are
>> disconnected from the page-writeback machinery.
>
> page->mapping is not only about page-writeback machinery. It is generally
> about page <-> inode relation and that still exists for DAX pages. We even
> reuse the mapping->page_tree to store DAX pages. Also requiring proper
> address_space_operations for DAX inodes is IMO not a bad thing as such.
>
> That being said I'm not 100% convinced we should really set page->mapping
> for DAX pages. After all they are not page cache pages but rather a
> physical storage for the data, don't ever get to LRU, etc. But if you need
> page->inode relation somewhere, that is a good indication to me that it
> might be just easier to set page->mapping and provide aops that do the
> right thing (i.e. usually not much) for them.
>
> BTW: the ->set_page_dirty() in particular actually *does* need to do
> something for DAX pages - corresponding radix tree entries should be
> marked dirty so that caches can get flushed when needed.

For this specific concern, the get_user_pages() path will have
triggered mkwrite, so the dax dirty tracking in the radix will have
already happened by the time we call ->set_page_dirty(). So, it's not
yet clear to me that we need that particular op.

>> In other words never
>> setting ->mapping bypasses all the possible broken assumptions and
>> code paths that take page-cache specific actions before calling an
>> address_space_operation.
>
> If there are any assumptions left after aops are set properly, then we can
> reconsider this but for now setting ->mapping and proper aops looks cleaner
> to me...

I'll try an address_space_operation with a nop ->set_page_dirty() and
see if anything else falls out.
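
(Roughly what that experiment might look like - a hypothetical sketch, not the
eventual patch:)

static int dax_nop_set_page_dirty(struct page *page)
{
	/*
	 * Dirty tracking for DAX already happened via the radix tree entry
	 * at fault/mkwrite time, so there is nothing to do per-page here.
	 */
	return 0;
}

static const struct address_space_operations dax_experiment_aops = {
	.set_page_dirty	= dax_nop_set_page_dirty,
	/* everything else left NULL for now, to see what falls out */
};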

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate
  2017-12-21 17:31               ` Dan Williams
@ 2017-12-22  8:51                 ` Jan Kara
  0 siblings, 0 replies; 92+ messages in thread
From: Jan Kara @ 2017-12-22  8:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Christoph Hellwig, linux-nvdimm, Matthew Wilcox,
	linux-kernel, linux-xfs, Linux MM, Jeff Moyer, Ross Zwisler,
	linux-fsdevel, Andrew Morton

On Thu 21-12-17 09:31:50, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 20-12-17 14:41:14, Dan Williams wrote:
> >> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> >> +             struct {
> >> >> >> +                     /*
> >> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> >> >> +                      * a slab allocator, this points to the hosting device
> >> >> >> +                      * page map.
> >> >> >> +                      */
> >> >> >> +                     struct dev_pagemap *pgmap;
> >> >> >> +                     /*
> >> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> >> +                      * callbacks. Note that we don't use ->mapping since
> >> >> >> +                      * that has hard coded page-cache assumptions in
> >> >> >> +                      * several paths.
> >> >> >> +                      */
> >> >> >
> >> >> > What assumptions?  I'd much rather fix those up than having two fields
> >> >> > that have the same functionality.
> >> >>
> >> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> >> instead of reusing page->mapping ]
> >> >>
> >> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> >> non-NULL indicates a typical page cache page, this is a false
> >> >> assumption for DAX. My guess at a fix for this is to add
> >> >> pagecache_page() checks to locations like this, but I worry about how
> >> >> to find them all. Where pagecache_page() is:
> >> >>
> >> >> bool pagecache_page(struct page *page)
> >> >> {
> >> >>         if (!page->mapping)
> >> >>                 return false;
> >> >>         if (!IS_DAX(page->mapping->host))
> >> >>                 return false;
> >> >>         return true;
> >> >> }
> >> >>
> >> >> Otherwise we go off the rails:
> >> >>
> >> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >> >
> >> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> >> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> >> > DAX pages... (and only those)
> >>
> >> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> >> enable all the address_space_operations to start firing. However,
> >> instead of adding DAX specific address_space_operations it appears
> >> ->mapping should never be set for DAX pages, because DAX pages are
> >> disconnected from the page-writeback machinery.
> >
> > page->mapping is not only about page-writeback machinery. It is generally
> > about page <-> inode relation and that still exists for DAX pages. We even
> > reuse the mapping->page_tree to store DAX pages. Also requiring proper
> > address_space_operations for DAX inodes is IMO not a bad thing as such.
> >
> > That being said I'm not 100% convinced we should really set page->mapping
> > for DAX pages. After all they are not page cache pages but rather a
> > physical storage for the data, don't ever get to LRU, etc. But if you need
> > page->inode relation somewhere, that is a good indication to me that it
> > might be just easier to set page->mapping and provide aops that do the
> > right thing (i.e. usually not much) for them.
> >
> > BTW: the ->set_page_dirty() in particular actually *does* need to do
> > something for DAX pages - corresponding radix tree entries should be
> > marked dirty so that caches can get flushed when needed.
> 
> For this specific concern, the get_user_pages() path will have
> triggered mkwrite, so the dax dirty tracking in the radix will have
> already happened by the time we call ->set_page_dirty(). So, it's not
> yet clear to me that we need that particular op.

Right, but it would still be nice to at least verify in ->set_page_dirty()
that the DAX page is marked as dirty in the radix tree (that may be
slightly expensive to keep forever but probably still worth it at least for
some time).

BTW: You just made me realize get_user_pages() is likely the path I have
been looking for for about a year, which can indeed dirty a mapped page
from a path other than a page fault (I'm now speaking about the non-DAX case
but it translates to DAX as well). I have been getting sporadic reports
of filesystems (ext4 and xfs) crashing because a page was dirty but the
filesystem was not prepared for that (which happens in ->page_mkwrite or
->write_begin). What I think could happen is that someone used an mmapped
file as a buffer for DIO or something like that, which triggered GUP, which
triggered a fault and dirtied a page. Then kswapd came, wrote the page,
write-protected it in the page tables, and reclaimed buffers from the page.
It could not free the page itself because DIO still held a reference to it.
And then DIO eventually called bio_set_pages_dirty() which marked the pages
dirty and the filesystem eventually barfed on these... I'll talk to MM people
about this and how it could possibly be fixed - but probably after
Christmas, so I'm writing it here so that I don't forget :).
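
(The tail end of that scenario, sketched after block/bio.c's
bio_set_pages_dirty() of that era - simplified, details elided:)

static void example_dio_redirty_pages(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	bio_for_each_segment_all(bvec, bio, i) {
		struct page *page = bvec->bv_page;

		/* may hit a page whose buffers kswapd already stripped */
		if (page && !PageCompound(page))
			set_page_dirty_lock(page);
	}
}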

> >> In other words never
> >> setting ->mapping bypasses all the possible broken assumptions and
> >> code paths that take page-cache specific actions before calling an
> >> address_space_operation.
> >
> > If there are any assumptions left after aops are set properly, then we can
> > reconsider this but for now setting ->mapping and proper aops looks cleaner
> > to me...
> 
> I'll try an address_space_operation with a nop ->set_page_dirty() and
> see if anything else falls out.

There are also other aops which would be good to NOPify. readpage and
writepage should probably directly BUG; write_begin and write_end as well,
since we expect iomap to be used with DAX. Ditto for invalidatepage,
releasepage, freepage, launder_page, is_dirty_writeback, and isolate_page,
as we don't expect page reclaim for our pages. Other functions in there
would need some thought...
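
(For instance, hypothetical stubs along these lines - the names here are
invented for illustration:)

static int dax_bug_readpage(struct file *file, struct page *page)
{
	BUG();	/* DAX reads never go through ->readpage */
	return -EIO;
}

static int dax_bug_writepage(struct page *page, struct writeback_control *wbc)
{
	BUG();	/* DAX writeback goes through dax_writeback_mapping_range() */
	return -EIO;
}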

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2017-12-22 10:41 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 23:21 [PATCH 00/15] dax: prep work for fixing dax-dma vs truncate collisions Dan Williams
2017-10-31 23:21 ` Dan Williams
2017-10-31 23:21 ` Dan Williams
2017-10-31 23:21 ` [PATCH 01/15] dax: quiet bdev_dax_supported() Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-11-02 20:11   ` Christoph Hellwig
2017-11-02 20:11     ` Christoph Hellwig
2017-10-31 23:21 ` [PATCH 02/15] mm, dax: introduce pfn_t_special() Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-11-03  2:32   ` Michael Ellerman
2017-11-03  2:32     ` Michael Ellerman
2017-11-03  2:32     ` Michael Ellerman
2017-10-31 23:21 ` [PATCH 03/15] dax: require 'struct page' by default for filesystem dax Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-10-31 23:21 ` [PATCH 04/15] brd: remove dax support Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-10-31 23:21   ` Dan Williams
2017-11-02 20:12   ` Christoph Hellwig
2017-11-02 20:12     ` Christoph Hellwig
2017-11-04 16:31   ` Jens Axboe
2017-11-04 16:31     ` Jens Axboe
2017-11-04 16:31     ` Jens Axboe
2017-10-31 23:22 ` [PATCH 05/15] dax: stop using VM_MIXEDMAP for dax Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22 ` [PATCH 06/15] dax: stop using VM_HUGEPAGE " Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22 ` [PATCH 07/15] dax: stop requiring a live device for dax_flush() Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-02 20:12   ` Christoph Hellwig
2017-11-02 20:12     ` Christoph Hellwig
2017-10-31 23:22 ` [PATCH 08/15] dax: store pfns in the radix Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22 ` [PATCH 09/15] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22 ` [PATCH 10/15] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-02 20:13   ` Christoph Hellwig
2017-11-02 20:13     ` Christoph Hellwig
2017-11-02 21:06     ` Dan Williams
2017-11-02 21:06       ` Dan Williams
2017-11-02 21:06       ` Dan Williams
2017-10-31 23:22 ` [PATCH 11/15] [media] v4l2: disable filesystem-dax mapping support Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22 ` [PATCH 12/15] mm, dax: enable filesystems to trigger page-idle callbacks Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-10  9:04   ` Christoph Hellwig
2017-11-10  9:04     ` Christoph Hellwig
2017-10-31 23:22 ` [PATCH 13/15] mm, devmap: introduce CONFIG_DEVMAP_MANAGED_PAGES Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-10  9:06   ` Christoph Hellwig
2017-11-10  9:06     ` Christoph Hellwig
2017-10-31 23:22 ` [PATCH 14/15] dax: associate mappings with inodes, and warn if dma collides with truncate Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-10  9:08   ` Christoph Hellwig
2017-11-10  9:08     ` Christoph Hellwig
2017-11-10  9:08     ` Christoph Hellwig
2017-12-20  1:11     ` Dan Williams
2017-12-20  1:11       ` Dan Williams
2017-12-20 14:38       ` Jan Kara
2017-12-20 14:38         ` Jan Kara
2017-12-20 22:41         ` Dan Williams
2017-12-20 22:41           ` Dan Williams
2017-12-20 22:41           ` Dan Williams
2017-12-21 12:14           ` Jan Kara
2017-12-21 12:14             ` Jan Kara
2017-12-21 17:31             ` Dan Williams
2017-12-21 17:31               ` Dan Williams
2017-12-22  8:51               ` Jan Kara
2017-12-22  8:51                 ` Jan Kara
2017-12-20 22:14       ` Dave Chinner
2017-12-20 22:14         ` Dave Chinner
2017-12-20 22:14         ` Dave Chinner
2017-10-31 23:22 ` [PATCH 15/15] wait_bit: introduce {wait_on,wake_up}_devmap_idle Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-10-31 23:22   ` Dan Williams
2017-11-10  9:09   ` Christoph Hellwig
2017-11-10  9:09     ` Christoph Hellwig
