* [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-20  2:38 ` Dan Williams
From: Dan Williams @ 2017-10-20  2:38 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Sean Hefty, hch, Matthew Wilcox, linux-rdma, Michael Ellerman,
	Jason Gunthorpe, Doug Ledford, Hal Rosenstock, Dave Chinner,
	linux-fsdevel, Alexander Viro, Jeff Layton, Gerald Schaefer,
	linux-nvdimm, linux-kernel, linux-xfs, Martin Schwidefsky,
	Darrick J. Wong, Kirill A. Shutemov

Changes since v2 [1]:
* Add 'dax: handle truncate of dma-busy pages', which builds on the
  removal of page-less dax to fix a latent bug in the handling of dma
  vs truncate.
* Disable get_user_pages_fast() for dax
* Disable RDMA memory registrations against filesystem-DAX mappings for
  non-ODP (On Demand Paging / Shared Virtual Memory) hardware.
* Fix a compile error when building with HMM enabled

---
tl;dr: A brute-force approach to ensure that truncate waits for any
in-flight DMA before returning filesystem-DAX blocks to the filesystem's
block allocator.

While reviewing the MAP_DIRECT proposal Christoph noted:

    get_user_pages on DAX doesn't give the same guarantees as on
    pagecache or anonymous memory, and that is the problem we need to
    fix. In fact I'm pretty sure if we try hard enough (and we might
    have to try very hard) we can see the same problem with plain direct
    I/O and without any RDMA involved, e.g. do a larger direct I/O write
    to memory that is mmap()ed from a DAX file, then truncate the DAX
    file and reallocate the blocks, and we might corrupt that new file.
    We'll probably need a special setup where there is little other
    chance but to reallocate those used blocks.
    
    So what we need to do first is to fix get_user_pages vs unmapping
    DAX mmap()ed blocks, be that from a hole punch, truncate, COW
    operation, etc.

I was able to trigger the failure with "[PATCH v3 08/13]
tools/testing/nvdimm: add 'bio_delay' mechanism", which keeps block-I/O
pages busy long enough for a punch-hole operation to truncate the blocks
before the DMA finishes.
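
For reference, a minimal userspace sketch of the scenario (the paths and
sizes are hypothetical and error handling is omitted; the real trigger
additionally relies on 'bio_delay' to hold the I/O pages busy long
enough to lose the race):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;
	int dax_fd = open("/mnt/dax/file", O_RDWR);	/* filesystem-dax file */
	int blk_fd = open("/mnt/disk/file", O_RDWR | O_DIRECT);

	/* the direct-I/O source buffer is filesystem-dax mapped memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			dax_fd, 0);

	/*
	 * get_user_pages() pins the dax pages for the duration of the
	 * write(); in the failing case a punch-hole on dax_fd runs from
	 * another task while this DMA is in flight and the filesystem
	 * frees and reallocates the underlying blocks.
	 */
	write(blk_fd, buf, len);

	/* racing task:
	 * fallocate(dax_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	 *	     0, len);
	 */
	return 0;
}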

The solution presented is not pretty. It creates a stream of leases, one
for each get_user_pages() invocation, and polls page reference counts
until DMA stops. We are missing a reliable way both to trap the DMA-idle
event and to block new references from being taken on pages while
truncate is in progress. "[PATCH v3 12/13] dax: handle truncate of
dma-busy pages" describes the other options considered and notes that
this solution can only be viewed as a stop-gap.
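
Roughly, the truncate/hole-punch path now has to spin until the dax page
goes idle before the blocks can be returned to the allocator. A
simplified sketch of that wait (the helper name and polling interval are
illustrative, not the exact code in patch 12):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* illustrative: wait out get_user_pages() references on a dax page */
static void dax_wait_dma_idle(struct page *page)
{
	/*
	 * An idle dax (devmap) page holds a single reference; anything
	 * above that is a get_user_pages() user, i.e. a potential DMA
	 * target that truncate must not free out from under.
	 */
	while (page_count(page) > 1)
		schedule_timeout_interruptible(msecs_to_jiffies(10));
}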

Given the need to poll page-reference counts, this approach builds on
the removal of 'page-less DAX' support. In response to the last
submission, Andrew asked for clarification on the move to require pages
for DAX. Quoting "[PATCH v3 02/13] dax: require 'struct page' for
filesystem dax":

    Note that when the initial dax support was being merged a few years
    back there was concern that struct page was unsuitable for use with
    next generation persistent memory devices. The theoretical concern
    was that struct page access, being such a hotly used data structure
    in the kernel, would lead to media wear out. While that was a
    reasonably conservative starting position, it has not held true in
    practice. We have long since committed to using
    devm_memremap_pages() to support higher order kernel functionality
    that needs get_user_pages() and pfn_to_page().
 

---

Dan Williams (13):
      dax: quiet bdev_dax_supported()
      dax: require 'struct page' for filesystem dax
      dax: stop using VM_MIXEDMAP for dax
      dax: stop using VM_HUGEPAGE for dax
      dax: stop requiring a live device for dax_flush()
      dax: store pfns in the radix
      dax: warn if dma collides with truncate
      tools/testing/nvdimm: add 'bio_delay' mechanism
      IB/core: disable memory registration of filesystem-dax vmas
      mm: disable get_user_pages_fast() for dax
      fs: use smp_load_acquire in break_{layout,lease}
      dax: handle truncate of dma-busy pages
      xfs: wire up FL_ALLOCATED support


 arch/powerpc/sysdev/axonram.c         |    1 
 drivers/dax/device.c                  |    1 
 drivers/dax/super.c                   |   18 +-
 drivers/infiniband/core/umem.c        |   49 ++++-
 drivers/s390/block/dcssblk.c          |    1 
 fs/Kconfig                            |    1 
 fs/dax.c                              |  296 ++++++++++++++++++++++++++++-----
 fs/ext2/file.c                        |    1 
 fs/ext4/file.c                        |    1 
 fs/locks.c                            |   17 ++
 fs/xfs/xfs_aops.c                     |   24 +++
 fs/xfs/xfs_file.c                     |   66 +++++++
 fs/xfs/xfs_inode.h                    |    1 
 fs/xfs/xfs_ioctl.c                    |    7 -
 include/linux/dax.h                   |   23 +++
 include/linux/fs.h                    |   32 +++-
 include/linux/vma.h                   |   33 ++++
 mm/gup.c                              |   75 ++++----
 mm/huge_memory.c                      |    8 -
 mm/ksm.c                              |    3 
 mm/madvise.c                          |    2 
 mm/memory.c                           |   20 ++
 mm/migrate.c                          |    3 
 mm/mlock.c                            |    5 -
 mm/mmap.c                             |    8 -
 tools/testing/nvdimm/Kbuild           |    1 
 tools/testing/nvdimm/test/iomap.c     |   62 +++++++
 tools/testing/nvdimm/test/nfit.c      |   34 ++++
 tools/testing/nvdimm/test/nfit_test.h |    1 
 29 files changed, 651 insertions(+), 143 deletions(-)
 create mode 100644 include/linux/vma.h

* [PATCH v3 01/13] dax: quiet bdev_dax_supported()
@ 2017-10-20  2:39   ` Dan Williams
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm; +Cc: linux-nvdimm, linux-kernel, linux-xfs, linux-mm, linux-fsdevel, hch

Before we add another failure reason, quiet the existing log messages by
demoting them to pr_debug(). Leave it to the caller to decide whether
bdev_dax_supported() failures are errors worth emitting to the log.
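
For illustration only (not part of this patch): a caller that treats a
missing dax capability as fatal can still surface the failure itself now
that these messages are pr_debug(), e.g. something like:

/* hypothetical mount-time caller */
static int example_mount_check_dax(struct super_block *sb)
{
	int err = __bdev_dax_supported(sb, PAGE_SIZE);

	if (err)
		pr_err("VFS (%s): mount requested dax, but it is not supported (%d)\n",
				sb->s_id, err);
	return err;
}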

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 557b93703532..b0cc8117eebe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -92,21 +92,21 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	long len;
 
 	if (blocksize != PAGE_SIZE) {
-		pr_err("VFS (%s): error: unsupported blocksize for dax\n",
+		pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
 				sb->s_id);
 		return -EINVAL;
 	}
 
 	err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
 	if (err) {
-		pr_err("VFS (%s): error: unaligned partition for dax\n",
+		pr_debug("VFS (%s): error: unaligned partition for dax\n",
 				sb->s_id);
 		return err;
 	}
 
 	dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
 	if (!dax_dev) {
-		pr_err("VFS (%s): error: device does not support dax\n",
+		pr_debug("VFS (%s): error: device does not support dax\n",
 				sb->s_id);
 		return -EOPNOTSUPP;
 	}
@@ -118,7 +118,7 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 	put_dax(dax_dev);
 
 	if (len < 1) {
-		pr_err("VFS (%s): error: dax access failed (%ld)",
+		pr_debug("VFS (%s): error: dax access failed (%ld)\n",
 				sb->s_id, len);
 		return len < 0 ? len : -EIO;
 	}


* [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
@ 2017-10-20  2:39   ` Dan Williams
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, linux-mm, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel, hch,
	Gerald Schaefer

If a dax buffer from a device that does not map pages is passed to
read(2) or write(2) as a target for direct I/O, it triggers SIGBUS. If
gdb attempts to examine the contents of a dax buffer from a device that
does not map pages, it triggers SIGBUS. If fork(2) is called on a
process with a dax mapping from a device that does not map pages, it
triggers SIGBUS. 'struct page' is required, otherwise several kernel
code paths break in surprising ways. Disable filesystem-dax on devices
that do not map pages.

In addition to needing pfn_to_page() to be valid, we also require devmap
pages. We need this to detect dax pages in the get_user_pages_fast()
path and so that we can stop managing the VM_MIXEDMAP flag. This impacts
the dax drivers that do not use devm_memremap_pages(): brd, dcssblk, and
axonram.

Note that when the initial dax support was being merged a few years back
there was concern that struct page was unsuitable for use with next
generation persistent memory devices. The theoretical concern was that
struct page access, being such a hotly used data structure in the
kernel, would lead to media wear out. While that was a reasonably
conservative starting position, it has not held true in practice. We have
long since committed to using devm_memremap_pages() to support higher
order kernel functionality that needs get_user_pages() and
pfn_to_page().
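
As an aside (an illustrative sketch, not code from this series): the
->direct_access() implementations that remain eligible for
filesystem-dax are the ones whose pfns come from
devm_memremap_pages()-backed memory and therefore carry the devmap flag
that the new check looks for:

#include <linux/pfn_t.h>

static long example_dax_direct_access(struct dax_device *dax_dev,
		pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn)
{
	/* example_to_phys()/example_to_virt() are hypothetical helpers */
	phys_addr_t phys = example_to_phys(dax_dev, pgoff);

	*kaddr = example_to_virt(dax_dev, pgoff);
	/* PFN_DEV | PFN_MAP is what satisfies pfn_t_devmap() */
	*pfn = phys_to_pfn_t(phys, PFN_DEV | PFN_MAP);
	return nr_pages;
}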

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    1 +
 drivers/dax/super.c           |    7 +++++++
 drivers/s390/block/dcssblk.c  |    1 +
 3 files changed, 9 insertions(+)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index c60e84e4558d..9da64d95e6f1 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -172,6 +172,7 @@ static size_t axon_ram_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
 
 static const struct dax_operations axon_ram_dax_ops = {
 	.direct_access = axon_ram_dax_direct_access,
+
 	.copy_from_iter = axon_ram_copy_from_iter,
 };
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index b0cc8117eebe..26c324a5aef4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -15,6 +15,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/genhd.h>
+#include <linux/pfn_t.h>
 #include <linux/cdev.h>
 #include <linux/hash.h>
 #include <linux/slab.h>
@@ -123,6 +124,12 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		return len < 0 ? len : -EIO;
 	}
 
+	if (!pfn_t_devmap(pfn)) {
+		pr_debug("VFS (%s): error: dax support not enabled\n",
+				sb->s_id);
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(__bdev_dax_supported);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 7abb240847c0..e7e5db07e339 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -52,6 +52,7 @@ static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
 
 static const struct dax_operations dcssblk_dax_ops = {
 	.direct_access = dcssblk_dax_direct_access,
+
 	.copy_from_iter = dcssblk_dax_copy_from_iter,
 };
 


* [PATCH v3 03/13] dax: stop using VM_MIXEDMAP for dax
@ 2017-10-20  2:39   ` Dan Williams
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Jan Kara, linux-nvdimm, linux-kernel, linux-xfs,
	linux-mm, linux-fsdevel, hch, Kirill A. Shutemov

VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map. The get_user_pages_fast() path, since it does not resolve the vma,
already uses {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use
that as a VM_MIXEDMAP replacement in some locations. In the cases where
there is no pte to consult, we fall back to vma_is_dax() to detect the
VM_MIXEDMAP special case.

Now that we always have pages for DAX, we can stop setting VM_MIXEDMAP.
This also means we no longer need to worry about safely manipulating
vm_flags in a future where we support dynamically changing the dax mode
of a file.

Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    2 +-
 fs/ext2/file.c       |    1 -
 fs/ext4/file.c       |    2 +-
 fs/xfs/xfs_file.c    |    2 +-
 include/linux/vma.h  |   33 +++++++++++++++++++++++++++++++++
 mm/huge_memory.c     |    8 ++++----
 mm/ksm.c             |    3 +++
 mm/madvise.c         |    2 +-
 mm/memory.c          |   20 ++++++++++++++++++--
 mm/migrate.c         |    3 ++-
 mm/mlock.c           |    5 +++--
 mm/mmap.c            |    8 ++++----
 12 files changed, 71 insertions(+), 18 deletions(-)
 create mode 100644 include/linux/vma.h

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..ed79d006026e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,7 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..70657e8550ed 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -125,7 +125,6 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	file_accessed(file);
 	vma->vm_ops = &ext2_dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP;
 	return 0;
 }
 #else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1da660ac3bc..0cc9d205bd96 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..c419c6fdb769 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1134,7 +1134,7 @@ xfs_file_mmap(
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/include/linux/vma.h b/include/linux/vma.h
new file mode 100644
index 000000000000..135ad5262cd1
--- /dev/null
+++ b/include/linux/vma.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __VMA_H__
+#define __VMA_H__
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/hugetlb_inline.h>
+
+/*
+ * There are several vma types that have special handling in the
+ * get_user_pages() path and other core mm paths that must not assume
+ * normal pages. vma_is_special() consolidates checks for VM_SPECIAL,
+ * hugetlb and dax vmas, but note that there are 'special' vmas and
+ * special circumstances beyond these types. In other words this helper
+ * is not exhaustive.
+ */
+static inline bool vma_is_special(struct vm_area_struct *vma)
+{
+	return vma && (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)
+			|| vma_is_dax(vma));
+}
+#endif /* __VMA_H__ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 269b5df58543..c69d30e27fd9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,11 +765,11 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| pfn_t_devmap(pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
@@ -824,11 +824,11 @@ int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	 * but we need to be consistent with PTEs and architectures that
 	 * can't support a 'special' bit.
 	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+	BUG_ON(!((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))
+				|| pfn_t_devmap(pfn)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-	BUG_ON(!pfn_t_devmap(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/ksm.c b/mm/ksm.c
index 6cb60f46cce5..72f196a36503 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2361,6 +2361,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 				 VM_HUGETLB | VM_MIXEDMAP))
 			return 0;		/* just ignore the advice */
 
+		if (vma_is_dax(vma))
+			return 0;
+
 #ifdef VM_SAO
 		if (*vm_flags & VM_SAO)
 			return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 25bade36e9ca..50513a7a11f6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -95,7 +95,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		new_flags |= VM_DONTDUMP;
 		break;
 	case MADV_DODUMP:
-		if (new_flags & VM_SPECIAL) {
+		if (vma_is_dax(vma) || (new_flags & VM_SPECIAL)) {
 			error = -EINVAL;
 			goto out;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..cab46226eed1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -830,6 +830,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
+		if (pte_devmap(pte))
+			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
 
@@ -917,6 +919,8 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
+	if (pmd_devmap(pmd))
+		return NULL;
 	if (is_zero_pfn(pfn))
 		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn))
@@ -1227,7 +1231,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * efficient than faulting.
 	 */
 	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-			!vma->anon_vma)
+			!vma->anon_vma && !vma_is_dax(vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(vma))
@@ -1896,12 +1900,24 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+{
+	/* these checks mirror the abort conditions in vm_normal_page */
+	if (vma->vm_flags & VM_MIXEDMAP)
+		return true;
+	if (pfn_t_devmap(pfn))
+		return true;
+	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+		return true;
+	return false;
+}
+
 static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
-	BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+	BUG_ON(!vm_mixed_ok(vma, pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6954c1435833..13f8748e7cba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -45,6 +45,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/vma.h>
 
 #include <asm/tlbflush.h>
 
@@ -2927,7 +2928,7 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 	/* Sanity check the arguments */
 	start &= PAGE_MASK;
 	end &= PAGE_MASK;
-	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+	if (!vma || vma_is_special(vma))
 		return -EINVAL;
 	if (start < vma->vm_start || start >= vma->vm_end)
 		return -EINVAL;
diff --git a/mm/mlock.c b/mm/mlock.c
index dfc6f1912176..4e20915ddfef 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -22,6 +22,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/vma.h>
 
 #include "internal.h"
 
@@ -519,8 +520,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	int lock = !!(newflags & VM_LOCKED);
 	vm_flags_t old_flags = vma->vm_flags;
 
-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
+	if (newflags == vma->vm_flags || vma_is_special(vma)
+			|| vma == get_gate_vma(current->mm))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..c28996f74320 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -45,6 +45,7 @@
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
 #include <linux/oom.h>
+#include <linux/vma.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1722,11 +1723,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
-		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
-			mm->locked_vm += (len >> PAGE_SHIFT);
-		else
+		if (vma_is_special(vma) || vma == get_gate_vma(current->mm))
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+		else
+			mm->locked_vm += (len >> PAGE_SHIFT);
 	}
 
 	if (file)

* [PATCH v3 04/13] dax: stop using VM_HUGEPAGE for dax
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Jan Kara, linux-nvdimm, linux-kernel, linux-xfs, linux-mm,
	linux-fsdevel, hch

This flag is deprecated in favor of the vma_is_dax() check in
transparent_hugepage_enabled(), added in commit baabda261424 ("mm: always
enable thp for dax mappings").
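
For reference, a paraphrased sketch (not the exact upstream macro) of why
the per-mapping flag is now redundant: the THP-enabled test short-circuits
for dax vmas, so setting VM_HUGEPAGE at mmap time buys nothing. The helper
name below is hypothetical and for illustration only.

	/* sketch: dax vmas get THP consideration regardless of VM_HUGEPAGE */
	static inline bool thp_enabled_for_vma(struct vm_area_struct *vma)
	{
		if (vma_is_dax(vma))
			return true;
		/* simplified: the real check also consults the global
		 * transparent_hugepage sysfs policy */
		return !!(vma->vm_flags & VM_HUGEPAGE);
	}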

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c |    1 -
 fs/ext4/file.c       |    1 -
 fs/xfs/xfs_file.c    |    2 --
 3 files changed, 4 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index ed79d006026e..74a35eb5e6d3 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -450,7 +450,6 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0cc9d205bd96..a54e1b4c49f9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -352,7 +352,6 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
-		vma->vm_flags |= VM_HUGEPAGE;
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c419c6fdb769..c6780743f8ec 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1133,8 +1133,6 @@ xfs_file_mmap(
 {
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
-	if (IS_DAX(file_inode(filp)))
-		vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 

* [PATCH v3 05/13] dax: stop requiring a live device for dax_flush()
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm; +Cc: linux-nvdimm, linux-kernel, linux-xfs, linux-mm, linux-fsdevel, hch

Now that dax_flush() is no longer a driver callback (commit c3ca015fab6d
"dax: remove the pmem_dax_ops->flush abstraction"), stop requiring the
dax_read_lock() to be held and the device to be alive.  This is in
preparation for switching filesystem-dax to store pfns instead of
sectors in the radix.
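
As a sketch of the caller-side simplification (illustrative only; the
writeback path reworked later in this series is the real beneficiary, and
'kaddr' and 'size' below are assumed to describe an already-mapped pmem
range):

	/* previously callers had to hold the dax read lock across the flush */
	id = dax_read_lock();
	dax_flush(dax_dev, kaddr, size);
	dax_read_unlock(id);

	/* with this patch the same flush needs no lock or liveness check */
	dax_flush(dax_dev, kaddr, size);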

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 26c324a5aef4..be65430b4483 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -280,9 +280,6 @@ EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-	if (unlikely(!dax_alive(dax_dev)))
-		return;
-
 	if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
 		return;
 

* [PATCH v3 06/13] dax: store pfns in the radix
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Jan Kara, linux-nvdimm, Matthew Wilcox, linux-kernel, linux-xfs,
	linux-mm, linux-fsdevel, hch

In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
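
The payload of a radix entry round-trips through the helpers changed below;
a minimal illustrative sketch (the WARN_ON only demonstrates the invariant,
it is not part of the patch):

	/* encode a pfn plus type flags into a locked dax radix entry */
	void *entry = dax_radix_locked_entry(pfn, RADIX_DAX_PMD);
	struct page *page = pfn_to_page(dax_radix_pfn(entry));

	/* the pfn survives the round trip, so truncate can later reach
	 * the struct page behind the mapping and inspect its refcount */
	WARN_ON(page_to_pfn(page) != pfn);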

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   75 +++++++++++++++++++++++---------------------------------------
 1 file changed, 28 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..ac6497dcfebd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -72,16 +72,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
 	return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-			((unsigned long)sector << RADIX_DAX_SHIFT) |
-			RADIX_DAX_ENTRY_LOCK);
+			(pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -525,12 +524,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
 				      struct vm_fault *vmf,
-				      void *entry, sector_t sector,
+				      void *entry, pfn_t pfn_t,
 				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *new_entry;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
 	pgoff_t index = vmf->pgoff;
+	void *new_entry;
 
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -547,7 +547,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
-	new_entry = dax_radix_locked_entry(sector, flags);
+	new_entry = dax_radix_locked_entry(pfn, flags);
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
@@ -653,17 +653,14 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-		struct dax_device *dax_dev, struct address_space *mapping,
-		pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	void *entry2, **slot, *kaddr;
-	long ret = 0, id;
-	sector_t sector;
-	pgoff_t pgoff;
+	void *entry2, **slot;
+	unsigned long pfn;
+	long ret = 0;
 	size_t size;
-	pfn_t pfn;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -682,7 +679,7 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * compare sectors as we must not bail out due to difference in lockbit
 	 * or entry type.
 	 */
-	if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+	if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
 		goto put_unlocked;
 	if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
 				dax_is_zero_entry(entry))) {
@@ -712,29 +709,11 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * 'entry'.  This allows us to flush for PMD_SIZE and not have to
 	 * worry about partial PMD writebacks.
 	 */
-	sector = dax_radix_sector(entry);
+	pfn = dax_radix_pfn(entry);
 	size = PAGE_SIZE << dax_radix_order(entry);
 
-	id = dax_read_lock();
-	ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
-	if (ret)
-		goto dax_unlock;
-
-	/*
-	 * dax_direct_access() may sleep, so cannot hold tree_lock over
-	 * its invocation.
-	 */
-	ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
-	if (ret < 0)
-		goto dax_unlock;
-
-	if (WARN_ON_ONCE(ret < size / PAGE_SIZE)) {
-		ret = -EIO;
-		goto dax_unlock;
-	}
-
-	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, kaddr, size);
+	dax_mapping_entry_mkclean(mapping, index, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -745,8 +724,6 @@ static int dax_writeback_one(struct block_device *bdev,
 	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
- dax_unlock:
-	dax_read_unlock(id);
 	put_locked_mapping_entry(mapping, index);
 	return ret;
 
@@ -804,8 +781,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 				break;
 			}
 
-			ret = dax_writeback_one(bdev, dax_dev, mapping,
-					indices[i], pvec.pages[i]);
+			ret = dax_writeback_one(dax_dev, mapping, indices[i],
+					pvec.pages[i]);
 			if (ret < 0) {
 				mapping_set_error(mapping, ret);
 				goto out;
@@ -843,7 +820,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 	}
 	dax_read_unlock(id);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
 
@@ -852,6 +829,7 @@ static int dax_insert_mapping(struct address_space *mapping,
 		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
 	else
 		return vm_insert_mixed(vma, vaddr, pfn);
+	return rc;
 }
 
 /*
@@ -869,6 +847,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 	int ret = VM_FAULT_NOPAGE;
 	struct page *zero_page;
 	void *entry2;
+	pfn_t pfn;
 
 	zero_page = ZERO_PAGE(0);
 	if (unlikely(!zero_page)) {
@@ -876,14 +855,15 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
 		goto out;
 	}
 
-	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_ZERO_PAGE);
 	if (IS_ERR(entry2)) {
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
 
-	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
+	vm_insert_mixed(vmf->vma, vaddr, pfn);
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
@@ -1250,8 +1230,7 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 		goto unlock_fallback;
 	dax_read_unlock(id);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
-			RADIX_DAX_PMD);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn, RADIX_DAX_PMD);
 	if (IS_ERR(ret))
 		goto fallback;
 
@@ -1276,13 +1255,15 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
 
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+	pfn = page_to_pfn_t(zero_page);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, pfn,
 			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE);
 	if (IS_ERR(ret))
 		goto fallback;

* [PATCH v3 07/13] dax: warn if dma collides with truncate
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Jan Kara, linux-nvdimm, Matthew Wilcox, linux-kernel, linux-xfs,
	linux-mm, linux-fsdevel, hch

Catch cases where truncate encounters pages that are still under active
dma. This warning is a canary for potential data corruption, as truncated
blocks could be allocated to a new file while the device is still
performing i/o.
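
The busy test the warning is built on is only a reference-count comparison;
roughly (a sketch, and the helper name is hypothetical, not part of the
patch):

	/*
	 * sketch: an idle devmap page sits at a refcount of 1, and only
	 * get_user_pages() takes extra references, so anything above 1 at
	 * truncate time indicates in-flight DMA against the page.
	 */
	static bool dax_page_dma_busy(struct page *page)
	{
		return page_ref_count(page) > 1;
	}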

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ac6497dcfebd..b03f547b36e7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -437,6 +437,38 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+static void dax_check_truncate(void *entry)
+{
+	unsigned long pfn = dax_radix_pfn(entry);
+	unsigned long size = dax_entry_size(entry);
+	unsigned long end_pfn;
+
+	if (!size)
+		return;
+	end_pfn = pfn + size / PAGE_SIZE;
+	for (; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * devmap pages are idle when their count is 1 and the
+		 * only path that increases their count is
+		 * get_user_pages().
+		 */
+		WARN_ONCE(page_ref_count(page) > 1,
+				"dax-dma truncate collision\n");
+	}
+}
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -452,6 +484,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_check_truncate(entry);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;

* [PATCH v3 08/13] tools/testing/nvdimm: add 'bio_delay' mechanism
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm; +Cc: linux-nvdimm, linux-kernel, linux-xfs, linux-mm, linux-fsdevel, hch

In support of testing truncate colliding with dma, add a mechanism that
delays the completion of block I/O requests by a programmable number of
seconds. This allows a truncate operation to be issued while page
references are held for direct I/O.
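
As a rough illustration (not part of this patch), a userspace
reproducer built on this mechanism could look like the sketch below.
The sysfs attribute path, device node, mount point, and sizes are all
assumptions about an nfit_test setup and are for illustration only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/falloc.h>

#define SZ (2UL << 20)		/* one PMD-sized extent */

int main(void)
{
	int sysfs, dax_fd, blk_fd;
	void *buf;

	/* arm bio_delay: hold bio completions for ~10 seconds */
	sysfs = open("/sys/bus/platform/drivers/nfit_test/bio_delay", O_WRONLY);
	if (sysfs < 0 || write(sysfs, "10", 2) < 0)
		perror("bio_delay");

	/* DAX file whose blocks will back the DMA target buffer */
	dax_fd = open("/mnt/dax/victim", O_CREAT | O_RDWR, 0644);
	ftruncate(dax_fd, SZ);
	buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);

	/* block-device read whose completion is now delayed */
	blk_fd = open("/dev/pmem0", O_RDONLY | O_DIRECT);

	if (fork() == 0) {
		/* child: direct I/O pins the DAX pages via get_user_pages() */
		if (read(blk_fd, buf, SZ) < 0)
			perror("read");
		_exit(0);
	}

	/* parent: truncate the blocks away while the DMA is in flight */
	sleep(1);
	if (fallocate(dax_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				0, SZ) < 0)
		perror("fallocate");
	wait(NULL);
	return 0;
}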

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 tools/testing/nvdimm/Kbuild           |    1 +
 tools/testing/nvdimm/test/iomap.c     |   62 +++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/test/nfit.c      |   34 ++++++++++++++++++
 tools/testing/nvdimm/test/nfit_test.h |    1 +
 4 files changed, 98 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..5946cf3afe74 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,6 +15,7 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=bio_endio
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1f5d7182ca9c 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/workqueue.h>
 #include <linux/memremap.h>
 #include <linux/rculist.h>
 #include <linux/export.h>
@@ -18,6 +19,7 @@
 #include <linux/types.h>
 #include <linux/pfn_t.h>
 #include <linux/acpi.h>
+#include <linux/bio.h>
 #include <linux/io.h>
 #include <linux/mm.h>
 #include "nfit_test.h"
@@ -388,4 +390,64 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+static DEFINE_SPINLOCK(bio_lock);
+static struct bio *biolist;
+int bio_do_queue;
+
+static void run_bio(struct work_struct *work)
+{
+	struct delayed_work *dw = container_of(work, typeof(*dw), work);
+	struct bio *bio, *next;
+
+	pr_info("%s\n", __func__);
+	spin_lock(&bio_lock);
+	bio_do_queue = 0;
+	bio = biolist;
+	biolist = NULL;
+	spin_unlock(&bio_lock);
+
+	while (bio) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		bio_endio(bio);
+		bio = next;
+	}
+	kfree(dw);
+}
+
+void nfit_test_inject_bio_delay(int sec)
+{
+	struct delayed_work *dw = kzalloc(sizeof(*dw), GFP_KERNEL);
+
+	spin_lock(&bio_lock);
+	if (!bio_do_queue) {
+		pr_info("%s: %d seconds\n", __func__, sec);
+		INIT_DELAYED_WORK(dw, run_bio);
+		bio_do_queue = 1;
+		schedule_delayed_work(dw, sec * HZ);
+		dw = NULL;
+	}
+	spin_unlock(&bio_lock);
+}
+EXPORT_SYMBOL_GPL(nfit_test_inject_bio_delay);
+
+void __wrap_bio_endio(struct bio *bio)
+{
+	int did_q = 0;
+
+	spin_lock(&bio_lock);
+	if (bio_do_queue) {
+		bio->bi_next = biolist;
+		biolist = bio;
+		did_q = 1;
+	}
+	spin_unlock(&bio_lock);
+
+	if (did_q)
+		return;
+
+	bio_endio(bio);
+}
+EXPORT_SYMBOL_GPL(__wrap_bio_endio);
+
 MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index bef419d4266d..2c871c8b4a56 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -656,6 +656,39 @@ static const struct attribute_group *nfit_test_dimm_attribute_groups[] = {
 	NULL,
 };
 
+static ssize_t bio_delay_show(struct device_driver *drv, char *buf)
+{
+	return sprintf(buf, "0\n");
+}
+
+static ssize_t bio_delay_store(struct device_driver *drv, const char *buf,
+		size_t count)
+{
+	unsigned long delay;
+	int rc = kstrtoul(buf, 0, &delay);
+
+	if (rc < 0)
+		return rc;
+
+	nfit_test_inject_bio_delay(delay);
+	return count;
+}
+DRIVER_ATTR_RW(bio_delay);
+
+static struct attribute *nfit_test_driver_attributes[] = {
+	&driver_attr_bio_delay.attr,
+	NULL,
+};
+
+static struct attribute_group nfit_test_driver_attribute_group = {
+	.attrs = nfit_test_driver_attributes,
+};
+
+static const struct attribute_group *nfit_test_driver_attribute_groups[] = {
+	&nfit_test_driver_attribute_group,
+	NULL,
+};
+
 static int nfit_test0_alloc(struct nfit_test *t)
 {
 	size_t nfit_size = sizeof(struct acpi_nfit_system_address) * NUM_SPA
@@ -1905,6 +1938,7 @@ static struct platform_driver nfit_test_driver = {
 	.remove = nfit_test_remove,
 	.driver = {
 		.name = KBUILD_MODNAME,
+		.groups = nfit_test_driver_attribute_groups,
 	},
 	.id_table = nfit_test_id,
 };
diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h
index d3d63dd5ed38..0d818d2adaf7 100644
--- a/tools/testing/nvdimm/test/nfit_test.h
+++ b/tools/testing/nvdimm/test/nfit_test.h
@@ -46,4 +46,5 @@ void nfit_test_setup(nfit_test_lookup_fn lookup,
 		nfit_test_evaluate_dsm_fn evaluate);
 void nfit_test_teardown(void);
 struct nfit_test_resource *get_nfit_res(resource_size_t resource);
+void nfit_test_inject_bio_delay(int sec);
 #endif

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 09/13] IB/core: disable memory registration of filesystem-dax vmas
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Jason Gunthorpe, Doug Ledford, linux-nvdimm, linux-rdma,
	linux-kernel, Hal Rosenstock, linux-xfs, linux-mm, linux-fsdevel,
	Sean Hefty, hch

Until there is a solution to the dma-to-dax vs truncate problem, it is
not safe to allow RDMA to create long-standing memory registrations
against filesystem-dax vmas. Device-dax vmas do not have this problem
and are explicitly allowed.

This restriction is temporary until a "memory registration with
layout-lease" mechanism can be implemented, and it is limited to
non-ODP (On Demand Paging) capable RDMA devices.
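
For illustration only (not part of this patch): on non-ODP hardware the
user-visible effect is that registering a buffer mmap()ed from a
filesystem-dax file is refused, while device-dax and regular memory
continue to work. A minimal libibverbs sketch, where the file path,
length, and error handling are placeholder assumptions:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	size_t len = 1UL << 21;
	int fd = open("/mnt/dax/file", O_RDWR);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	struct ibv_mr *mr;

	mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		/* expected to fail here once this patch is applied */
		printf("reg_mr: %s\n", strerror(errno));
	else
		ibv_dereg_mr(mr);
	return 0;
}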

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: <linux-rdma@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |   49 +++++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..c30d286c1f24 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -147,19 +147,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to protect against
+	 * registering filesystem-dax memory. Otherwise we can tolerate
+	 * a failure to allocate the vma_list and just assume that all
+	 * vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_FS_DAX))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +201,34 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			struct vm_area_struct *vma;
+			struct inode *inode;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			/* device-dax is safe for rdma... */
+			inode = file_inode(vma->vm_file);
+			if (inode->i_mode == S_IFCHR)
+				continue;
+
+			/* ...filesystem-dax is not. */
+			ret = -EOPNOTSUPP;
+			goto out;
 		}
 
 		/* preparing for next loop */
@@ -242,6 +263,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 10/13] mm: disable get_user_pages_fast() for dax
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, linux-nvdimm, Dave Hansen, linux-kernel, linux-xfs,
	linux-mm, linux-fsdevel, hch, Kirill A. Shutemov

In preparation for solving the dax-dma vs truncate race, disable
get_user_pages_fast(). The race fix relies on the vma being available.

We can still support get_user_pages_fast() for 1GB (pud) 'devmap'
mappings since those are only implemented for device-dax; everything
else needs the vma and the gup slow path in case it might be a
filesystem-dax mapping.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/gup.c |   48 +++++++++++++-----------------------------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b2b4d4263768..308be897d22a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1290,22 +1290,12 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 }
 #endif
 
-static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
-{
-	while ((*nr) - nr_start) {
-		struct page *page = pages[--(*nr)];
-
-		ClearPageReferenced(page);
-		put_page(page);
-	}
-}
-
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 int write, struct page **pages, int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
-	int nr_start = *nr, ret = 0;
+	int ret = 0;
 	pte_t *ptep, *ptem;
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
@@ -1323,13 +1313,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (!pte_access_permitted(pte, write))
 			goto pte_unmap;
 
-		if (pte_devmap(pte)) {
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
-				goto pte_unmap;
-			}
-		} else if (pte_special(pte))
+		if (pte_devmap(pte) || (pte_special(pte)))
 			goto pte_unmap;
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -1378,6 +1362,16 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 #endif /* __HAVE_ARCH_PTE_SPECIAL */
 
 #if defined(__HAVE_ARCH_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
+{
+	while ((*nr) - nr_start) {
+		struct page *page = pages[--(*nr)];
+
+		ClearPageReferenced(page);
+		put_page(page);
+	}
+}
+
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
@@ -1402,15 +1396,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 	return 1;
 }
 
-static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
-{
-	unsigned long fault_pfn;
-
-	fault_pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
-}
-
 static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
@@ -1420,13 +1405,6 @@ static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
 	return __gup_device_huge(fault_pfn, addr, end, pages, nr);
 }
 #else
-static int __gup_device_huge_pmd(pmd_t pmd, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
-{
-	BUILD_BUG();
-	return 0;
-}
-
 static int __gup_device_huge_pud(pud_t pud, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr)
 {
@@ -1445,7 +1423,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	if (pmd_devmap(orig))
-		return __gup_device_huge_pmd(orig, addr, end, pages, nr);
+		return 0;
 
 	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 11/13] fs: use smp_load_acquire in break_{layout,lease}
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:39   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:39 UTC (permalink / raw)
  To: akpm
  Cc: J. Bruce Fields, linux-nvdimm, linux-kernel, linux-xfs, linux-mm,
	Alexander Viro, linux-fsdevel, Jeff Layton, hch

Commit 128a37852234 "fs: fix data races on inode->i_flctx" converted
checks of inode->i_flctx to use smp_load_acquire(), but it did not
convert break_layout(). smp_load_acquire() includes a READ_ONCE().
There should be no functional difference since __break_lease() repeats
the sequence, but this is a cleanup to unify all ->i_flctx lookups on a
common pattern.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..eace2c5396a7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2281,8 +2281,9 @@ static inline int break_lease(struct inode *inode, unsigned int mode)
 	 * could end up racing with tasks trying to set a new lease on this
 	 * file.
 	 */
-	smp_mb();
-	if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
+	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
+
+	if (ctx && !list_empty_careful(&ctx->flc_lease))
 		return __break_lease(inode, mode, FL_LEASE);
 	return 0;
 }
@@ -2325,8 +2326,9 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
 
 static inline int break_layout(struct inode *inode, bool wait)
 {
-	smp_mb();
-	if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
+	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
+
+	if (ctx && !list_empty_careful(&ctx->flc_lease))
 		return __break_lease(inode,
 				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
 				FL_LAYOUT);

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20  2:38 ` Dan Williams
  (?)
@ 2017-10-20  2:40   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:40 UTC (permalink / raw)
  To: akpm
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner,
	linux-kernel, hch, J. Bruce Fields, linux-mm, Alexander Viro,
	linux-fsdevel, Jeff Layton, Darrick J. Wong, linux-nvdimm

get_user_pages() pins file-backed memory pages for access by dma
devices. However, it only pins the memory pages, not the page-to-file
offset association. If a file is truncated, the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency of the file vs dma, but the
assumption is that if userspace wants the file-space truncated it does
not matter what data is inbound from the device; it is not relevant
anymore.

The assumptions of the truncate-page-cache model are broken by DAX where
the target DMA page *is* the filesystem block. Leaving the page pinned
for DMA, but truncating the file block out of the file, means that the
filesystem is free to reallocate a block under active DMA to another
file!

Here are some possible options for fixing this situation ('truncate' and
'fallocate(punch hole)' are synonymous below):

    1/ Fail truncate while any file blocks might be under dma

    2/ Block (sleep-wait) truncate while any file blocks might be under
       dma

    3/ Remap file blocks to a "lost+found"-like file-inode where
       dma can continue and we might see what inbound data from DMA was
       mapped out of the original file. Blocks in this file could be
       freed back to the filesystem when dma eventually ends.

    4/ Disable dax until option 3 or another long term solution has been
       implemented. However, filesystem-dax is still marked experimental
       for concerns like this.

Option 1 will throw failures where userspace has never expected them
before, option 2 might hang the truncating process indefinitely, and
option 3 requires per-filesystem enabling to remap blocks from one inode
to another. Option 2 is implemented in this patch for the DAX path with
the expectation that non-transient users of get_user_pages() (RDMA) are
disallowed from setting up dax mappings and that the potential delay
introduced to the truncate path is acceptable compared to the response
time of the page cache case. This can only be seen as a stop-gap until
we can solve the problem of safely sequestering unallocated filesystem
blocks under active dma.

The solution introduces a new FL_ALLOCATED lease to pin the allocated
blocks in a dax file while dma might be accessing them. It behaves
identically to an FL_LAYOUT lease save for the fact that it is
immediately scheduled to be reaped, and that the only path that waits
for its removal is the truncate path. We cannot reuse FL_LAYOUT directly
since that would deadlock in the case where userspace did a direct-I/O
operation with a target buffer backed by an mmap range of the same file.
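
As a purely illustrative sketch (not one of this patch's hunks), a
get_user_pages() call site could bracket the pin with this lease
roughly as follows; the function name and surrounding gup details are
made up, while __dax_truncate_lease() and dax_lease_set_pages() are the
helpers added below:

static long pin_dax_pages(struct vm_area_struct *vma, unsigned long start,
		long nr_pages, unsigned int gup_flags, struct page **pages)
{
	struct dax_lease *dl;
	long pinned;

	/* take the FL_ALLOCATED lease before any page references exist */
	dl = __dax_truncate_lease(vma, nr_pages);
	if (IS_ERR(dl))
		return PTR_ERR(dl);

	pinned = get_user_pages(start, nr_pages, gup_flags, pages, NULL);

	/*
	 * Publish the pinned pages so the lease's reap work can poll
	 * their reference counts; dax_lease_set_pages() tolerates a
	 * NULL (non-dax) lease and unlocks immediately when nothing
	 * was pinned.
	 */
	dax_lease_set_pages(dl, pages, pinned);
	return pinned;
}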

Credit / inspiration for option 3 goes to Dave Hansen, who proposed
something similar as an alternative way to solve the problem that
MAP_DIRECT was trying to solve.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig          |    1 
 fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/locks.c          |   17 ++++-
 include/linux/dax.h |   23 ++++++
 include/linux/fs.h  |   22 +++++-
 mm/gup.c            |   27 ++++++-
 6 files changed, 268 insertions(+), 10 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
+	depends on FILE_LOCKING
 	depends on !(ARM || MIPS || SPARC)
 	select FS_IOMAP
 	select DAX
diff --git a/fs/dax.c b/fs/dax.c
index b03f547b36e7..e0a3958fc5f2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -22,6 +22,7 @@
 #include <linux/genhd.h>
 #include <linux/highmem.h>
 #include <linux/memcontrol.h>
+#include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pagevec.h>
@@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	}
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+enum dax_lease_flags {
+	DAX_LEASE_PAGES,
+	DAX_LEASE_BREAK,
+};
+
+struct dax_lease {
+	struct page **dl_pages;
+	unsigned long dl_nr_pages;
+	unsigned long dl_state;
+	struct file *dl_file;
+	atomic_t dl_count;
+	/*
+	 * Once the lease is taken and the pages have references we
+	 * start the reap_work to poll for lease release while acquiring
+	 * fs locks that synchronize with truncate. So, either reap_work
+	 * cleans up the dax_lease instances or truncate itself.
+	 *
+	 * The break_work sleepily polls for DMA completion and then
+	 * unlocks/removes the lease.
+	 */
+	struct delayed_work dl_reap_work;
+	struct delayed_work dl_break_work;
+};
+
+static void put_dax_lease(struct dax_lease *dl)
+{
+	if (atomic_dec_and_test(&dl->dl_count)) {
+		fput(dl->dl_file);
+		kfree(dl);
+	}
+}
+
+static void dax_lease_unlock_one(struct work_struct *work)
+{
+	struct dax_lease *dl = container_of(work, typeof(*dl),
+			dl_break_work.work);
+	unsigned long i;
+
+	/* wait for the gup path to finish recording pages in the lease */
+	if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
+		schedule_delayed_work(&dl->dl_break_work, HZ);
+		return;
+	}
+
+	/* barrier pairs with dax_lease_set_pages() */
+	smp_mb__after_atomic();
+
+	/*
+	 * If we see all pages idle at least once we can remove the
+	 * lease. If we happen to race with someone else taking a
+	 * reference on a page they will have their own lease to protect
+	 * against truncate.
+	 */
+	for (i = 0; i < dl->dl_nr_pages; i++)
+		if (page_ref_count(dl->dl_pages[i]) > 1) {
+			schedule_delayed_work(&dl->dl_break_work, HZ);
+			return;
+		}
+	vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
+	put_dax_lease(dl);
+}
+
+static void dax_lease_reap_all(struct work_struct *work)
+{
+	struct dax_lease *dl = container_of(work, typeof(*dl),
+			dl_reap_work.work);
+	struct file *file = dl->dl_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+
+	if (mapping->a_ops->dax_flush_dma) {
+		mapping->a_ops->dax_flush_dma(inode);
+	} else {
+		/* FIXME: dax-filesystem needs to add dax-dma support */
+		break_allocated(inode, true);
+	}
+	put_dax_lease(dl);
+}
+
+static bool dax_lease_lm_break(struct file_lock *fl)
+{
+	struct dax_lease *dl = fl->fl_owner;
+
+	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
+		atomic_inc(&dl->dl_count);
+		schedule_delayed_work(&dl->dl_break_work, HZ);
+	}
+
+	/* Tell the core lease code to wait for delayed work completion */
+	fl->fl_break_time = 0;
+
+	return false;
+}
+
+static int dax_lease_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	struct dax_lease *dl;
+	int rc;
+
+	WARN_ON(!(arg & F_UNLCK));
+	dl = fl->fl_owner;
+	rc = lease_modify(fl, arg, dispose);
+	put_dax_lease(dl);
+	return rc;
+}
+
+static const struct lock_manager_operations dax_lease_lm_ops = {
+	.lm_break = dax_lease_lm_break,
+	.lm_change = dax_lease_lm_change,
+};
+
+struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct dax_lease *dl;
+	struct file_lock *fl;
+	int rc = -ENOMEM;
+
+	if (!vma_is_dax(vma))
+		return NULL;
+
+	/* device-dax can not be truncated */
+	if (!S_ISREG(inode->i_mode))
+		return NULL;
+
+	dl = kzalloc(sizeof(*dl) + sizeof(struct page *) * nr_pages, GFP_KERNEL);
+	if (!dl)
+		return ERR_PTR(-ENOMEM);
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	dl->dl_pages = (struct page **)(dl + 1);
+	INIT_DELAYED_WORK(&dl->dl_break_work, dax_lease_unlock_one);
+	INIT_DELAYED_WORK(&dl->dl_reap_work, dax_lease_reap_all);
+	dl->dl_file = get_file(file);
+	/* need dl alive until dax_lease_set_pages() and final put */
+	atomic_set(&dl->dl_count, 2);
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &dax_lease_lm_ops;
+	fl->fl_flags = FL_ALLOCATED;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = dl;
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(fl->fl_file, fl->fl_type, &fl, (void **) &dl);
+	if (rc)
+		goto err_setlease;
+	return dl;
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	kfree(dl);
+	return ERR_PTR(rc);
+}
+
+void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
+		long nr_pages)
+{
+	if (IS_ERR_OR_NULL(dl))
+		return;
+
+	if (nr_pages <= 0) {
+		dl->dl_nr_pages = 0;
+		smp_mb__before_atomic();
+		set_bit(DAX_LEASE_PAGES, &dl->dl_state);
+		vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
+		flush_delayed_work(&dl->dl_break_work);
+		put_dax_lease(dl);
+		return;
+	}
+
+	dl->dl_nr_pages = nr_pages;
+	memcpy(dl->dl_pages, pages, sizeof(struct page *) * nr_pages);
+	smp_mb__before_atomic();
+	set_bit(DAX_LEASE_PAGES, &dl->dl_state);
+	queue_delayed_work(system_long_wq, &dl->dl_reap_work, HZ);
+}
+EXPORT_SYMBOL_GPL(dax_lease_set_pages);
diff --git a/fs/locks.c b/fs/locks.c
index 1bd71c4d663a..0a7841590b35 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -135,7 +135,7 @@
 
 #define IS_POSIX(fl)	(fl->fl_flags & FL_POSIX)
 #define IS_FLOCK(fl)	(fl->fl_flags & FL_FLOCK)
-#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
+#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT|FL_ALLOCATED))
 #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
 #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
 
@@ -1414,7 +1414,9 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
 
 static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
 {
-	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
+	/* FL_LAYOUT and FL_ALLOCATED only conflict with each other */
+	if (!!(breaker->fl_flags & (FL_LAYOUT|FL_ALLOCATED))
+			!= !!(lease->fl_flags & (FL_LAYOUT|FL_ALLOCATED)))
 		return false;
 	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
 		return false;
@@ -1653,7 +1655,7 @@ check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
 	int ret = 0;
 	struct inode *inode = dentry->d_inode;
 
-	if (flags & FL_LAYOUT)
+	if (flags & (FL_LAYOUT|FL_ALLOCATED))
 		return 0;
 
 	if ((arg == F_RDLCK) &&
@@ -1733,6 +1735,15 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
 		 */
 		if (arg == F_WRLCK)
 			goto out;
+
+		/*
+		 * Taking out a new FL_ALLOCATED lease while a previous
+		 * one is being locked is expected since each instance
+		 * may be responsible for a distinct range of pages.
+		 */
+		if (fl->fl_flags & FL_ALLOCATED)
+			continue;
+
 		/*
 		 * Modifying our existing lease is OK, but no getting a
 		 * new lease if someone else is opening for write:
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 122197124b9d..3ff61dc6241e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
+struct dax_lease;
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
+struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages);
+void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
+		long nr_pages);
 #else
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
@@ -111,8 +116,26 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 {
 	return -ENXIO;
 }
+static inline struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	return NULL;
+}
+
+static inline void dax_lease_set_pages(struct dax_lease *dl,
+		struct page **pages, long nr_pages)
+{
+}
 #endif
 
+static inline struct dax_lease *dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	if (!vma_is_dax(vma))
+		return NULL;
+	return __dax_truncate_lease(vma, nr_pages);
+}
+
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eace2c5396a7..a3ed74833919 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -371,6 +371,9 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	/* dax dma support */
+	void (*dax_flush_dma)(struct inode *inode);
 };
 
 extern const struct address_space_operations empty_aops;
@@ -927,6 +930,7 @@ static inline struct file *get_file(struct file *f)
 #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
 #define FL_LAYOUT	2048	/* outstanding pNFS layout */
+#define FL_ALLOCATED	4096	/* pin allocated dax blocks against dma */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
@@ -2324,17 +2328,27 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
 	return ret;
 }
 
-static inline int break_layout(struct inode *inode, bool wait)
+static inline int __break_layout(struct inode *inode, bool wait,
+		unsigned int type)
 {
 	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
 
 	if (ctx && !list_empty_careful(&ctx->flc_lease))
 		return __break_lease(inode,
 				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
-				FL_LAYOUT);
+				type);
 	return 0;
 }
 
+static inline int break_layout(struct inode *inode, bool wait)
+{
+	return __break_layout(inode, wait, FL_LAYOUT);
+}
+
+static inline int break_allocated(struct inode *inode, bool wait)
+{
+	return __break_layout(inode, wait, FL_LAYOUT|FL_ALLOCATED);
+}
 #else /* !CONFIG_FILE_LOCKING */
 static inline int break_lease(struct inode *inode, unsigned int mode)
 {
@@ -2362,6 +2376,10 @@ static inline int break_layout(struct inode *inode, bool wait)
 	return 0;
 }
 
+static inline int break_allocated(struct inode *inode, bool wait)
+{
+	return 0;
+}
 #endif /* CONFIG_FILE_LOCKING */
 
 /* fs/open.c */
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..6a7cf371e656 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -9,6 +9,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/dax.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -640,9 +641,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, result = 0;
+	int dax_lease_once = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
+	struct dax_lease *dax_lease = NULL;
 
 	if (!nr_pages)
 		return 0;
@@ -693,6 +696,14 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		if (pages && !dax_lease_once) {
+			dax_lease_once = 1;
+			dax_lease = dax_truncate_lease(vma, nr_pages);
+			if (IS_ERR(dax_lease)) {
+				result = PTR_ERR(dax_lease);
+				goto out;
+			}
+		}
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
 			int ret;
@@ -704,9 +715,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				result = i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				result = i;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -718,7 +731,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			result = i ? i : PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -738,7 +752,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+	result = i;
+out:
+	dax_lease_set_pages(dax_lease, pages, result);
+	return result;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 13/13] xfs: wire up FL_ALLOCATED support
@ 2017-10-20  2:40   ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:40 UTC (permalink / raw)
  To: akpm
  Cc: J. Bruce Fields, Jan Kara, linux-nvdimm, Darrick J. Wong,
	Dave Chinner, linux-kernel, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, hch

Before xfs can be sure that it is safe to truncate, it needs to hold
XFS_MMAPLOCK_EXCL and flush any FL_ALLOCATED leases.  Introduce
xfs_break_allocated(), modeled after xfs_break_layouts(), for use in the
file space deletion path.
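
Condensed from the xfs_file_fallocate() and xfs_ioc_space() hunks below,
the intended call pattern in a space deletion path is roughly the
following sketch (offset/len come from the caller):

    uint	iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

    xfs_ilock(ip, iolock);
    /* may cycle the locks; *iolock reflects what is held on return */
    error = xfs_break_allocated(inode, &iolock);
    if (!error)
            error = xfs_free_file_space(ip, offset, len);
    xfs_iunlock(ip, iolock);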

We also add a new address_space_operations callback, ->dax_flush_dma(),
so the fs/dax core can coordinate reaping these leases when there is no
active truncate operation to do so.
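
For a filesystem other than xfs the wiring would look roughly like the
sketch below; the foofs names and the lock are hypothetical, the only
requirements being to take whatever lock serializes with truncate and
then reap the leases via break_allocated():

    static void foofs_dax_flush_dma(struct inode *inode)
    {
            /* hypothetical per-inode lock that serializes with truncate */
            down_write(&FOOFS_I(inode)->i_mmap_lock);
            break_allocated(inode, true);
            up_write(&FOOFS_I(inode)->i_mmap_lock);
    }

    const struct address_space_operations foofs_aops = {
            /* ... existing ops ... */
            .dax_flush_dma  = foofs_dax_flush_dma,
    };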

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_aops.c  |   24 ++++++++++++++++++++
 fs/xfs/xfs_file.c  |   64 ++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_inode.h |    1 +
 fs/xfs/xfs_ioctl.c |    7 ++----
 4 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f18e5932aec4..00da08d0d6db 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1455,6 +1455,29 @@ xfs_vm_set_page_dirty(
 	return newly_dirty;
 }
 
+/*
+ * Reap any in-flight FL_ALLOCATED leases once the pages they cover are
+ * no longer under dma. We hold XFS_MMAPLOCK_EXCL to
+ * synchronize with the file space deletion path that may be doing the
+ * same operation.
+ */
+static void
+xfs_vm_dax_flush_dma(
+	struct inode		*inode)
+{
+	uint			iolock = XFS_MMAPLOCK_EXCL;
+
+	/*
+	 * try to catch cases where the inode dax mode was changed
+	 * without first synchronizing leases
+	 */
+	WARN_ON_ONCE(!IS_DAX(inode));
+
+	xfs_ilock(XFS_I(inode), iolock);
+	xfs_break_allocated(inode, &iolock);
+	xfs_iunlock(XFS_I(inode), iolock);
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
@@ -1468,4 +1491,5 @@ const struct address_space_operations xfs_address_space_operations = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.dax_flush_dma		= xfs_vm_dax_flush_dma,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..5bc72f1da301 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,6 +40,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include <linux/dax.h>
 #include <linux/dcache.h>
 #include <linux/falloc.h>
 #include <linux/pagevec.h>
@@ -746,6 +747,39 @@ xfs_file_write_iter(
 	return ret;
 }
 
+/*
+ * DAX breaks the traditional truncate model that assumes in-flight DMA
+ * to a file-backed page can continue until the final put of the page
+ * regardless of that page's relationship to the file. In the case of
+ * DAX the page has a 1:1 relationship with filesystem blocks. We need to
+ * hold off truncate while any DMA might be in-flight. This assumes that
+ * all DMA usage is transient; any non-transient usage of
+ * get_user_pages() must be disallowed for DAX files.
+ *
+ * This also unlocks FL_LAYOUT leases.
+ */
+int
+xfs_break_allocated(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL
+				| XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL));
+
+	while ((error = break_allocated(inode, false)) == -EWOULDBLOCK) {
+		xfs_iunlock(ip, *iolock);
+		error = break_allocated(inode, true);
+		*iolock &= ~(XFS_MMAPLOCK_SHARED|XFS_IOLOCK_SHARED);
+		*iolock |= XFS_MMAPLOCK_EXCL|XFS_IOLOCK_EXCL;
+		xfs_ilock(ip, *iolock);
+	}
+
+	return error;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
@@ -762,7 +796,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -772,13 +806,10 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_allocated(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
@@ -1136,6 +1167,28 @@ xfs_file_mmap(
 	return 0;
 }
 
+/*
+ * Any manipulation of FL_ALLOCATED leases needs to be coordinated with
+ * XFS_MMAPLOCK_EXCL to synchronize get_user_pages() + DMA vs truncate.
+ */
+static int
+xfs_file_setlease(
+	struct file		*filp,
+	long			arg,
+	struct file_lock	**flp,
+	void			**priv)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode 	*ip = XFS_I(inode);
+	uint			iolock = XFS_MMAPLOCK_EXCL;
+	int			error;
+
+	xfs_ilock(ip, iolock);
+	error = generic_setlease(filp, arg, flp, priv);
+	xfs_iunlock(ip, iolock);
+	return error;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
@@ -1154,6 +1207,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.clone_file_range = xfs_file_clone_range,
 	.dedupe_file_range = xfs_file_dedupe_range,
+	.setlease	= xfs_file_setlease,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..e0d421884fe4 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -445,6 +445,7 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
+int	xfs_break_allocated(struct inode *inode, uint *iolock);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..5be60c74bede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -642,13 +642,10 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_allocated(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [PATCH v3 13/13] xfs: wire up FL_ALLOCATED support
@ 2017-10-20  2:40   ` Dan Williams
  0 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20  2:40 UTC (permalink / raw)
  To: akpm
  Cc: linux-xfs, Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner,
	linux-kernel, hch, J. Bruce Fields, linux-mm, Jeff Moyer,
	linux-fsdevel, Jeff Layton, Ross Zwisler

Before xfs can be sure that it is safe to truncate it needs to hold
XFS_MMAP_LOCK_EXCL and flush any FL_ALLOCATED leases.  Introduce
xfs_break_allocated() modeled after xfs_break_layouts() for use in the
file space deletion path.

We also use a new address_space_operation for the fs/dax core to
coordinate reaping these leases in the case where there is no active
truncate process to reap them.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_aops.c  |   24 ++++++++++++++++++++
 fs/xfs/xfs_file.c  |   64 ++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_inode.h |    1 +
 fs/xfs/xfs_ioctl.c |    7 ++----
 4 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f18e5932aec4..00da08d0d6db 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1455,6 +1455,29 @@ xfs_vm_set_page_dirty(
 	return newly_dirty;
 }
 
+/*
+ * Reap any in-flight FL_ALLOCATE leases when the pages represented by
+ * that lease are no longer under dma. We hold XFS_MMAPLOCK_EXCL to
+ * synchronize with the file space deletion path that may be doing the
+ * same operation.
+ */
+static void
+xfs_vm_dax_flush_dma(
+	struct inode		*inode)
+{
+	uint			iolock = XFS_MMAPLOCK_EXCL;
+
+	/*
+	 * try to catch cases where the inode dax mode was changed
+	 * without first synchronizing leases
+	 */
+	WARN_ON_ONCE(!IS_DAX(inode));
+
+	xfs_ilock(XFS_I(inode), iolock);
+	xfs_break_allocated(inode, &iolock);
+	xfs_iunlock(XFS_I(inode), iolock);
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
@@ -1468,4 +1491,5 @@ const struct address_space_operations xfs_address_space_operations = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.dax_flush_dma		= xfs_vm_dax_flush_dma,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..5bc72f1da301 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,6 +40,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include <linux/dax.h>
 #include <linux/dcache.h>
 #include <linux/falloc.h>
 #include <linux/pagevec.h>
@@ -746,6 +747,39 @@ xfs_file_write_iter(
 	return ret;
 }
 
+/*
+ * DAX breaks the traditional truncate model that assumes in-flight DMA
+ * to a file-backed page can continue until the final put of the page
+ * regardless of that page's relationship to the file. In the case of
+ * DAX the page has 1:1 relationship with filesytem blocks. We need to
+ * hold off truncate while any DMA might be in-flight. This assumes that
+ * all DMA usage is transient, any non-transient usages of
+ * get_user_pages must be disallowed for DAX files.
+ *
+ * This also unlocks FL_LAYOUT leases.
+ */
+int
+xfs_break_allocated(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL
+				| XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL));
+
+	while ((error = break_allocated(inode, false) == -EWOULDBLOCK)) {
+		xfs_iunlock(ip, *iolock);
+		error = break_allocated(inode, true);
+		*iolock &= ~XFS_MMAPLOCK_SHARED|XFS_IOLOCK_SHARED;
+		*iolock |= XFS_MMAPLOCK_EXCL|XFS_IOLOCK_EXCL;
+		xfs_ilock(ip, *iolock);
+	}
+
+	return error;
+}
+
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
@@ -762,7 +796,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -772,13 +806,10 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_allocated(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
@@ -1136,6 +1167,28 @@ xfs_file_mmap(
 	return 0;
 }
 
+/*
+ * Any manipulation of FL_ALLOCATED leases needs to be coordinated with
+ * XFS_MMAPLOCK_EXCL to synchronize get_user_pages() + DMA vs truncate.
+ */
+static int
+xfs_file_setlease(
+	struct file		*filp,
+	long			arg,
+	struct file_lock	**flp,
+	void			**priv)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode 	*ip = XFS_I(inode);
+	uint			iolock = XFS_MMAPLOCK_EXCL;
+	int			error;
+
+	xfs_ilock(ip, iolock);
+	error = generic_setlease(filp, arg, flp, priv);
+	xfs_iunlock(ip, iolock);
+	return error;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
@@ -1154,6 +1207,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.clone_file_range = xfs_file_clone_range,
 	.dedupe_file_range = xfs_file_dedupe_range,
+	.setlease	= xfs_file_setlease,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..e0d421884fe4 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -445,6 +445,7 @@ int	xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 		     xfs_fsize_t isize, bool *did_zeroing);
 int	xfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
 		bool *did_zero);
+int	xfs_break_allocated(struct inode *inode, uint *iolock);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..5be60c74bede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 	int			error;
 
 	/*
@@ -642,13 +642,10 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
+	error = xfs_break_allocated(inode, &iolock);
 	if (error)
 		goto out_unlock;
 
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-20  2:38 ` Dan Williams
  (?)
  (?)
@ 2017-10-20  7:47   ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-20  7:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Jeff Layton, hch, Matthew Wilcox, linux-rdma, Michael Ellerman,
	Jason Gunthorpe, Doug Ledford, Hal Rosenstock, Sean Hefty,
	Dave Chinner, linux-fsdevel, Alexander Viro, Gerald Schaefer,
	linux-nvdimm, linux-kernel, linux-xfs, Martin Schwidefsky, akpm,
	Darrick J. Wong, Kirill A. Shutemov

> The solution presented is not pretty. It creates a stream of leases, one
> for each get_user_pages() invocation, and polls page reference counts
> until DMA stops. We're missing a reliable way to not only trap the
> DMA-idle event, but also block new references being taken on pages while
> truncate is allowed to progress. "[PATCH v3 12/13] dax: handle truncate of
> dma-busy pages" presents other options considered, and notes that this
> solution can only be viewed as a stop-gap.

I'd like to brainstorm how we can do something better.

How about:

If we hit a page with an elevated refcount in truncate / hole punch
etc for a DAX file system we do not free the blocks in the file system,
but add it to the extent busy list.  We mark the page as delayed
free (e.g. page flag?) so that when it finally hits refcount zero we
call back into the file system to remove it from the busy list.
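
Roughly, something like the sketch below.  All of the names here
(PG_dax_busy, dax_busy_list_add, ->dax_free_busy_block, etc.) are made
up for illustration; nothing like this exists in the posted series or
upstream yet:

/* truncate / hole punch side: don't free a block whose page is pinned */
static void dax_truncate_block(struct inode *inode, struct page *page,
			       sector_t block)
{
	if (page_ref_count(page) > 1) {
		/* hypothetical page flag: free this block later, not now */
		set_bit(PG_dax_busy, &page->flags);
		dax_busy_list_add(inode, block);	/* hypothetical */
	} else {
		dax_free_block(inode, block);		/* hypothetical */
	}
}

/* final put side: refcount hit zero, hand the block back to the fs */
static void dax_page_idle(struct inode *inode, struct page *page,
			  sector_t block)
{
	if (test_and_clear_bit(PG_dax_busy, &page->flags))
		inode->i_mapping->a_ops->dax_free_busy_block(inode, block);
}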

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20  2:39   ` Dan Williams
@ 2017-10-20  7:57     ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-20  7:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jan Kara, linux-nvdimm, Benjamin Herrenschmidt,
	Heiko Carstens, linux-kernel, linux-xfs, linux-mm, Jeff Moyer,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	linux-fsdevel, Ross Zwisler, hch, Gerald Schaefer

> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -172,6 +172,7 @@ static size_t axon_ram_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  
>  static const struct dax_operations axon_ram_dax_ops = {
>  	.direct_access = axon_ram_dax_direct_access,
> +
>  	.copy_from_iter = axon_ram_copy_from_iter,

Unrelated whitespace change.  That being said - I don't think axonram has
devmap support in any form, so this basically becomes dead code, doesn't
it?

> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 7abb240847c0..e7e5db07e339 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -52,6 +52,7 @@ static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
>  
>  static const struct dax_operations dcssblk_dax_ops = {
>  	.direct_access = dcssblk_dax_direct_access,
> +
>  	.copy_from_iter = dcssblk_dax_copy_from_iter,

Same comments apply here.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-20  7:47   ` Christoph Hellwig
  (?)
@ 2017-10-20  9:31     ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-20  9:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Michal Hocko, Jan Kara, Benjamin Herrenschmidt,
	Dave Hansen, Dave Chinner, J. Bruce Fields, linux-mm,
	Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox,
	linux-rdma, Michael Ellerman, Jeff Moyer, hch, Jason Gunthorpe,
	Doug Ledford, Ross Zwisler, Hal Rosenstock, Heiko Carstens,
	linux-nvdimm, Alexander Viro, Gerald Schaefer, Darrick J. Wong,
	linux-kernel, linux-xfs, Martin Schwidefsky, linux-fsdevel,
	Kirill A. Shutemov

On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> I'd like to brainstorm how we can do something better.
> 
> How about:
> 
> If we hit a page with an elevated refcount in truncate / hole puch
> etc for a DAX file system we do not free the blocks in the file system,
> but add it to the extent busy list.  We mark the page as delayed
> free (e.g. page flag?) so that when it finally hits refcount zero we
> call back into the file system to remove it from the busy list.

Brainstorming some more:

Given that on a DAX file there shouldn't be any long-term page
references after we unmap it from the page table and don't allow
get_user_pages calls why not wait for the references for all
DAX pages to go away first?  E.g. if we find a DAX page in
truncate_inode_pages_range that has an elevated refcount we set
a new flag to prevent new references from showing up, and then
simply wait for it to go away.  Instead of a busy wait we can
do this through a few hashed waitqueues in dev_pagemap.  And in
fact put_zone_device_page already gets called when putting the
last page so we can handle the wakeup from there.

In fact if we can't find a page flag for the stop-new-callers
thing we could probably come up with a way to do that through
dev_pagemap somehow, but I'm not sure how efficient that would
be.
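
Wait-side sketch of what I mean (dax_busy_waitqueue() and the
PG_dax_frozen flag are invented names, just to show the shape of it):

/* truncate side: block until a mapped-out DAX page goes idle */
static void dax_wait_page_idle(struct page *page)
{
	/* hypothetical: a hashed waitqueue hanging off dev_pagemap */
	wait_queue_head_t *wq = dax_busy_waitqueue(page);

	/* hypothetical flag: refuse new references from gup / fault */
	set_bit(PG_dax_frozen, &page->flags);

	/* ZONE_DEVICE pages idle at a refcount of 1 */
	wait_event(*wq, page_ref_count(page) == 1);
}

/* called from put_zone_device_page() when the last reference drops */
static void dax_wake_page_idle(struct page *page)
{
	wake_up(dax_busy_waitqueue(page));
}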


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 11/13] fs: use smp_load_acquire in break_{layout,lease}
  2017-10-20  2:39   ` Dan Williams
  (?)
  (?)
@ 2017-10-20 12:39     ` Jeffrey Layton
  -1 siblings, 0 replies; 143+ messages in thread
From: Jeffrey Layton @ 2017-10-20 12:39 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: J. Bruce Fields, linux-nvdimm, linux-kernel, linux-xfs, linux-mm,
	Alexander Viro, linux-fsdevel, hch

On Thu, 2017-10-19 at 19:39 -0700, Dan Williams wrote:
> Commit 128a37852234 "fs: fix data races on inode->i_flctx" converted
> checks of inode->i_flctx to use smp_load_acquire(), but it did not
> convert break_layout(). smp_load_acquire() includes a READ_ONCE(). There
> should be no functional difference since __break_lease repeats the
> sequence, but this is a clean up to unify all ->i_flctx lookups on a
> common pattern.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/fs.h |   10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 13dab191a23e..eace2c5396a7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2281,8 +2281,9 @@ static inline int break_lease(struct inode *inode, unsigned int mode)
>  	 * could end up racing with tasks trying to set a new lease on this
>  	 * file.
>  	 */
> -	smp_mb();
> -	if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
> +	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
> +
> +	if (ctx && !list_empty_careful(&ctx->flc_lease))
>  		return __break_lease(inode, mode, FL_LEASE);
>  	return 0;
>  }
> @@ -2325,8 +2326,9 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
>  
>  static inline int break_layout(struct inode *inode, bool wait)
>  {
> -	smp_mb();
> -	if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease))
> +	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
> +
> +	if (ctx && !list_empty_careful(&ctx->flc_lease))
>  		return __break_lease(inode,
>  				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
>  				FL_LAYOUT);
> 

Nice catch. This can go in independently of the rest of the patches in
the series, I think. I'll assume Andrew is picking this up since he's in
the "To:", but let me know if you need me to get it.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20  2:40   ` Dan Williams
  (?)
@ 2017-10-20 13:05     ` Jeff Layton
  -1 siblings, 0 replies; 143+ messages in thread
From: Jeff Layton @ 2017-10-20 13:05 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner,
	linux-kernel, hch, J. Bruce Fields, linux-mm, Alexander Viro,
	linux-fsdevel, Darrick J. Wong, linux-nvdimm

On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
> get_user_pages() pins file backed memory pages for access by dma
> devices. However, it only pins the memory pages not the page-to-file
> offset association. If a file is truncated the pages are mapped out of
> the file and dma may continue indefinitely into a page that is owned by
> a device driver. This breaks coherency of the file vs dma, but the
> assumption is that if userspace wants the file-space truncated it does
> not matter what data is inbound from the device, it is not relevant
> anymore.
> 
> The assumptions of the truncate-page-cache model are broken by DAX where
> the target DMA page *is* the filesystem block. Leaving the page pinned
> for DMA, but truncating the file block out of the file, means that the
> filesystem is free to reallocate a block under active DMA to another
> file!
> 
> Here are some possible options for fixing this situation ('truncate' and
> 'fallocate(punch hole)' are synonymous below):
> 
>     1/ Fail truncate while any file blocks might be under dma
> 
>     2/ Block (sleep-wait) truncate while any file blocks might be under
>        dma
> 
>     3/ Remap file blocks to a "lost+found"-like file-inode where
>        dma can continue and we might see what inbound data from DMA was
>        mapped out of the original file. Blocks in this file could be
>        freed back to the filesystem when dma eventually ends.
> 
>     4/ Disable dax until option 3 or another long term solution has been
>        implemented. However, filesystem-dax is still marked experimental
>        for concerns like this.
> 
> Option 1 will throw failures where userspace has never expected them
> before, option 2 might hang the truncating process indefinitely, and
> option 3 requires per filesystem enabling to remap blocks from one inode
> to another.  Option 2 is implemented in this patch for the DAX path with
> the expectation that non-transient users of get_user_pages() (RDMA) are
> disallowed from setting up dax mappings and that the potential delay
> introduced to the truncate path is acceptable compared to the response
> time of the page cache case. This can only be seen as a stop-gap until
> we can solve the problem of safely sequestering unallocated filesystem
> blocks under active dma.
> 

FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
more work though, so no objection to this as a stop-gap fix.


> The solution introduces a new FL_ALLOCATED lease to pin the allocated
> blocks in a dax file while dma might be accessing them. It behaves
> identically to an FL_LAYOUT lease save for the fact that it is
> immediately scheduled to be reaped, and that the only path that waits for
> its removal is the truncate path. We can not reuse FL_LAYOUT directly
> since that would deadlock in the case where userspace did a direct-I/O
> operation with a target buffer backed by an mmap range of the same file.
> 
> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
> something similar as an alternative way to solve the problem that
> MAP_DIRECT was trying to solve.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/Kconfig          |    1 
>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/locks.c          |   17 ++++-
>  include/linux/dax.h |   23 ++++++
>  include/linux/fs.h  |   22 +++++-
>  mm/gup.c            |   27 ++++++-
>  6 files changed, 268 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7aee6d699fd6..a7b31a96a753 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>  config FS_DAX
>  	bool "Direct Access (DAX) support"
>  	depends on MMU
> +	depends on FILE_LOCKING
>  	depends on !(ARM || MIPS || SPARC)
>  	select FS_IOMAP
>  	select DAX
> diff --git a/fs/dax.c b/fs/dax.c
> index b03f547b36e7..e0a3958fc5f2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -22,6 +22,7 @@
>  #include <linux/genhd.h>
>  #include <linux/highmem.h>
>  #include <linux/memcontrol.h>
> +#include <linux/file.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
>  #include <linux/pagevec.h>
> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  	}
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +enum dax_lease_flags {
> +	DAX_LEASE_PAGES,
> +	DAX_LEASE_BREAK,
> +};
> +
> +struct dax_lease {
> +	struct page **dl_pages;
> +	unsigned long dl_nr_pages;
> +	unsigned long dl_state;
> +	struct file *dl_file;
> +	atomic_t dl_count;
> +	/*
> +	 * Once the lease is taken and the pages have references we
> +	 * start the reap_work to poll for lease release while acquiring
> +	 * fs locks that synchronize with truncate. So, either reap_work
> +	 * cleans up the dax_lease instances or truncate itself.
> +	 *
> +	 * The break_work sleepily polls for DMA completion and then
> +	 * unlocks/removes the lease.
> +	 */
> +	struct delayed_work dl_reap_work;
> +	struct delayed_work dl_break_work;
> +};
> +
> +static void put_dax_lease(struct dax_lease *dl)
> +{
> +	if (atomic_dec_and_test(&dl->dl_count)) {
> +		fput(dl->dl_file);
> +		kfree(dl);
> +	}
> +}

Any reason not to use the new refcount_t type for dl_count? Seems like a
good place for it.
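
A minimal sketch of that conversion, reusing the field names from the
patch (otherwise untested):

/* with dl_count switched from atomic_t to refcount_t */
static void put_dax_lease(struct dax_lease *dl)
{
	if (refcount_dec_and_test(&dl->dl_count)) {
		fput(dl->dl_file);
		kfree(dl);
	}
}

The atomic_set(&dl->dl_count, 2) in __dax_truncate_lease() would become
refcount_set(&dl->dl_count, 2), and the atomic_inc() in
dax_lease_lm_break() would become refcount_inc().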

> +
> +static void dax_lease_unlock_one(struct work_struct *work)
> +{
> +	struct dax_lease *dl = container_of(work, typeof(*dl),
> +			dl_break_work.work);
> +	unsigned long i;
> +
> +	/* wait for the gup path to finish recording pages in the lease */
> +	if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
> +		schedule_delayed_work(&dl->dl_break_work, HZ);
> +		return;
> +	}
> +
> +	/* barrier pairs with dax_lease_set_pages() */
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * If we see all pages idle at least once we can remove the
> +	 * lease. If we happen to race with someone else taking a
> +	 * reference on a page they will have their own lease to protect
> +	 * against truncate.
> +	 */
> +	for (i = 0; i < dl->dl_nr_pages; i++)
> +		if (page_ref_count(dl->dl_pages[i]) > 1) {
> +			schedule_delayed_work(&dl->dl_break_work, HZ);
> +			return;
> +		}
> +	vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
> +	put_dax_lease(dl);
> +}
> +
> +static void dax_lease_reap_all(struct work_struct *work)
> +{
> +	struct dax_lease *dl = container_of(work, typeof(*dl),
> +			dl_reap_work.work);
> +	struct file *file = dl->dl_file;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	if (mapping->a_ops->dax_flush_dma) {
> +		mapping->a_ops->dax_flush_dma(inode);
> +	} else {
> +		/* FIXME: dax-filesystem needs to add dax-dma support */
> +		break_allocated(inode, true);
> +	}
> +	put_dax_lease(dl);
> +}
> +
> +static bool dax_lease_lm_break(struct file_lock *fl)
> +{
> +	struct dax_lease *dl = fl->fl_owner;
> +
> +	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
> +		atomic_inc(&dl->dl_count);
> +		schedule_delayed_work(&dl->dl_break_work, HZ);
> +	}
> +

I haven't gone over this completely, but what prevents you from doing a
0->1 transition on the dl_count here, and possibly doing a
use-after-free?

Ahh ok...I guess we know that we hold a reference since this is on the
flc_lease list? Fair enough. Still, might be worth a comment there as to
why that's safe.


> +	/* Tell the core lease code to wait for delayed work completion */
> +	fl->fl_break_time = 0;
> +
> +	return false;
> +}
> +
> +static int dax_lease_lm_change(struct file_lock *fl, int arg,
> +		struct list_head *dispose)
> +{
> +	struct dax_lease *dl;
> +	int rc;
> +
> +	WARN_ON(!(arg & F_UNLCK));
> +	dl = fl->fl_owner;
> +	rc = lease_modify(fl, arg, dispose);
> +	put_dax_lease(dl);
> +	return rc;
> +}
> +
> +static const struct lock_manager_operations dax_lease_lm_ops = {
> +	.lm_break = dax_lease_lm_break,
> +	.lm_change = dax_lease_lm_change,
> +};
> +
> +struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file_inode(file);
> +	struct dax_lease *dl;
> +	struct file_lock *fl;
> +	int rc = -ENOMEM;
> +
> +	if (!vma_is_dax(vma))
> +		return NULL;
> +
> +	/* device-dax can not be truncated */
> +	if (!S_ISREG(inode->i_mode))
> +		return NULL;
> +
> +	dl = kzalloc(sizeof(*dl) + sizeof(struct page *) * nr_pages, GFP_KERNEL);
> +	if (!dl)
> +		return ERR_PTR(-ENOMEM);
> +
> +	fl = locks_alloc_lock();
> +	if (!fl)
> +		goto err_lock_alloc;
> +
> +	dl->dl_pages = (struct page **)(dl + 1);
> +	INIT_DELAYED_WORK(&dl->dl_break_work, dax_lease_unlock_one);
> +	INIT_DELAYED_WORK(&dl->dl_reap_work, dax_lease_reap_all);
> +	dl->dl_file = get_file(file);
> +	/* need dl alive until dax_lease_set_pages() and final put */
> +	atomic_set(&dl->dl_count, 2);
> +
> +	locks_init_lock(fl);
> +	fl->fl_lmops = &dax_lease_lm_ops;
> +	fl->fl_flags = FL_ALLOCATED;
> +	fl->fl_type = F_RDLCK;
> +	fl->fl_end = OFFSET_MAX;
> +	fl->fl_owner = dl;
> +	fl->fl_pid = current->tgid;
> +	fl->fl_file = file;
> +
> +	rc = vfs_setlease(fl->fl_file, fl->fl_type, &fl, (void **) &dl);
> +	if (rc)
> +		goto err_setlease;
> +	return dl;
> +err_setlease:
> +	locks_free_lock(fl);
> +err_lock_alloc:
> +	kfree(dl);
> +	return ERR_PTR(rc);
> +}
> +
> +void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
> +		long nr_pages)
> +{
> +	if (IS_ERR_OR_NULL(dl))
> +		return;
> +
> +	if (nr_pages <= 0) {
> +		dl->dl_nr_pages = 0;
> +		smp_mb__before_atomic();
> +		set_bit(DAX_LEASE_PAGES, &dl->dl_state);
> +		vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
> +		flush_delayed_work(&dl->dl_break_work);
> +		put_dax_lease(dl);
> +		return;
> +	}
> +
> +	dl->dl_nr_pages = nr_pages;
> +	memcpy(dl->dl_pages, pages, sizeof(struct page *) * nr_pages);
> +	smp_mb__before_atomic();
> +	set_bit(DAX_LEASE_PAGES, &dl->dl_state);
> +	queue_delayed_work(system_long_wq, &dl->dl_reap_work, HZ);
> +}
> +EXPORT_SYMBOL_GPL(dax_lease_set_pages);
> diff --git a/fs/locks.c b/fs/locks.c
> index 1bd71c4d663a..0a7841590b35 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -135,7 +135,7 @@
>  
>  #define IS_POSIX(fl)	(fl->fl_flags & FL_POSIX)
>  #define IS_FLOCK(fl)	(fl->fl_flags & FL_FLOCK)
> -#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
> +#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT|FL_ALLOCATED))
>  #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
>  #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
>  
> @@ -1414,7 +1414,9 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>  
>  static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
>  {
> -	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
> +	/* FL_LAYOUT and FL_ALLOCATED only conflict with each other */
> +	if (!!(breaker->fl_flags & (FL_LAYOUT|FL_ALLOCATED))
> +			!= !!(lease->fl_flags & (FL_LAYOUT|FL_ALLOCATED)))
>  		return false;
>  	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
>  		return false;
> @@ -1653,7 +1655,7 @@ check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
>  	int ret = 0;
>  	struct inode *inode = dentry->d_inode;
>  
> -	if (flags & FL_LAYOUT)
> +	if (flags & (FL_LAYOUT|FL_ALLOCATED))
>  		return 0;
>  
>  	if ((arg == F_RDLCK) &&
> @@ -1733,6 +1735,15 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
>  		 */
>  		if (arg == F_WRLCK)
>  			goto out;
> +
> +		/*
> +		 * Taking out a new FL_ALLOCATED lease while a previous
> +		 * one is being locked is expected since each instance
> +		 * may be responsible for a distinct range of pages.
> +		 */
> +		if (fl->fl_flags & FL_ALLOCATED)
> +			continue;
> +
>  		/*
>  		 * Modifying our existing lease is OK, but no getting a
>  		 * new lease if someone else is opening for write:
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 122197124b9d..3ff61dc6241e 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -100,10 +100,15 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  
> +struct dax_lease;
>  #ifdef CONFIG_FS_DAX
>  int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
>  		unsigned int offset, unsigned int length);
> +struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages);
> +void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
> +		long nr_pages);
>  #else
>  static inline int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
> @@ -111,8 +116,26 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
>  {
>  	return -ENXIO;
>  }
> +static inline struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	return NULL;
> +}
> +
> +static inline void dax_lease_set_pages(struct dax_lease *dl,
> +		struct page **pages, long nr_pages)
> +{
> +}
>  #endif
>  
> +static inline struct dax_lease *dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	if (!vma_is_dax(vma))
> +		return NULL;
> +	return __dax_truncate_lease(vma, nr_pages);
> +}
> +
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index eace2c5396a7..a3ed74833919 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -371,6 +371,9 @@ struct address_space_operations {
>  	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
>  				sector_t *span);
>  	void (*swap_deactivate)(struct file *file);
> +
> +	/* dax dma support */
> +	void (*dax_flush_dma)(struct inode *inode);
>  };
>  
>  extern const struct address_space_operations empty_aops;
> @@ -927,6 +930,7 @@ static inline struct file *get_file(struct file *f)
>  #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>  #define FL_LAYOUT	2048	/* outstanding pNFS layout */
> +#define FL_ALLOCATED	4096	/* pin allocated dax blocks against dma */
>  
>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>  
> @@ -2324,17 +2328,27 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
>  	return ret;
>  }
>  
> -static inline int break_layout(struct inode *inode, bool wait)
> +static inline int __break_layout(struct inode *inode, bool wait,
> +		unsigned int type)
>  {
>  	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
>  
>  	if (ctx && !list_empty_careful(&ctx->flc_lease))
>  		return __break_lease(inode,
>  				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
> -				FL_LAYOUT);
> +				type);
>  	return 0;
>  }
>  
> +static inline int break_layout(struct inode *inode, bool wait)
> +{
> +	return __break_layout(inode, wait, FL_LAYOUT);
> +}
> +
> +static inline int break_allocated(struct inode *inode, bool wait)
> +{
> +	return __break_layout(inode, wait, FL_LAYOUT|FL_ALLOCATED);
> +}
>  #else /* !CONFIG_FILE_LOCKING */
>  static inline int break_lease(struct inode *inode, unsigned int mode)
>  {
> @@ -2362,6 +2376,10 @@ static inline int break_layout(struct inode *inode, bool wait)
>  	return 0;
>  }
>  
> +static inline int break_allocated(struct inode *inode, bool wait)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_FILE_LOCKING */
>  
>  /* fs/open.c */
> diff --git a/mm/gup.c b/mm/gup.c
> index 308be897d22a..6a7cf371e656 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -9,6 +9,7 @@
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/dax.h>
>  
>  #include <linux/sched/signal.h>
>  #include <linux/rwsem.h>
> @@ -640,9 +641,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *nonblocking)
>  {
> -	long i = 0;
> +	long i = 0, result = 0;
> +	int dax_lease_once = 0;
>  	unsigned int page_mask;
>  	struct vm_area_struct *vma = NULL;
> +	struct dax_lease *dax_lease = NULL;
>  
>  	if (!nr_pages)
>  		return 0;
> @@ -693,6 +696,14 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (unlikely(fatal_signal_pending(current)))
>  			return i ? i : -ERESTARTSYS;
>  		cond_resched();
> +		if (pages && !dax_lease_once) {
> +			dax_lease_once = 1;
> +			dax_lease = dax_truncate_lease(vma, nr_pages);
> +			if (IS_ERR(dax_lease)) {
> +				result = PTR_ERR(dax_lease);
> +				goto out;
> +			}
> +		}
>  		page = follow_page_mask(vma, start, foll_flags, &page_mask);
>  		if (!page) {
>  			int ret;
> @@ -704,9 +715,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			case -EFAULT:
>  			case -ENOMEM:
>  			case -EHWPOISON:
> -				return i ? i : ret;
> +				result = i ? i : ret;
> +				goto out;
>  			case -EBUSY:
> -				return i;
> +				result = i;
> +				goto out;
>  			case -ENOENT:
>  				goto next_page;
>  			}
> @@ -718,7 +731,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			 */
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
> -			return i ? i : PTR_ERR(page);
> +			result = i ? i : PTR_ERR(page);
> +			goto out;
>  		}
>  		if (pages) {
>  			pages[i] = page;
> @@ -738,7 +752,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		start += page_increm * PAGE_SIZE;
>  		nr_pages -= page_increm;
>  	} while (nr_pages);
> -	return i;
> +	result = i;
> +out:
> +	dax_lease_set_pages(dax_lease, pages, result);
> +	return result;
>  }
>  
>  static bool vma_permits_fault(struct vm_area_struct *vma,
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
@ 2017-10-20 13:05     ` Jeff Layton
  0 siblings, 0 replies; 143+ messages in thread
From: Jeff Layton @ 2017-10-20 13:05 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner,
	linux-kernel, J. Bruce Fields, linux-mm, Jeff Moyer,
	Alexander Viro, linux-fsdevel, Darrick J. Wong, Ross Zwisler,
	linux-xfs, hch, linux-nvdimm

On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
> get_user_pages() pins file backed memory pages for access by dma
> devices. However, it only pins the memory pages not the page-to-file
> offset association. If a file is truncated the pages are mapped out of
> the file and dma may continue indefinitely into a page that is owned by
> a device driver. This breaks coherency of the file vs dma, but the
> assumption is that if userspace wants the file-space truncated it does
> not matter what data is inbound from the device, it is not relevant
> anymore.
> 
> The assumptions of the truncate-page-cache model are broken by DAX where
> the target DMA page *is* the filesystem block. Leaving the page pinned
> for DMA, but truncating the file block out of the file, means that the
> filesystem is free to reallocate a block under active DMA to another
> file!
> 
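To make that failure mode concrete, here is a minimal userspace sketch of
the race; the file path, device node, and buffer size are invented for
illustration and error handling is omitted:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical filesystem-DAX file and O_DIRECT source device */
		int fd  = open("/mnt/dax/file", O_RDWR);
		int src = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
		size_t len = 1 << 20;

		/* the read buffer *is* DAX blocks of the file itself */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);

		/*
		 * The block layer pins buf's pages with get_user_pages()
		 * and the device DMAs data into them.
		 */
		read(src, buf, len);

		/*
		 * If another thread punches a hole over [0, len) while that
		 * DMA is still in flight, the filesystem can free and then
		 * reallocate the still-busy blocks to another file.
		 */
		return 0;
	}
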
> Here are some possible options for fixing this situation ('truncate' and
> 'fallocate(punch hole)' are synonymous below):
> 
>     1/ Fail truncate while any file blocks might be under dma
> 
>     2/ Block (sleep-wait) truncate while any file blocks might be under
>        dma
> 
>     3/ Remap file blocks to a "lost+found"-like file-inode where
>        dma can continue and we might see what inbound data from DMA was
>        mapped out of the original file. Blocks in this file could be
>        freed back to the filesystem when dma eventually ends.
> 
>     4/ Disable dax until option 3 or another long term solution has been
>        implemented. However, filesystem-dax is still marked experimental
>        for concerns like this.
> 
> Option 1 will throw failures where userspace has never expected them
> before, option 2 might hang the truncating process indefinitely, and
> option 3 requires per filesystem enabling to remap blocks from one inode
> to another.  Option 2 is implemented in this patch for the DAX path with
> the expectation that non-transient users of get_user_pages() (RDMA) are
> disallowed from setting up dax mappings and that the potential delay
> introduced to the truncate path is acceptable compared to the response
> time of the page cache case. This can only be seen as a stop-gap until
> we can solve the problem of safely sequestering unallocated filesystem
> blocks under active dma.
> 

FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
more work though, so no objection to this as a stop-gap fix.


> The solution introduces a new FL_ALLOCATED lease to pin the allocated
> blocks in a dax file while dma might be accessing them. It behaves
> identically to an FL_LAYOUT lease save for the fact that it is
> immediately scheduled to be reaped, and that the only path that waits for
> its removal is the truncate path. We can not reuse FL_LAYOUT directly
> since that would deadlock in the case where userspace did a direct-I/O
> operation with a target buffer backed by an mmap range of the same file.
> 
> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
> something similar as an alternative way to solve the problem that
> MAP_DIRECT was trying to solve.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/Kconfig          |    1 
>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/locks.c          |   17 ++++-
>  include/linux/dax.h |   23 ++++++
>  include/linux/fs.h  |   22 +++++-
>  mm/gup.c            |   27 ++++++-
>  6 files changed, 268 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7aee6d699fd6..a7b31a96a753 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>  config FS_DAX
>  	bool "Direct Access (DAX) support"
>  	depends on MMU
> +	depends on FILE_LOCKING
>  	depends on !(ARM || MIPS || SPARC)
>  	select FS_IOMAP
>  	select DAX
> diff --git a/fs/dax.c b/fs/dax.c
> index b03f547b36e7..e0a3958fc5f2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -22,6 +22,7 @@
>  #include <linux/genhd.h>
>  #include <linux/highmem.h>
>  #include <linux/memcontrol.h>
> +#include <linux/file.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
>  #include <linux/pagevec.h>
> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  	}
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +enum dax_lease_flags {
> +	DAX_LEASE_PAGES,
> +	DAX_LEASE_BREAK,
> +};
> +
> +struct dax_lease {
> +	struct page **dl_pages;
> +	unsigned long dl_nr_pages;
> +	unsigned long dl_state;
> +	struct file *dl_file;
> +	atomic_t dl_count;
> +	/*
> +	 * Once the lease is taken and the pages have references we
> +	 * start the reap_work to poll for lease release while acquiring
> +	 * fs locks that synchronize with truncate. So, either reap_work
> +	 * cleans up the dax_lease instances or truncate itself.
> +	 *
> +	 * The break_work sleepily polls for DMA completion and then
> +	 * unlocks/removes the lease.
> +	 */
> +	struct delayed_work dl_reap_work;
> +	struct delayed_work dl_break_work;
> +};
> +
> +static void put_dax_lease(struct dax_lease *dl)
> +{
> +	if (atomic_dec_and_test(&dl->dl_count)) {
> +		fput(dl->dl_file);
> +		kfree(dl);
> +	}
> +}

Any reason not to use the new refcount_t type for dl_count? Seems like a
good place for it.
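Just to sketch it (this is not part of the posted patch, and assumes the
stock <linux/refcount.h> helpers), the conversion could look roughly like:

	#include <linux/refcount.h>

	struct dax_lease {
		struct page **dl_pages;
		unsigned long dl_nr_pages;
		unsigned long dl_state;
		struct file *dl_file;
		refcount_t dl_count;		/* was atomic_t */
		struct delayed_work dl_reap_work;
		struct delayed_work dl_break_work;
	};

	static void put_dax_lease(struct dax_lease *dl)
	{
		if (refcount_dec_and_test(&dl->dl_count)) {
			fput(dl->dl_file);
			kfree(dl);
		}
	}

	/* in __dax_truncate_lease(): lease ref + dax_lease_set_pages() ref */
		refcount_set(&dl->dl_count, 2);

	/* in dax_lease_lm_break(): would also WARN if this were ever 0->1 */
		refcount_inc(&dl->dl_count);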

> +
> +static void dax_lease_unlock_one(struct work_struct *work)
> +{
> +	struct dax_lease *dl = container_of(work, typeof(*dl),
> +			dl_break_work.work);
> +	unsigned long i;
> +
> +	/* wait for the gup path to finish recording pages in the lease */
> +	if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
> +		schedule_delayed_work(&dl->dl_break_work, HZ);
> +		return;
> +	}
> +
> +	/* barrier pairs with dax_lease_set_pages() */
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * If we see all pages idle at least once we can remove the
> +	 * lease. If we happen to race with someone else taking a
> +	 * reference on a page they will have their own lease to protect
> +	 * against truncate.
> +	 */
> +	for (i = 0; i < dl->dl_nr_pages; i++)
> +		if (page_ref_count(dl->dl_pages[i]) > 1) {
> +			schedule_delayed_work(&dl->dl_break_work, HZ);
> +			return;
> +		}
> +	vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
> +	put_dax_lease(dl);
> +}
> +
> +static void dax_lease_reap_all(struct work_struct *work)
> +{
> +	struct dax_lease *dl = container_of(work, typeof(*dl),
> +			dl_reap_work.work);
> +	struct file *file = dl->dl_file;
> +	struct inode *inode = file_inode(file);
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	if (mapping->a_ops->dax_flush_dma) {
> +		mapping->a_ops->dax_flush_dma(inode);
> +	} else {
> +		/* FIXME: dax-filesystem needs to add dax-dma support */
> +		break_allocated(inode, true);
> +	}
> +	put_dax_lease(dl);
> +}
> +
> +static bool dax_lease_lm_break(struct file_lock *fl)
> +{
> +	struct dax_lease *dl = fl->fl_owner;
> +
> +	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
> +		atomic_inc(&dl->dl_count);
> +		schedule_delayed_work(&dl->dl_break_work, HZ);
> +	}
> +

I haven't gone over this completely, but what prevents you from doing a
0->1 transition on the dl_count here, and possibly doing a
use-after-free?

Ahh ok...I guess we know that we hold a reference since this is on the
flc_lease list? Fair enough. Still, might be worth a comment there as to
why that's safe.


> +	/* Tell the core lease code to wait for delayed work completion */
> +	fl->fl_break_time = 0;
> +
> +	return false;
> +}
> +
> +static int dax_lease_lm_change(struct file_lock *fl, int arg,
> +		struct list_head *dispose)
> +{
> +	struct dax_lease *dl;
> +	int rc;
> +
> +	WARN_ON(!(arg & F_UNLCK));
> +	dl = fl->fl_owner;
> +	rc = lease_modify(fl, arg, dispose);
> +	put_dax_lease(dl);
> +	return rc;
> +}
> +
> +static const struct lock_manager_operations dax_lease_lm_ops = {
> +	.lm_break = dax_lease_lm_break,
> +	.lm_change = dax_lease_lm_change,
> +};
> +
> +struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file_inode(file);
> +	struct dax_lease *dl;
> +	struct file_lock *fl;
> +	int rc = -ENOMEM;
> +
> +	if (!vma_is_dax(vma))
> +		return NULL;
> +
> +	/* device-dax can not be truncated */
> +	if (!S_ISREG(inode->i_mode))
> +		return NULL;
> +
> +	dl = kzalloc(sizeof(*dl) + sizeof(struct page *) * nr_pages, GFP_KERNEL);
> +	if (!dl)
> +		return ERR_PTR(-ENOMEM);
> +
> +	fl = locks_alloc_lock();
> +	if (!fl)
> +		goto err_lock_alloc;
> +
> +	dl->dl_pages = (struct page **)(dl + 1);
> +	INIT_DELAYED_WORK(&dl->dl_break_work, dax_lease_unlock_one);
> +	INIT_DELAYED_WORK(&dl->dl_reap_work, dax_lease_reap_all);
> +	dl->dl_file = get_file(file);
> +	/* need dl alive until dax_lease_set_pages() and final put */
> +	atomic_set(&dl->dl_count, 2);
> +
> +	locks_init_lock(fl);
> +	fl->fl_lmops = &dax_lease_lm_ops;
> +	fl->fl_flags = FL_ALLOCATED;
> +	fl->fl_type = F_RDLCK;
> +	fl->fl_end = OFFSET_MAX;
> +	fl->fl_owner = dl;
> +	fl->fl_pid = current->tgid;
> +	fl->fl_file = file;
> +
> +	rc = vfs_setlease(fl->fl_file, fl->fl_type, &fl, (void **) &dl);
> +	if (rc)
> +		goto err_setlease;
> +	return dl;
> +err_setlease:
> +	locks_free_lock(fl);
> +err_lock_alloc:
> +	kfree(dl);
> +	return ERR_PTR(rc);
> +}
> +
> +void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
> +		long nr_pages)
> +{
> +	if (IS_ERR_OR_NULL(dl))
> +		return;
> +
> +	if (nr_pages <= 0) {
> +		dl->dl_nr_pages = 0;
> +		smp_mb__before_atomic();
> +		set_bit(DAX_LEASE_PAGES, &dl->dl_state);
> +		vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
> +		flush_delayed_work(&dl->dl_break_work);
> +		put_dax_lease(dl);
> +		return;
> +	}
> +
> +	dl->dl_nr_pages = nr_pages;
> +	memcpy(dl->dl_pages, pages, sizeof(struct page *) * nr_pages);
> +	smp_mb__before_atomic();
> +	set_bit(DAX_LEASE_PAGES, &dl->dl_state);
> +	queue_delayed_work(system_long_wq, &dl->dl_reap_work, HZ);
> +}
> +EXPORT_SYMBOL_GPL(dax_lease_set_pages);
> diff --git a/fs/locks.c b/fs/locks.c
> index 1bd71c4d663a..0a7841590b35 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -135,7 +135,7 @@
>  
>  #define IS_POSIX(fl)	(fl->fl_flags & FL_POSIX)
>  #define IS_FLOCK(fl)	(fl->fl_flags & FL_FLOCK)
> -#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
> +#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT|FL_ALLOCATED))
>  #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
>  #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
>  
> @@ -1414,7 +1414,9 @@ static void time_out_leases(struct inode *inode, struct list_head *dispose)
>  
>  static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
>  {
> -	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
> +	/* FL_LAYOUT and FL_ALLOCATED only conflict with each other */
> +	if (!!(breaker->fl_flags & (FL_LAYOUT|FL_ALLOCATED))
> +			!= !!(lease->fl_flags & (FL_LAYOUT|FL_ALLOCATED)))
>  		return false;
>  	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
>  		return false;
> @@ -1653,7 +1655,7 @@ check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
>  	int ret = 0;
>  	struct inode *inode = dentry->d_inode;
>  
> -	if (flags & FL_LAYOUT)
> +	if (flags & (FL_LAYOUT|FL_ALLOCATED))
>  		return 0;
>  
>  	if ((arg == F_RDLCK) &&
> @@ -1733,6 +1735,15 @@ generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
>  		 */
>  		if (arg == F_WRLCK)
>  			goto out;
> +
> +		/*
> +		 * Taking out a new FL_ALLOCATED lease while a previous
> +		 * one is being locked is expected since each instance
> +		 * may be responsible for a distinct range of pages.
> +		 */
> +		if (fl->fl_flags & FL_ALLOCATED)
> +			continue;
> +
>  		/*
>  		 * Modifying our existing lease is OK, but no getting a
>  		 * new lease if someone else is opening for write:
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 122197124b9d..3ff61dc6241e 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -100,10 +100,15 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  
> +struct dax_lease;
>  #ifdef CONFIG_FS_DAX
>  int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
>  		unsigned int offset, unsigned int length);
> +struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages);
> +void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
> +		long nr_pages);
>  #else
>  static inline int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
> @@ -111,8 +116,26 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
>  {
>  	return -ENXIO;
>  }
> +static inline struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	return NULL;
> +}
> +
> +static inline void dax_lease_set_pages(struct dax_lease *dl,
> +		struct page **pages, long nr_pages)
> +{
> +}
>  #endif
>  
> +static inline struct dax_lease *dax_truncate_lease(struct vm_area_struct *vma,
> +		long nr_pages)
> +{
> +	if (!vma_is_dax(vma))
> +		return NULL;
> +	return __dax_truncate_lease(vma, nr_pages);
> +}
> +
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index eace2c5396a7..a3ed74833919 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -371,6 +371,9 @@ struct address_space_operations {
>  	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
>  				sector_t *span);
>  	void (*swap_deactivate)(struct file *file);
> +
> +	/* dax dma support */
> +	void (*dax_flush_dma)(struct inode *inode);
>  };
>  
>  extern const struct address_space_operations empty_aops;
> @@ -927,6 +930,7 @@ static inline struct file *get_file(struct file *f)
>  #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>  #define FL_LAYOUT	2048	/* outstanding pNFS layout */
> +#define FL_ALLOCATED	4096	/* pin allocated dax blocks against dma */
>  
>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>  
> @@ -2324,17 +2328,27 @@ static inline int break_deleg_wait(struct inode **delegated_inode)
>  	return ret;
>  }
>  
> -static inline int break_layout(struct inode *inode, bool wait)
> +static inline int __break_layout(struct inode *inode, bool wait,
> +		unsigned int type)
>  {
>  	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
>  
>  	if (ctx && !list_empty_careful(&ctx->flc_lease))
>  		return __break_lease(inode,
>  				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
> -				FL_LAYOUT);
> +				type);
>  	return 0;
>  }
>  
> +static inline int break_layout(struct inode *inode, bool wait)
> +{
> +	return __break_layout(inode, wait, FL_LAYOUT);
> +}
> +
> +static inline int break_allocated(struct inode *inode, bool wait)
> +{
> +	return __break_layout(inode, wait, FL_LAYOUT|FL_ALLOCATED);
> +}
>  #else /* !CONFIG_FILE_LOCKING */
>  static inline int break_lease(struct inode *inode, unsigned int mode)
>  {
> @@ -2362,6 +2376,10 @@ static inline int break_layout(struct inode *inode, bool wait)
>  	return 0;
>  }
>  
> +static inline int break_allocated(struct inode *inode, bool wait)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_FILE_LOCKING */
>  
>  /* fs/open.c */
> diff --git a/mm/gup.c b/mm/gup.c
> index 308be897d22a..6a7cf371e656 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -9,6 +9,7 @@
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/dax.h>
>  
>  #include <linux/sched/signal.h>
>  #include <linux/rwsem.h>
> @@ -640,9 +641,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *nonblocking)
>  {
> -	long i = 0;
> +	long i = 0, result = 0;
> +	int dax_lease_once = 0;
>  	unsigned int page_mask;
>  	struct vm_area_struct *vma = NULL;
> +	struct dax_lease *dax_lease = NULL;
>  
>  	if (!nr_pages)
>  		return 0;
> @@ -693,6 +696,14 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		if (unlikely(fatal_signal_pending(current)))
>  			return i ? i : -ERESTARTSYS;
>  		cond_resched();
> +		if (pages && !dax_lease_once) {
> +			dax_lease_once = 1;
> +			dax_lease = dax_truncate_lease(vma, nr_pages);
> +			if (IS_ERR(dax_lease)) {
> +				result = PTR_ERR(dax_lease);
> +				goto out;
> +			}
> +		}
>  		page = follow_page_mask(vma, start, foll_flags, &page_mask);
>  		if (!page) {
>  			int ret;
> @@ -704,9 +715,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			case -EFAULT:
>  			case -ENOMEM:
>  			case -EHWPOISON:
> -				return i ? i : ret;
> +				result = i ? i : ret;
> +				goto out;
>  			case -EBUSY:
> -				return i;
> +				result = i;
> +				goto out;
>  			case -ENOENT:
>  				goto next_page;
>  			}
> @@ -718,7 +731,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			 */
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
> -			return i ? i : PTR_ERR(page);
> +			result = i ? i : PTR_ERR(page);
> +			goto out;
>  		}
>  		if (pages) {
>  			pages[i] = page;
> @@ -738,7 +752,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		start += page_increm * PAGE_SIZE;
>  		nr_pages -= page_increm;
>  	} while (nr_pages);
> -	return i;
> +	result = i;
> +out:
> +	dax_lease_set_pages(dax_lease, pages, result);
> +	return result;
>  }
>  
>  static bool vma_permits_fault(struct vm_area_struct *vma,
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20  7:57     ` Christoph Hellwig
@ 2017-10-20 15:23       ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel,
	Andrew Morton, Gerald Schaefer

On Fri, Oct 20, 2017 at 12:57 AM, Christoph Hellwig <hch@lst.de> wrote:
>> --- a/arch/powerpc/sysdev/axonram.c
>> +++ b/arch/powerpc/sysdev/axonram.c
>> @@ -172,6 +172,7 @@ static size_t axon_ram_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>
>>  static const struct dax_operations axon_ram_dax_ops = {
>>       .direct_access = axon_ram_dax_direct_access,
>> +
>>       .copy_from_iter = axon_ram_copy_from_iter,
>
> Unrelated whitespace change.  That being said - I don't think axonram has
> devmap support in any form, so this basically becomes dead code, doesn't
> it?
>
>> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
>> index 7abb240847c0..e7e5db07e339 100644
>> --- a/drivers/s390/block/dcssblk.c
>> +++ b/drivers/s390/block/dcssblk.c
>> @@ -52,6 +52,7 @@ static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
>>
>>  static const struct dax_operations dcssblk_dax_ops = {
>>       .direct_access = dcssblk_dax_direct_access,
>> +
>>       .copy_from_iter = dcssblk_dax_copy_from_iter,
>
> Same comments apply here.

Yes, however it seems these drivers / platforms have been living with
the lack of struct page for a long time. So they either don't use DAX,
or they have a constrained use case that never triggers
get_user_pages(). If it is the latter then they could introduce a new
configuration option that bypasses the pfn_t_devmap() check in
bdev_dax_supported() and fix up the get_user_pages() paths to fail.
So, I'd like to understand how these drivers have been using DAX
support without struct page to see if we need a workaround or we can
go ahead and delete this support. If the usage is limited to
execute-in-place perhaps we can do a constrained ->direct_access() for
just that case.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20 13:05     ` Jeff Layton
@ 2017-10-20 15:42       ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 15:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Hansen, Dave Chinner,
	linux-kernel, Christoph Hellwig, J. Bruce Fields, Linux MM,
	Alexander Viro, linux-fsdevel, Andrew Morton, Darrick J. Wong,
	linux-nvdimm

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesystem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is to wait for the
administrator to delete them or to auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately scheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a
> use-after-free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference from the moment the lease is created that is
only dropped when the lease is unlocked. If the break happens before
unlock we take an additional reference that is held while the break_work
is running. I'll add this as a comment.
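
A rough sketch of such a comment (wording invented here, not taken from
the posted patch) dropped into dax_lease_lm_break():

	static bool dax_lease_lm_break(struct file_lock *fl)
	{
		struct dax_lease *dl = fl->fl_owner;

		/*
		 * The lease holds a dl_count reference from
		 * __dax_truncate_lease() until dax_lease_lm_change(F_UNLCK)
		 * drops it, and ->lm_break() only runs while the lock is
		 * still on flc_lease, so dl_count is at least 1 here and
		 * the atomic_inc() below can never be a 0->1 transition.
		 */
		if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
			atomic_inc(&dl->dl_count);
			schedule_delayed_work(&dl->dl_break_work, HZ);
		}

		/* Tell the core lease code to wait for delayed work completion */
		fl->fl_break_time = 0;

		return false;
	}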

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
@ 2017-10-20 15:42       ` Dan Williams
  0 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 15:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Andrew Morton, Jan Kara, Matthew Wilcox, Dave Hansen,
	Dave Chinner, linux-kernel, J. Bruce Fields, Linux MM,
	Jeff Moyer, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Ross Zwisler, linux-xfs, Christoph Hellwig, linux-nvdimm

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesystem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
@ 2017-10-20 15:42       ` Dan Williams
  0 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 15:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Andrew Morton, Jan Kara, Matthew Wilcox, Dave Hansen,
	Dave Chinner, linux-kernel, J. Bruce Fields, Linux MM,
	Jeff Moyer, Alexander Viro, linux-fsdevel, Darrick J. Wong,
	Ross Zwisler, linux-xfs, Christoph Hellwig, linux-nvdimm

On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesytem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/Kconfig          |    1
>>  fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/locks.c          |   17 ++++-
>>  include/linux/dax.h |   23 ++++++
>>  include/linux/fs.h  |   22 +++++-
>>  mm/gup.c            |   27 ++++++-
>>  6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>>  config FS_DAX
>>       bool "Direct Access (DAX) support"
>>       depends on MMU
>> +     depends on FILE_LOCKING
>>       depends on !(ARM || MIPS || SPARC)
>>       select FS_IOMAP
>>       select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/genhd.h>
>>  #include <linux/highmem.h>
>>  #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>>  #include <linux/mm.h>
>>  #include <linux/mutex.h>
>>  #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> +     DAX_LEASE_PAGES,
>> +     DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> +     struct page **dl_pages;
>> +     unsigned long dl_nr_pages;
>> +     unsigned long dl_state;
>> +     struct file *dl_file;
>> +     atomic_t dl_count;
>> +     /*
>> +      * Once the lease is taken and the pages have references we
>> +      * start the reap_work to poll for lease release while acquiring
>> +      * fs locks that synchronize with truncate. So, either reap_work
>> +      * cleans up the dax_lease instances or truncate itself.
>> +      *
>> +      * The break_work sleepily polls for DMA completion and then
>> +      * unlocks/removes the lease.
>> +      */
>> +     struct delayed_work dl_reap_work;
>> +     struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_break_work.work);
>> +     unsigned long i;
>> +
>> +     /* wait for the gup path to finish recording pages in the lease */
>> +     if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +             return;
>> +     }
>> +
>> +     /* barrier pairs with dax_lease_set_pages() */
>> +     smp_mb__after_atomic();
>> +
>> +     /*
>> +      * If we see all pages idle at least once we can remove the
>> +      * lease. If we happen to race with someone else taking a
>> +      * reference on a page they will have their own lease to protect
>> +      * against truncate.
>> +      */
>> +     for (i = 0; i < dl->dl_nr_pages; i++)
>> +             if (page_ref_count(dl->dl_pages[i]) > 1) {
>> +                     schedule_delayed_work(&dl->dl_break_work, HZ);
>> +                     return;
>> +             }
>> +     vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> +     struct dax_lease *dl = container_of(work, typeof(*dl),
>> +                     dl_reap_work.work);
>> +     struct file *file = dl->dl_file;
>> +     struct inode *inode = file_inode(file);
>> +     struct address_space *mapping = inode->i_mapping;
>> +
>> +     if (mapping->a_ops->dax_flush_dma) {
>> +             mapping->a_ops->dax_flush_dma(inode);
>> +     } else {
>> +             /* FIXME: dax-filesystem needs to add dax-dma support */
>> +             break_allocated(inode, true);
>> +     }
>> +     put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20 15:23       ` Dan Williams
  (?)
  (?)
@ 2017-10-20 16:29         ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-20 16:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Gerald Schaefer

On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
> Yes, however it seems these drivers / platforms have been living with
> the lack of struct page for a long time. So they either don't use DAX,
> or they have a constrained use case that never triggers
> get_user_pages(). If it is the latter then they could introduce a new
> configuration option that bypasses the pfn_t_devmap() check in
> bdev_dax_supported() and fix up the get_user_pages() paths to fail.
> So, I'd like to understand how these drivers have been using DAX
> support without struct page to see if we need a workaround or we can
> go ahead delete this support. If the usage is limited to
> execute-in-place perhaps we can do a constrained ->direct_access() for
> just that case.

For axonram I doubt anyone is using it any more - it was only ever for
the IBM Cell blades, which were produced in rather limited numbers.
And Cell basically seems to be dead as far as I can tell.

For S/390, Martin might be able to tell us the status of xpram
in general and of its DAX support in particular.

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20 15:42       ` Dan Williams
  (?)
@ 2017-10-20 16:32         ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-20 16:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Chinner, Dave Hansen,
	Jeff Layton, linux-kernel, Christoph Hellwig, J. Bruce Fields,
	Linux MM, Alexander Viro, linux-fsdevel, Andrew Morton,
	Darrick J. Wong, linux-nvdimm

On Fri, Oct 20, 2017 at 08:42:00AM -0700, Dan Williams wrote:
> I agree, but it needs quite a bit more thought and restructuring of
> the truncate path. I also wonder how we reclaim those stranded
> filesystem blocks, but a first approximation is wait for the
> administrator to delete them or auto-delete them at the next mount.
> XFS seems well prepared to reflink-swap these DMA blocks around, but
> I'm not sure about EXT4.

reflink still is an optional and experimental feature in XFS.  That
being said we should not need to swap block pointers around on disk.
We just need to prevent the block allocator from reusing the blocks
for new allocations, and we have code for that, both for transactions
that haven't been committed to disk yet, and for deleted blocks
undergoing discard operations.

But as mentioned in my second mail from this morning I'm not even
sure we need that.  For short-term elevated page counts like normal
get_user_pages users I think we can just wait for the page count
to reach zero, while for abuses of get_user_pages for long term
pinning memory (not sure if anyone but rdma is doing that) we'll need
something like FL_LAYOUT leases to release the mapping.
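
A rough sketch of that wait-for-idle idea for the short-term case
(hypothetical helper, not part of the posted series; note the series
polls for page_ref_count() > 1, treating a count of 1 as idle, rather
than literally zero):

	/* needs <linux/page_ref.h> and <linux/delay.h> */
	static void dax_wait_pages_idle(struct page **pages, unsigned long nr)
	{
		unsigned long i;

		/* sleep-wait, as in the patch, until DMA drops every page */
		for (i = 0; i < nr; i++)
			while (page_ref_count(pages[i]) > 1)
				msleep(100);
	}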

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20 16:32         ` Christoph Hellwig
  (?)
@ 2017-10-20 17:27           ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 17:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Chinner, Dave Hansen,
	Jeff Layton, linux-kernel, J. Bruce Fields, Linux MM,
	Alexander Viro, linux-fsdevel, Andrew Morton, Darrick J. Wong,
	linux-nvdimm

On Fri, Oct 20, 2017 at 9:32 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Oct 20, 2017 at 08:42:00AM -0700, Dan Williams wrote:
>> I agree, but it needs quite a bit more thought and restructuring of
>> the truncate path. I also wonder how we reclaim those stranded
>> filesystem blocks, but a first approximation is wait for the
>> administrator to delete them or auto-delete them at the next mount.
>> XFS seems well prepared to reflink-swap these DMA blocks around, but
>> I'm not sure about EXT4.
>
> reflink still is an optional and experimental feature in XFS.  That
> being said we should not need to swap block pointers around on disk.
> We just need to prevent the block allocator from reusing the blocks
> for new allocations, and we have code for that, both for transactions
> that haven't been committed to disk yet, and for deleted blocks
> undergoing discard operations.
>
> But as mentioned in my second mail from this morning I'm not even
> sure we need that.  For short-term elevated page counts like normal
> get_user_pages users I think we can just wait for the page count
> to reach zero, while for abuses of get_user_pages for long term
> pinning memory (not sure if anyone but rdma is doing that) we'll need
> something like FL_LAYOUT leases to release the mapping.

I'll take a look at hooking this up through a page-idle callback. Can
I get some breadcrumbs to grep for from XFS folks on how to set/clear
the busy state of extents?

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20 17:27           ` Dan Williams
  (?)
@ 2017-10-20 20:36             ` Brian Foster
  -1 siblings, 0 replies; 143+ messages in thread
From: Brian Foster @ 2017-10-20 20:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, Matthew Wilcox, Dave Chinner, Dave Hansen,
	Jeff Layton, linux-kernel, J. Bruce Fields, Linux MM,
	Alexander Viro, linux-fsdevel, Darrick J. Wong, Andrew Morton,
	Christoph Hellwig, linux-nvdimm

On Fri, Oct 20, 2017 at 10:27:22AM -0700, Dan Williams wrote:
> On Fri, Oct 20, 2017 at 9:32 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Oct 20, 2017 at 08:42:00AM -0700, Dan Williams wrote:
> >> I agree, but it needs quite a bit more thought and restructuring of
> >> the truncate path. I also wonder how we reclaim those stranded
> >> filesystem blocks, but a first approximation is wait for the
> >> administrator to delete them or auto-delete them at the next mount.
> >> XFS seems well prepared to reflink-swap these DMA blocks around, but
> >> I'm not sure about EXT4.
> >
> > reflink still is an optional and experimental feature in XFS.  That
> > being said we should not need to swap block pointers around on disk.
> > We just need to prevent the block allocator from reusing the blocks
> > for new allocations, and we have code for that, both for transactions
> > that haven't been committed to disk yet, and for deleted blocks
> > undergoing discard operations.
> >
> > But as mentioned in my second mail from this morning I'm not even
> > sure we need that.  For short-term elevated page counts like normal
> > get_user_pages users I think we can just wait for the page count
> > to reach zero, while for abuses of get_user_pages for long term
> > pinning memory (not sure if anyone but rdma is doing that) we'll need
> > something like FL_LAYOUT leases to release the mapping.
> 
> I'll take a look at hooking this up through a page-idle callback. Can
> I get some breadcrumbs to grep for from XFS folks on how to set/clear
> the busy state of extents?

See fs/xfs/xfs_extent_busy.c.

Brian

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20 16:29         ` Christoph Hellwig
  (?)
@ 2017-10-20 22:29           ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-20 22:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel,
	Andrew Morton, Gerald Schaefer

On Fri, Oct 20, 2017 at 9:29 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
>> Yes, however it seems these drivers / platforms have been living with
>> the lack of struct page for a long time. So they either don't use DAX,
>> or they have a constrained use case that never triggers
>> get_user_pages(). If it is the latter then they could introduce a new
>> configuration option that bypasses the pfn_t_devmap() check in
>> bdev_dax_supported() and fix up the get_user_pages() paths to fail.
>> So, I'd like to understand how these drivers have been using DAX
>> support without struct page to see if we need a workaround or we can
>> go ahead delete this support. If the usage is limited to
>> execute-in-place perhaps we can do a constrained ->direct_access() for
>> just that case.
>
> For axonram I doubt anyone is using it any more - it was only ever for
> the IBM Cell blades, which were produced in rather limited numbers.
> And Cell basically seems to be dead as far as I can tell.
>
> For S/390 Martin might be able to help out what the status of xpram
> in general and DAX support in particular is.

Ok, I'd also like to kill DAX support in the brd driver. It's a source
of complexity and maintenance burden for zero benefit. It's the only
->direct_access() implementation that sleeps and it's the only
implementation where there is a non-linear relationship between
sectors and pfns. Having a 1:1 sector to pfn relationship will help
with the dma-extent-busy management since we don't need to keep
calling into the driver to map pfns back to sectors once we know the
pfn[0] sector[0] relationship.
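
For illustration, the kind of arithmetic a fixed pfn[0]/sector[0]
relationship permits (hypothetical helpers, names made up; assumes
512-byte sectors and a linear layout like the pmem driver's):

	#define DAX_PAGE_SECTORS	(PAGE_SIZE >> 9)	/* 512B sectors per page */

	static inline sector_t dax_pfn_to_sector(unsigned long pfn,
			unsigned long pfn0, sector_t sector0)
	{
		return sector0 + (sector_t)(pfn - pfn0) * DAX_PAGE_SECTORS;
	}

	static inline unsigned long dax_sector_to_pfn(sector_t sector,
			unsigned long pfn0, sector_t sector0)
	{
		return pfn0 + (unsigned long)((sector - sector0) / DAX_PAGE_SECTORS);
	}

No extra driver callbacks are needed once the base pair is known, which
is exactly the property a non-linear implementation like brd's breaks.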

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20 22:29           ` Dan Williams
  (?)
@ 2017-10-21  3:20             ` Matthew Wilcox
  -1 siblings, 0 replies; 143+ messages in thread
From: Matthew Wilcox @ 2017-10-21  3:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Gerald Schaefer

On Fri, Oct 20, 2017 at 03:29:57PM -0700, Dan Williams wrote:
> Ok, I'd also like to kill DAX support in the brd driver. It's a source
> of complexity and maintenance burden for zero benefit. It's the only
> ->direct_access() implementation that sleeps and it's the only
> implementation where there is a non-linear relationship between
> sectors and pfns. Having a 1:1 sector to pfn relationship will help
> with the dma-extent-busy management since we don't need to keep
> calling into the driver to map pfns back to sectors once we know the
> pfn[0] sector[0] relationship.

But these are important things that other block devices may / will want.

For example, I think it's entirely sensible to support ->direct_access
for RAID-0.  Dell are looking at various different options for having
one pmemX device per DIMM and using RAID to lash them together.
->direct_access makes no sense for RAID-5 or RAID-1, but RAID-0 makes
sense to me.

Last time we tried to take sleeping out, there were grumblings from people
with network block devices who thought they'd want to bring pages in
across the network.  I'm a bit less sympathetic to this because I don't
know anyone actively working on it, but the RAID-0 case is something I
think we should care about.

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-21  3:20             ` Matthew Wilcox
@ 2017-10-21  4:16               ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-21  4:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, linux-fsdevel,
	Andrew Morton, Christoph Hellwig, Gerald Schaefer

On Fri, Oct 20, 2017 at 8:20 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Oct 20, 2017 at 03:29:57PM -0700, Dan Williams wrote:
>> Ok, I'd also like to kill DAX support in the brd driver. It's a source
>> of complexity and maintenance burden for zero benefit. It's the only
>> ->direct_access() implementation that sleeps and it's the only
>> implementation where there is a non-linear relationship between
>> sectors and pfns. Having a 1:1 sector to pfn relationship will help
>> with the dma-extent-busy management since we don't need to keep
>> calling into the driver to map pfns back to sectors once we know the
>> pfn[0] sector[0] relationship.
>
> But these are important things that other block devices may / will want.
>
> For example, I think it's entirely sensible to support ->direct_access
> for RAID-0.  Dell are looking at various different options for having
> one pmemX device per DIMM and using RAID to lash them together.
> ->direct_access makes no sense for RAID-5 or RAID-1, but RAID-0 makes
> sense to me.
>
> Last time we tried to take sleeping out, there were grumblings from people
> with network block devices who thought they'd want to bring pages in
> across the network.  I'm a bit less sympathetic to this because I don't
> know anyone actively working on it, but the RAID-0 case is something I
> think we should care about.

True, good point. In fact we already support device-mapper striping
with ->direct_access(). I'd still like to go ahead with the sleeping
removal. When those folks come back and add network direct_access they
can do the hard work of figuring out cases where we need to call
direct_access in atomic contexts.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
  2017-10-20 17:27           ` Dan Williams
@ 2017-10-21  8:11             ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-21  8:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jeff Layton, Andrew Morton, Jan Kara,
	Matthew Wilcox, Dave Hansen, Dave Chinner, linux-kernel,
	J. Bruce Fields, Linux MM, Jeff Moyer, Alexander Viro,
	linux-fsdevel, Darrick J. Wong, Ross Zwisler, linux-xfs,
	linux-nvdimm

On Fri, Oct 20, 2017 at 10:27:22AM -0700, Dan Williams wrote:
> I'll take a look at hooking this up through a page-idle callback. Can
> I get some breadcrumbs to grep for from XFS folks on how to set/clear
> the busy state of extents?

As Brian pointed out it's the xfs_extent_busy.c file (and I pointed
out the same in a reply to the previous series).  Be careful because
you'll need a refcount or flags now that there are different busy
reasons.

I still think we'd be better off just blocking on an elevated page
count directly in truncate as that will avoid all the busy list
manipulations.
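
A minimal sketch of that "block in truncate" idea, with a made-up wait
queue (the real hook would be the final put of a ZONE_DEVICE page; this
is only a sketch of the idea, not mergeable code):

    /* Sketch only: 'dax_idle_wq' is hypothetical. */
    static DECLARE_WAIT_QUEUE_HEAD(dax_idle_wq);

    /*
     * Truncate would park here until transient references (e.g. in-flight
     * DMA set up via get_user_pages()) are dropped and the final put of
     * the page wakes the queue.
     */
    static void dax_wait_page_idle(struct page *page)
    {
            wait_event(dax_idle_wq, page_ref_count(page) == 1);
    }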


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-21  4:16               ` Dan Williams
@ 2017-10-21  8:15                 ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-21  8:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, Matthew Wilcox, linux-xfs, Linux MM,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	linux-fsdevel, Andrew Morton, Christoph Hellwig, Gerald Schaefer

On Fri, Oct 20, 2017 at 09:16:21PM -0700, Dan Williams wrote:
> > For example, I think it's entirely sensible to support ->direct_access
> > for RAID-0.  Dell are looking at various different options for having
> > one pmemX device per DIMM and using RAID to lash them together.
> > ->direct_access makes no sense for RAID-5 or RAID-1, but RAID-0 makes
> > sense to me.
> >
> > Last time we tried to take sleeping out, there were grumblings from people
> > with network block devices who thought they'd want to bring pages in
> > across the network.  I'm a bit less sympathetic to this because I don't
> > know anyone actively working on it, but the RAID-0 case is something I
> > think we should care about.
> 
> True, good point. In fact we already support device-mapper striping
> with ->direct_access(). I'd still like to go ahead with the sleeping
> removal. When those folks come back and add network direct_access they
> can do the hard work of figuring out cases where we need to call
> direct_access in atomic contexts.

It would be great to move DAX striping out of DM so that we don't need
to keep fake block devices around just for that.  In fact if Dell is so
interested in it, it would be great if they got a stripe/concat table
into ACPI so that the BIOS and OS can agree on it in a standardized way,
and we can just implement it in the nvdimm layer.

I agree that there is no reason at all to support sleeping in
->direct_access - it makes life painful for no gain at all.  If you
access remote memory over the network you will need local memory to support
mmap, so we might as well use the page cache instead of reinventing
it. (saying that with my remote pmem over NFS hat on).

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-20 16:29         ` Christoph Hellwig
@ 2017-10-23  5:18           ` Martin Schwidefsky
  -1 siblings, 0 replies; 143+ messages in thread
From: Martin Schwidefsky @ 2017-10-23  5:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, linux-fsdevel, Andrew Morton, Gerald Schaefer

On Fri, 20 Oct 2017 18:29:33 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
> > Yes, however it seems these drivers / platforms have been living with
> > the lack of struct page for a long time. So they either don't use DAX,
> > or they have a constrained use case that never triggers
> > get_user_pages(). If it is the latter then they could introduce a new
> > configuration option that bypasses the pfn_t_devmap() check in
> > bdev_dax_supported() and fix up the get_user_pages() paths to fail.
> > So, I'd like to understand how these drivers have been using DAX
> > support without struct page to see if we need a workaround or we can
> > go ahead delete this support. If the usage is limited to
> > execute-in-place perhaps we can do a constrained ->direct_access() for
> > just that case.  
> 
> For axonram I doubt anyone is using it any more - it was only ever for
> the IBM Cell blades, which were produced in a rather limited number.
> And Cell basically seems to be dead as far as I can tell.
> 
> For S/390 Martin might be able to help out what the status of xpram
> in general and DAX support in particular is.

This goes back to the time when DAX was called XIP. The initial design
point has been *not* to have struct pages for a large read-only memory
area. There is a block device driver for z/VM that maps a DCSS segment
somewhere in memory (no struct page!) with e.g. the complete /usr
filesystem. The xpram driver is a different beast and has nothing to
do with XIP/DAX.

Now, if any, there are very few users of the dcssblk driver out there.
The idea to save a few megabytes for /usr never really took off.

We have to look at our get_user_pages() implementation to see how hard
it would be to make it fail if the target address is for an area without
struct pages.
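
Roughly, such a check might look like the sketch below (illustrative
only, not the actual s390 or mm/gup.c code):

    /*
     * Illustrative only: refuse to hand out a struct page when the pte
     * points at memory with no memmap behind it, as a dcssblk/axonram
     * DAX mapping would.  The caller would then fail the GUP with -EFAULT.
     */
    static struct page *gup_normal_page(pte_t pte)
    {
            if (pte_special(pte) || !pfn_valid(pte_pfn(pte)))
                    return NULL;
            return pfn_to_page(pte_pfn(pte));
    }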

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-23  5:18           ` Martin Schwidefsky
@ 2017-10-23  8:55             ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-23  8:55 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christoph Hellwig, Andrew Morton, Jan Kara, linux-nvdimm,
	Benjamin Herrenschmidt, Heiko Carstens, linux-kernel, linux-xfs,
	Linux MM, Jeff Moyer, Paul Mackerras, Michael Ellerman,
	linux-fsdevel, Ross Zwisler, Gerald Schaefer

On Sun, Oct 22, 2017 at 10:18 PM, Martin Schwidefsky
<schwidefsky@de.ibm.com> wrote:
> On Fri, 20 Oct 2017 18:29:33 +0200
> Christoph Hellwig <hch@lst.de> wrote:
>
>> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
>> > Yes, however it seems these drivers / platforms have been living with
>> > the lack of struct page for a long time. So they either don't use DAX,
>> > or they have a constrained use case that never triggers
>> > get_user_pages(). If it is the latter then they could introduce a new
>> > configuration option that bypasses the pfn_t_devmap() check in
>> > bdev_dax_supported() and fix up the get_user_pages() paths to fail.
>> > So, I'd like to understand how these drivers have been using DAX
>> > support without struct page to see if we need a workaround or we can
>> > go ahead delete this support. If the usage is limited to
>> > execute-in-place perhaps we can do a constrained ->direct_access() for
>> > just that case.
>>
>> For axonram I doubt anyone is using it any more - it was only ever for
>> the IBM Cell blades, which were produced in a rather limited number.
>> And Cell basically seems to be dead as far as I can tell.
>>
>> For S/390 Martin might be able to help out what the status of xpram
>> in general and DAX support in particular is.
>
> This goes back to the time when DAX was called XIP. The initial design
> point has been *not* to have struct pages for a large read-only memory
> area. There is a block device driver for z/VM that maps a DCSS segment
> somewhere in memory (no struct page!) with e.g. the complete /usr
> filesystem. The xpram driver is a different beast and has nothing to
> do with XIP/DAX.
>
> Now, if any, there are very few users of the dcssblk driver out there.
> The idea to save a few megabytes for /usr never really took off.
>
> We have to look at our get_user_pages() implementation to see how hard
> it would be to make it fail if the target address is for an area without
> struct pages.

For read-only memory I think we can enable a subset of DAX, and
explicitly turn off the paths that require get_user_pages(). However,
I wonder if anyone has tested DAX with dcssblk because fork() requires
get_user_pages()?


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-23  8:55             ` Dan Williams
@ 2017-10-23 10:44               ` Martin Schwidefsky
  -1 siblings, 0 replies; 143+ messages in thread
From: Martin Schwidefsky @ 2017-10-23 10:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Gerald Schaefer

On Mon, 23 Oct 2017 01:55:20 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Sun, Oct 22, 2017 at 10:18 PM, Martin Schwidefsky
> <schwidefsky@de.ibm.com> wrote:
> > On Fri, 20 Oct 2017 18:29:33 +0200
> > Christoph Hellwig <hch@lst.de> wrote:
> >  
> >> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:  
> >> > Yes, however it seems these drivers / platforms have been living with
> >> > the lack of struct page for a long time. So they either don't use DAX,
> >> > or they have a constrained use case that never triggers
> >> > get_user_pages(). If it is the latter then they could introduce a new
> >> > configuration option that bypasses the pfn_t_devmap() check in
> >> > bdev_dax_supported() and fix up the get_user_pages() paths to fail.
> >> > So, I'd like to understand how these drivers have been using DAX
> >> > support without struct page to see if we need a workaround or we can
> >> > go ahead delete this support. If the usage is limited to
> >> > execute-in-place perhaps we can do a constrained ->direct_access() for
> >> > just that case.  
> >>
> >> For axonram I doubt anyone is using it any more - it was only ever for
> >> the IBM Cell blades, which were produced in a rather limited number.
> >> And Cell basically seems to be dead as far as I can tell.
> >>
> >> For S/390 Martin might be able to help out what the status of xpram
> >> in general and DAX support in particular is.  
> >
> > This goes back to the time when DAX was called XIP. The initial design
> > point has been *not* to have struct pages for a large read-only memory
> > area. There is a block device driver for z/VM that maps a DCSS segment
> > somewhere in memory (no struct page!) with e.g. the complete /usr
> > filesystem. The xpram driver is a different beast and has nothing to
> > do with XIP/DAX.
> >
> > Now, if any, there are very few users of the dcssblk driver out there.
> > The idea to save a few megabytes for /usr never really took off.
> >
> > We have to look at our get_user_pages() implementation to see how hard
> > it would be to make it fail if the target address is for an area without
> > struct pages.  
> 
> For read-only memory I think we can enable a subset of DAX, and
> explicitly turn off the paths that require get_user_pages(). However,
> I wonder if anyone has tested DAX with dcssblk because fork() requires
> get_user_pages()?
 
I did not test it recently, someone else might have. Gerald?

Looking at the code I see this in the s390 version of gup_pte_range:

        mask = (write ? _PAGE_PROTECT : 0) | _PAGE_INVALID | _PAGE_SPECIAL;
	...
                if ((pte_val(pte) & mask) != 0)
                        return 0;
	...

The XIP code used the pte_mkspecial mechanics to make it work. As far as
I can see pfn_t_devmap() returns true for the DAX mappings, yes?
Then I would say that dcssblk and DAX currently do not work together.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax
  2017-10-23 10:44               ` Martin Schwidefsky
@ 2017-10-23 11:20                 ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-23 11:20 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Jan Kara, linux-nvdimm, Benjamin Herrenschmidt, Heiko Carstens,
	linux-kernel, linux-xfs, Linux MM, Paul Mackerras,
	Michael Ellerman, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Gerald Schaefer

On Mon, Oct 23, 2017 at 3:44 AM, Martin Schwidefsky
<schwidefsky@de.ibm.com> wrote:
> On Mon, 23 Oct 2017 01:55:20 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> On Sun, Oct 22, 2017 at 10:18 PM, Martin Schwidefsky
>> <schwidefsky@de.ibm.com> wrote:
>> > On Fri, 20 Oct 2017 18:29:33 +0200
>> > Christoph Hellwig <hch@lst.de> wrote:
>> >
>> >> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
>> >> > Yes, however it seems these drivers / platforms have been living with
>> >> > the lack of struct page for a long time. So they either don't use DAX,
>> >> > or they have a constrained use case that never triggers
>> >> > get_user_pages(). If it is the latter then they could introduce a new
>> >> > configuration option that bypasses the pfn_t_devmap() check in
>> >> > bdev_dax_supported() and fix up the get_user_pages() paths to fail.
>> >> > So, I'd like to understand how these drivers have been using DAX
>> >> > support without struct page to see if we need a workaround or we can
>> >> > go ahead delete this support. If the usage is limited to
>> >> > execute-in-place perhaps we can do a constrained ->direct_access() for
>> >> > just that case.
>> >>
>> >> For axonram I doubt anyone is using it any more - it was only ever for
>> >> the IBM Cell blades, which were produced in a rather limited number.
>> >> And Cell basically seems to be dead as far as I can tell.
>> >>
>> >> For S/390 Martin might be able to help out what the status of xpram
>> >> in general and DAX support in particular is.
>> >
>> > This goes back to the time when DAX was called XIP. The initial design
>> > point has been *not* to have struct pages for a large read-only memory
>> > area. There is a block device driver for z/VM that maps a DCSS segment
>> > somewhere in memory (no struct page!) with e.g. the complete /usr
>> > filesystem. The xpram driver is a different beast and has nothing to
>> > do with XIP/DAX.
>> >
>> > Now, if any, there are very few users of the dcssblk driver out there.
>> > The idea to save a few megabytes for /usr never really took off.
>> >
>> > We have to look at our get_user_pages() implementation to see how hard
>> > it would be to make it fail if the target address is for an area without
>> > struct pages.
>>
>> For read-only memory I think we can enable a subset of DAX, and
>> explicitly turn off the paths that require get_user_pages(). However,
>> I wonder if anyone has tested DAX with dcssblk because fork() requires
>> get_user_pages()?
>
> I did not test it recently, someone else might have. Gerald?
>
> Looking at the code I see this in the s390 version of gup_pte_range:
>
>         mask = (write ? _PAGE_PROTECT : 0) | _PAGE_INVALID | _PAGE_SPECIAL;
>         ...
>                 if ((pte_val(pte) & mask) != 0)
>                         return 0;
>         ...
>
> The XIP code used the pte_mkspecial mechanics to make it work. As far as
> I can see pfn_t_devmap() returns true for the DAX mappings, yes?

Yes, but that's only for get_user_pages_fast() support.

> Then I would say that dcssblk and DAX currently do not work together.

I think at a minimum we need a new pfn_t flag for the 'special' bit to
at least indicate that DAX mappings of dcssblk and axonram do not
support normal get_user_pages(). Then I don't need to explicitly
disable DAX in the !pfn_t_devmap() case. I think I also want to split
the "pfn_to_virt()" and the "sector to pfn" operations into distinct
dax_operations rather than doing both in one ->direct_access(). This
supports storing pfns in the fs/dax radix rather than sectors.

In other words, the pfn_t_devmap() requirement was only about making
get_user_pages() safely fail, and pte_special() fills that
requirement.
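
Something along these lines for the flag, in the style of
include/linux/pfn_t.h (bit position and naming are illustrative, not
necessarily the final form):

    /*
     * Sketch: a flag marking pfns with no struct page behind them, so
     * get_user_pages() can fail such mappings gracefully instead of
     * DAX being disabled outright for them.
     */
    #define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))

    static inline bool pfn_t_special(pfn_t pfn)
    {
            return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
    }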

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-20  9:31     ` Christoph Hellwig
@ 2017-10-26 10:58       ` Jan Kara
  -1 siblings, 0 replies; 143+ messages in thread
From: Jan Kara @ 2017-10-26 10:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Sean Hefty, Jeff Layton, Matthew Wilcox, linux-rdma,
	Michael Ellerman, Jason Gunthorpe, Doug Ledford, Hal Rosenstock,
	Dave Chinner, linux-fsdevel, Alexander Viro, Gerald Schaefer,
	linux-nvdimm, linux-kernel, linux-xfs, Martin Schwidefsky, akpm,
	Darrick J. Wong, Kirill A. Shutemov

On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > I'd like to brainstorm how we can do something better.
> > 
> > How about:
> > 
> > If we hit a page with an elevated refcount in truncate / hole punch
> > etc for a DAX file system we do not free the blocks in the file system,
> > but add it to the extent busy list.  We mark the page as delayed
> > free (e.g. page flag?) so that when it finally hits refcount zero we
> > call back into the file system to remove it from the busy list.
> 
> Brainstorming some more:
> 
> Given that on a DAX file there shouldn't be any long-term page
> references after we unmap it from the page table and don't allow
> get_user_pages calls why not wait for the references for all
> DAX pages to go away first?  E.g. if we find a DAX page in
> truncate_inode_pages_range that has an elevated refcount we set
> a new flag to prevent new references from showing up, and then
> simply wait for it to go away.  Instead of a busy wait we can
> do this through a few hashed waitqueues in dev_pagemap.  And in
> fact put_zone_device_page already gets called when putting the
> last page so we can handle the wakeup from there.
> 
> In fact if we can't find a page flag for the 'stop new callers'
> thing we could probably come up with a way to do that through
> dev_pagemap somehow, but I'm not sure how efficient that would
> be.

We were talking about this yesterday with Dan so some more brainstorming
from us. We can implement the solution with an extent busy list in ext4
relatively easily - we already have such a list, similarly to XFS.
There would be some modifications needed but nothing too complex. The
biggest downside of this solution I see is that it requires a per-filesystem
solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
may have problems and ext2 definitely will need some modifications.
Invisible used blocks may be surprising to users at times although given
page refs should be relatively short term, that should not be a big issue.
But are we guaranteed page refs are short term? E.g. if someone creates
v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
can be rather long-term similarly as in RDMA case. Also freeing of blocks
on page reference drop is another async entry point into the filesystem
which could unpleasantly surprise us but I guess workqueues would solve
that reasonably fine.

WRT waiting for page refs to be dropped before proceeding with truncate
(or punch hole for that matter - that case is even nastier since we don't
have i_size to guard us): what I like about this solution is that it is
very visible that there's something unusual going on with the file being
truncated / punched, and so problems are easier to diagnose / fix from
the admin side. So far we have guarded hole punching from concurrent
faults (and get_user_pages() does fault once you do
unmap_mapping_range()) with I_MMAP_LOCK (or its equivalent in ext4). We
cannot easily wait for page refs to be dropped under I_MMAP_LOCK as that
could deadlock - the most obvious case Dan came up with is when GUP
obtains a ref to page A, then hole punch comes along grabbing I_MMAP_LOCK
and waiting for the page ref on A to be dropped, and then GUP blocks
trying to fault in another page.

I think we cannot easily prevent new page references from being grabbed
as you write above, since nobody expects stuff like get_page() to fail.
But I think that unmapping the relevant pages and then preventing them
from being faulted in again is workable and stops GUP as well. The
problem with that, though, is what to do with page faults to such pages -
you cannot just fail them for hole punch, and you cannot easily allocate
new blocks either. So we are back at a situation where we need to detach
blocks from the inode and then wait for page refs to be dropped - so some
form of busy extents. Am I missing something?
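
To make the bookkeeping concrete, here is a minimal user-space toy model
of the busy-extent idea sketched above (all names are made up, locking
and per-filesystem integration are omitted; in the kernel the release
side would be driven from the final put of the devmap page):

#include <stdio.h>
#include <stdlib.h>

struct busy_extent {
	unsigned long start_block;
	unsigned long nr_blocks;
	struct busy_extent *next;
};

static struct busy_extent *busy_list;

/* truncate path: park blocks that may still be under DMA instead of freeing */
static void park_busy_extent(unsigned long start, unsigned long nr)
{
	struct busy_extent *be = malloc(sizeof(*be));

	be->start_block = start;
	be->nr_blocks = nr;
	be->next = busy_list;
	busy_list = be;
}

/* last page reference dropped: now the blocks can go back to the allocator */
static void release_busy_extent(unsigned long start)
{
	struct busy_extent **p;

	for (p = &busy_list; *p; p = &(*p)->next) {
		if ((*p)->start_block == start) {
			struct busy_extent *be = *p;

			*p = be->next;
			printf("freeing %lu blocks at %lu\n",
			       be->nr_blocks, be->start_block);
			free(be);
			return;
		}
	}
}

int main(void)
{
	park_busy_extent(1024, 8);	/* truncate saw elevated page refcounts */
	release_busy_extent(1024);	/* refcount finally hit zero */
	return 0;
}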

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 10:58       ` Jan Kara
@ 2017-10-26 23:51         ` Williams, Dan J
  -1 siblings, 0 replies; 143+ messages in thread
From: Williams, Dan J @ 2017-10-26 23:51 UTC (permalink / raw)
  To: hch, jack
  Cc: mhocko, benh, dave.hansen, heiko.carstens, bfields, linux-mm,
	paulus, Hefty, Sean, jlayton, mawilcox, linux-rdma, mpe,
	dledford, jgunthorpe, hal.rosenstock, david, schwidefsky, viro,
	gerald.schaefer, linux-nvdimm, darrick.wong, linux-kernel,
	linux-xfs, linux-fsdevel, akpm, kirill.shutemov

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > > 
> > > How about:
> > > 
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add it to the extent busy list.  We mark the page as delayed
> > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> > 
> > Brainstorming some more:
> > 
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls why not wait for the references for all
> > DAX pages to go away first?  E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount we set
> > a new flag to prevent new references from showing up, and then
> > simply wait for it to go away.  Instead of a busy wait we can
> > do this through a few hashed waitqueues in dev_pagemap.  And in
> > fact put_zone_device_page already gets called when putting the
> > last page so we can handle the wakeup from there.
> > 
> > In fact if we can't find a page flag for the 'stop new callers'
> > thing we could probably come up with a way to do that through
> > dev_pagemap somehow, but I'm not sure how efficient that would
> > be.
> 
> We were talking about this yesterday with Dan, so here is some more
> brainstorming from us. We can implement the busy-extent solution in ext4
> relatively easily - we already have such a list, similar to XFS. There
> would be some modifications needed but nothing too complex. The biggest
> downside of this solution I see is that it requires a per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however
> btrfs may have problems and ext2 will definitely need some modifications.
> Invisible used blocks may be surprising to users at times, although given
> that page refs should be relatively short term, that should not be a big
> issue. But are we guaranteed page refs are short term? E.g. if someone
> creates a v4l2 videobuf in a MAP_SHARED mapping of a file on a DAX
> filesystem, page refs can be rather long-term, similarly to the RDMA
> case. Also, freeing blocks on page reference drop is another async entry
> point into the filesystem which could unpleasantly surprise us, but I
> guess workqueues would solve that reasonably well.
> 
> WRT waiting for page refs to be dropped before proceeding with truncate
> (or punch hole for that matter - that case is even nastier since we don't
> have i_size to guard us): what I like about this solution is that it is
> very visible that there's something unusual going on with the file being
> truncated / punched, and so problems are easier to diagnose / fix from
> the admin side. So far we have guarded hole punching from concurrent
> faults (and get_user_pages() does fault once you do
> unmap_mapping_range()) with I_MMAP_LOCK (or its equivalent in ext4). We
> cannot easily wait for page refs to be dropped under I_MMAP_LOCK as that
> could deadlock - the most obvious case Dan came up with is when GUP
> obtains a ref to page A, then hole punch comes along grabbing I_MMAP_LOCK
> and waiting for the page ref on A to be dropped, and then GUP blocks
> trying to fault in another page.
> 
> I think we cannot easily prevent new page references from being grabbed
> as you write above, since nobody expects stuff like get_page() to fail.
> But I think that unmapping the relevant pages and then preventing them
> from being faulted in again is workable and stops GUP as well. The
> problem with that, though, is what to do with page faults to such pages -
> you cannot just fail them for hole punch, and you cannot easily allocate
> new blocks either. So we are back at a situation where we need to detach
> blocks from the inode and then wait for page refs to be dropped - so some
> form of busy extents. Am I missing something?
> 

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty, especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series; do you think the extent-busy approach would be cleaner?

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages
    
    get_user_pages() pins file-backed memory pages for access by dma
    devices. However, it only pins the memory pages, not the page-to-file
    offset association. If a file is truncated, the pages are mapped out of
    the file and dma may continue indefinitely into a page that is owned by
    a device driver. This breaks coherency of the file vs dma, but the
    assumption is that if userspace wants the file-space truncated it does
    not matter what data is inbound from the device; it is not relevant
    anymore.
    
    The assumptions of the truncate-page-cache model are broken by DAX where
    the target DMA page *is* the filesystem block. Leaving the page pinned
    for DMA, but truncating the file block out of the file, means that the
    filesystem is free to reallocate a block under active DMA to another
    file!
    
    Here are some possible options for fixing this situation ('truncate' and
    'fallocate(punch hole)' are synonymous below):
    
        1/ Fail truncate while any file blocks might be under dma
    
        2/ Block (sleep-wait) truncate while any file blocks might be under
           dma
    
        3/ Remap file blocks to a "lost+found"-like file-inode where
           dma can continue and we might see what inbound data from DMA was
           mapped out of the original file. Blocks in this file could be
           freed back to the filesystem when dma eventually ends.
    
        4/ List the blocks under DMA in the extent busy list and either hold
           off commit of the truncate transaction until DMA completes, or otherwise
           keep the blocks marked busy so the allocator does not reuse them
           until DMA completes.
    
        5/ Disable dax until option 3 or another long term solution has been
           implemented. However, filesystem-dax is still marked experimental
           for concerns like this.
    
    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per filesystem enabling to remap blocks from one inode
    to another.  Option 2 is implemented in this patch for the DAX path with
    the expectation that non-transient users of get_user_pages() (RDMA) are
    disallowed from setting up dax mappings and that the potential delay
    introduced to the truncate path is acceptable compared to the response
    time of the page cache case. This can only be seen as a stop-gap until
    we can solve the problem of safely sequestering unallocated filesystem
    blocks under active dma.
    
    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.
    
    Credit for option 3 goes to Dave Hansen, who proposed something similar
    as an alternative way to solve the problem that MAP_DIRECT was trying to
    solve. Credit for option 4 goes to Christoph Hellwig.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index fd5d385988d1..f2c98f9cb833 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -346,6 +346,19 @@ static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -487,6 +500,97 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static int wait_page(atomic_t *_refcount)
+{
+	struct page *page = container_of(_refcount, struct page, _refcount);
+	struct inode *inode = page->inode;
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	i_daxdma_unlock_shared(inode);
+	schedule();
+	i_daxdma_lock_shared(inode);
+
+	/*
+	 * if we bounced the daxdma_lock then we need to rescan the
+	 * truncate area.
+	 */
+	return 1;
+}
+
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t	indices[PAGEVEC_SIZE];
+	pgoff_t	start, end, index;
+	struct pagevec pvec;
+	unsigned i;
+
+	lockdep_assert_held(&inode->i_dax_dmasem);
+
+	if (lstart < 0 || len < -1)
+		return;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!dax_mapping(mapping))
+		return;
+
+	if (mapping->nrexceptional == 0)
+		return;
+
+	if (len == -1)
+		end = -1;
+	else
+		end = (lstart + len) >> PAGE_SHIFT;
+	start = lstart >> PAGE_SHIFT;
+
+retry:
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (page && wait_on_devmap_idle(&page->_refcount,
+						wait_page,
+						TASK_UNINTERRUPTIBLE) != 0) {
+				/*
+				 * We dropped the dma lock, so we need
+				 * to revalidate that previously seen
+				 * idle pages are still idle.
+				 */
+				goto retry;
+			}
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+	}
+}
+EXPORT_SYMBOL_GPL(dax_wait_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -509,8 +613,10 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 out:
 	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..95408e87a96c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -192,6 +192,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 	inode->i_flctx = NULL;
+	i_daxdma_init(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ea21ebfd1889..6ce1c50519e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline void dax_wait_dma(struct address_space *mapping, loff_t lstart,
+		loff_t len)
+{
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..cd5b4a092d1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore	i_dax_dmasem;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -747,6 +750,59 @@ static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 	down_write_nested(&inode->i_rwsem, subclass);
 }
 
+#ifdef CONFIG_FS_DAX
+static inline void i_daxdma_init(struct inode *inode)
+{
+	init_rwsem(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+	down_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+	up_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+	/*
+	 * The write lock is taken under mmap_sem in the
+	 * get_user_pages() path the read lock nests in the truncate
+	 * path.
+	 */
+#define DAXDMA_TRUNCATE_CLASS 1
+	down_read_nested(&inode->i_dax_dmasem, DAXDMA_TRUNCATE_CLASS);
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+	up_read(&inode->i_dax_dmasem);
+}
+#else /* CONFIG_FS_DAX */
+static inline void i_daxdma_init(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+}
+#endif /* CONFIG_FS_DAX */
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
 
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
 					 unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..fd7b2a2e2d19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -579,6 +579,41 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+static struct inode *do_dax_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct file *file;
+	struct inode *inode;
+
+	if (!(foll_flags & FOLL_GET))
+		return NULL;
+	if (!vma_is_dax(vma))
+		return NULL;
+	file = vma->vm_file;
+	inode = file_inode(file);
+	if (inode->i_mode == S_IFCHR)
+		return NULL;
+	return inode;
+}
+
+static struct inode *dax_truncate_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct inode *inode = do_dax_lock(vma, foll_flags);
+
+	if (!inode)
+		return NULL;
+	i_daxdma_lock(inode);
+	return inode;
+}
+
+static void dax_truncate_unlock(struct inode *inode)
+{
+	if (!inode)
+		return;
+	i_daxdma_unlock(inode);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -659,6 +694,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	do {
 		struct page *page;
+		struct inode *inode;
 		unsigned int foll_flags = gup_flags;
 		unsigned int page_increm;
 
@@ -693,7 +729,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		inode = dax_truncate_lock(vma, foll_flags);
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_truncate_unlock(inode);
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,

commit 67d952314e9989b3b1945c50488f4a0f760264c3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Tue Oct 24 13:41:22 2017 -0700

    xfs: wire up dax dma waiting
    
    The dax-dma vs truncate collision avoidance involves acquiring the new
    i_dax_dmasem and validating that no ranges that are to be mapped out of
    the file are active for dma. If any are found we wait for page idle
    and retry the scan. The locations where we implement this wait line up
    with where we currently wait for pnfs layout leases to expire.
    
    Since we need both dma to be idle and leases to be broken, and since
    xfs_break_layouts drops locks, we need to retry the dma busy scan until
    we can complete one that finds no busy pages.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..e3ec46c28c60 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,7 +347,7 @@ xfs_file_aio_write_checks(
 		return error;
 
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error < 0)
 		return error;
 
 	/*
@@ -762,7 +762,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -771,10 +771,20 @@ xfs_file_fallocate(
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
+retry:
 	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, offset, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
 	error = xfs_break_layouts(inode, &iolock);
-	if (error)
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
+	}
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4ec5b7f45401..783f15894b7b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -171,7 +171,14 @@ xfs_ilock_attr_map_shared(
  * taken in places where we need to invalidate the page cache in a race
  * free manner (e.g. truncate, hole punch and other extent manipulation
  * functions).
- */
+ *
+ * The XFS_DAXDMA_LOCK_SHARED lock is a CONFIG_FS_DAX special case lock
+ * for synchronizing truncate vs ongoing DMA. The get_user_pages() path
+ * will hold this lock exclusively when incrementing page reference
+ * counts for DMA. Before an extent can be truncated we need to complete
+ * a validate-idle sweep of all pages in the range while holding this
+ * lock in shared mode.
+*/
 void
 xfs_ilock(
 	xfs_inode_t		*ip,
@@ -192,6 +199,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_lock_shared(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -328,6 +338,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_unlock_shared(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..0662edf00529 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -283,10 +283,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAXDMA_LOCK_SHARED	(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAXDMA_LOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -294,7 +296,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAXDMA_LOCK_SHARED, "XFS_DAXDMA_LOCK_SHARED" }
 
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..fd384ea00ede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	int			error;
 
 	/*
@@ -637,18 +637,6 @@ xfs_ioc_space(
 	if (filp->f_mode & FMODE_NOCMTIME)
 		flags |= XFS_PREALLOC_INVISIBLE;
 
-	error = mnt_want_write_file(filp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
-	if (error)
-		goto out_unlock;
-
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -659,10 +647,31 @@ xfs_ioc_space(
 		bf->l_start += XFS_ISIZE(ip);
 		break;
 	default:
-		error = -EINVAL;
+		return -EINVAL;
+	}
+
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+
+retry:
+	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, bf->l_start, bf->l_len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
+	error = xfs_break_layouts(inode, &iolock);
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
 	}
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	iolock |= XFS_MMAPLOCK_EXCL;
+
 	/*
 	 * length of <= 0 for resv/unresv/zero is invalid.  length for
 	 * alloc/free is ignored completely and we have no idea what userspace
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..5f4d46b3cd7f 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -35,18 +35,19 @@ xfs_break_layouts(
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
+	int			error, did_unlock = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock = XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	return error < 0 ? error : did_unlock;
 }
 
 /*


* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-26 23:51         ` Williams, Dan J
  0 siblings, 0 replies; 143+ messages in thread
From: Williams, Dan J @ 2017-10-26 23:51 UTC (permalink / raw)
  To: hch, jack
  Cc: schwidefsky, darrick.wong, dledford, linux-rdma, linux-fsdevel,
	bfields, linux-mm, heiko.carstens, dave.hansen, linux-xfs,
	linux-kernel, jmoyer, viro, kirill.shutemov, akpm

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > > 
> > > How about:
> > > 
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add it to the extent busy list.  We mark the page as delayed
> > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> > 
> > Brainstorming some more:
> > 
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls why not wait for the references for all
> > DAX pages to go away first?  E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount we set
> > a new flag to prevent new references from showing up, and then
> > simply wait for it to go away.  Instead of a busy wait we can
> > do this through a few hashed waitqueues in dev_pagemap.  And in
> > fact put_zone_device_page already gets called when putting the
> > last page so we can handle the wakeup from there.
> > 
> > In fact if we can't find a page flag for the 'stop new callers'
> > thing we could probably come up with a way to do that through
> > dev_pagemap somehow, but I'm not sure how efficient that would
> > be.
> 
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with extent busy list in ext4
> relatively easily - we already have such list currently similarly to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
> 
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
> 
> I think we cannot easily prevent new page references to be grabbed as you
> write above since nobody expects stuff like get_page() to fail. But I 
> think that unmapping relevant pages and then preventing them to be faulted
> in again is workable and stops GUP as well. The problem with that is though
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?
> 

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series, do you think the extent-busy approach would be cleaner?

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages
    
    get_user_pages() pins file backed memory pages for access by dma
    devices. However, it only pins the memory pages not the page-to-file
    offset association. If a file is truncated the pages are mapped out of
    the file and dma may continue indefinitely into a page that is owned by
    a device driver. This breaks coherency of the file vs dma, but the
    assumption is that if userspace wants the file-space truncated it does
    not matter what data is inbound from the device, it is not relevant
    anymore.
    
    The assumptions of the truncate-page-cache model are broken by DAX where
    the target DMA page *is* the filesystem block. Leaving the page pinned
    for DMA, but truncating the file block out of the file, means that the
    filesystem is free to reallocate a block under active DMA to another
    file!
    
    Here are some possible options for fixing this situation ('truncate' and
    'fallocate(punch hole)' are synonymous below):
    
        1/ Fail truncate while any file blocks might be under dma
    
        2/ Block (sleep-wait) truncate while any file blocks might be under
           dma
    
        3/ Remap file blocks to a "lost+found"-like file-inode where
           dma can continue and we might see what inbound data from DMA was
           mapped out of the original file. Blocks in this file could be
           freed back to the filesystem when dma eventually ends.
    
        4/ List the blocks under DMA in the extent busy list and either hold
           off commit of the truncate transaction until DMA completes, or otherwise
           keep the blocks marked busy so the allocator does not reuse them
           until DMA completes.
    
        5/ Disable dax until option 3 or another long term solution has been
           implemented. However, filesystem-dax is still marked experimental
           for concerns like this.
    
    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per filesystem enabling to remap blocks from one inode
    to another.  Option 2 is implemented in this patch for the DAX path with
    the expectation that non-transient users of get_user_pages() (RDMA) are
    disallowed from setting up dax mappings and that the potential delay
    introduced to the truncate path is acceptable compared to the response
    time of the page cache case. This can only be seen as a stop-gap until
    we can solve the problem of safely sequestering unallocated filesystem
    blocks under active dma.
    
    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.
    
    Credit for option 3 goes to Dave Hansen, who proposed something similar
    as an alternative way to solve the problem that MAP_DIRECT was trying to
    solve. Credit for option 4 goes to Christoph Hellwig.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index fd5d385988d1..f2c98f9cb833 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -346,6 +346,19 @@ static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -487,6 +500,97 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static int wait_page(atomic_t *_refcount)
+{
+	struct page *page = container_of(_refcount, struct page, _refcount);
+	struct inode *inode = page->inode;
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	i_daxdma_unlock_shared(inode);
+	schedule();
+	i_daxdma_lock_shared(inode);
+
+	/*
+	 * if we bounced the daxdma_lock then we need to rescan the
+	 * truncate area.
+	 */
+	return 1;
+}
+
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t	indices[PAGEVEC_SIZE];
+	pgoff_t	start, end, index;
+	struct pagevec pvec;
+	unsigned i;
+
+	lockdep_assert_held(&inode->i_dax_dmasem);
+
+	if (lstart < 0 || len < -1)
+		return;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!dax_mapping(mapping))
+		return;
+
+	if (mapping->nrexceptional == 0)
+		return;
+
+	if (len == -1)
+		end = -1;
+	else
+		end = (lstart + len) >> PAGE_SHIFT;
+	start = lstart >> PAGE_SHIFT;
+
+retry:
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (page && wait_on_devmap_idle(&page->_refcount,
+						wait_page,
+						TASK_UNINTERRUPTIBLE) != 0) {
+				/*
+				 * We dropped the dma lock, so we need
+				 * to revalidate that previously seen
+				 * idle pages are still idle.
+				 */
+				goto retry;
+			}
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+	}
+}
+EXPORT_SYMBOL_GPL(dax_wait_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -509,8 +613,10 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 out:
 	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..95408e87a96c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -192,6 +192,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 	inode->i_flctx = NULL;
+	i_daxdma_init(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ea21ebfd1889..6ce1c50519e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline void dax_wait_dma(struct address_space *mapping, loff_t lstart,
+		loff_t len)
+{
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..cd5b4a092d1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore	i_dax_dmasem;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -747,6 +750,59 @@ static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 	down_write_nested(&inode->i_rwsem, subclass);
 }
 
+#ifdef CONFIG_FS_DAX
+static inline void i_daxdma_init(struct inode *inode)
+{
+	init_rwsem(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+	down_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+	up_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+	/*
+	 * The write lock is taken under mmap_sem in the
+	 * get_user_pages() path; the read lock nests in the truncate
+	 * path.
+	 */
+#define DAXDMA_TRUNCATE_CLASS 1
+	down_read_nested(&inode->i_dax_dmasem, DAXDMA_TRUNCATE_CLASS);
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+	up_read(&inode->i_dax_dmasem);
+}
+#else /* CONFIG_FS_DAX */
+static inline void i_daxdma_init(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+}
+#endif /* CONFIG_FS_DAX */
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
 
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
 					 unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..fd7b2a2e2d19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -579,6 +579,41 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+static struct inode *do_dax_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct file *file;
+	struct inode *inode;
+
+	if (!(foll_flags & FOLL_GET))
+		return NULL;
+	if (!vma_is_dax(vma))
+		return NULL;
+	file = vma->vm_file;
+	inode = file_inode(file);
+	if (inode->i_mode == S_IFCHR)
+		return NULL;
+	return inode;
+}
+
+static struct inode *dax_truncate_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct inode *inode = do_dax_lock(vma, foll_flags);
+
+	if (!inode)
+		return NULL;
+	i_daxdma_lock(inode);
+	return inode;
+}
+
+static void dax_truncate_unlock(struct inode *inode)
+{
+	if (!inode)
+		return;
+	i_daxdma_unlock(inode);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -659,6 +694,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	do {
 		struct page *page;
+		struct inode *inode;
 		unsigned int foll_flags = gup_flags;
 		unsigned int page_increm;
 
@@ -693,7 +729,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		inode = dax_truncate_lock(vma, foll_flags);
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_truncate_unlock(inode);
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,

commit 67d952314e9989b3b1945c50488f4a0f760264c3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Tue Oct 24 13:41:22 2017 -0700

    xfs: wire up dax dma waiting
    
    The dax-dma vs truncate collision avoidance involves acquiring the new
    i_dax_dmasem and validating that none of the ranges to be mapped out of
    the file are active for dma. If any are found we wait for page idle
    and retry the scan. The locations where we implement this wait line up
    with where we currently wait for pnfs layout leases to expire.
    
    Since we need both dma to be idle and leases to be broken, and since
    xfs_break_layouts drops locks, we need to retry the dma busy scan until
    we can complete one that finds no busy pages.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
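
Schematically, the retry loop added to the XFS entry points condenses to the
following; this is a restatement of the xfs_file_fallocate()/xfs_ioc_space()
hunks below with the error handling trimmed, not additional code.

retry:
	xfs_ilock(ip, iolock);		/* iolock starts as XFS_DAXDMA_LOCK_SHARED */
	dax_wait_dma(inode->i_mapping, offset, len);

	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	iolock |= XFS_IOLOCK_EXCL;
	error = xfs_break_layouts(inode, &iolock);
	if (error < 0)
		goto out_unlock;
	if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
		/*
		 * A lease was broken and locks were dropped, so DMA may have
		 * restarted against the range; go back and scan it again.
		 */
		xfs_iunlock(ip, iolock);
		iolock = XFS_DAXDMA_LOCK_SHARED;
		goto retry;
	}
	/* no leases and no busy pages remain: safe to modify extents */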

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..e3ec46c28c60 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,7 +347,7 @@ xfs_file_aio_write_checks(
 		return error;
 
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error < 0)
 		return error;
 
 	/*
@@ -762,7 +762,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -771,10 +771,20 @@ xfs_file_fallocate(
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
+retry:
 	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, offset, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
 	error = xfs_break_layouts(inode, &iolock);
-	if (error)
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
+	}
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4ec5b7f45401..783f15894b7b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -171,7 +171,14 @@ xfs_ilock_attr_map_shared(
  * taken in places where we need to invalidate the page cache in a race
  * free manner (e.g. truncate, hole punch and other extent manipulation
  * functions).
- */
+ *
+ * The XFS_DAXDMA_LOCK_SHARED lock is a CONFIG_FS_DAX special case lock
+ * for synchronizing truncate vs ongoing DMA. The get_user_pages() path
+ * will hold this lock exclusively when incrementing page reference
+ * counts for DMA. Before an extent can be truncated we need to complete
+ * a validate-idle sweep of all pages in the range while holding this
+ * lock in shared mode.
+ */
 void
 xfs_ilock(
 	xfs_inode_t		*ip,
@@ -192,6 +199,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_lock_shared(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -328,6 +338,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_unlock_shared(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..0662edf00529 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -283,10 +283,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAXDMA_LOCK_SHARED	(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAXDMA_LOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -294,7 +296,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAXDMA_LOCK_SHARED, "XFS_DAXDMA_LOCK_SHARED" }
 
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..fd384ea00ede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	int			error;
 
 	/*
@@ -637,18 +637,6 @@ xfs_ioc_space(
 	if (filp->f_mode & FMODE_NOCMTIME)
 		flags |= XFS_PREALLOC_INVISIBLE;
 
-	error = mnt_want_write_file(filp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
-	if (error)
-		goto out_unlock;
-
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -659,10 +647,31 @@ xfs_ioc_space(
 		bf->l_start += XFS_ISIZE(ip);
 		break;
 	default:
-		error = -EINVAL;
+		return -EINVAL;
+	}
+
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+
+retry:
+	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, bf->l_start, bf->l_len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
+	error = xfs_break_layouts(inode, &iolock);
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
 	}
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	iolock |= XFS_MMAPLOCK_EXCL;
+
 	/*
 	 * length of <= 0 for resv/unresv/zero is invalid.  length for
 	 * alloc/free is ignored completely and we have no idea what userspace
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..5f4d46b3cd7f 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -35,18 +35,19 @@ xfs_break_layouts(
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
+	int			error, did_unlock = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock = XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	return error < 0 ? error : did_unlock;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-26 23:51         ` Williams, Dan J
  0 siblings, 0 replies; 143+ messages in thread
From: Williams, Dan J @ 2017-10-26 23:51 UTC (permalink / raw)
  To: hch, jack
  Cc: schwidefsky, darrick.wong, dledford, linux-rdma, linux-fsdevel,
	bfields, linux-mm, heiko.carstens, dave.hansen, linux-xfs,
	linux-kernel, jmoyer, viro, kirill.shutemov, akpm, Hefty, Sean,
	linux-nvdimm, jlayton, mawilcox, mhocko, ross.zwisler,
	gerald.schaefer, jgunthorpe, hal.rosenstock, benh, david, mpe,
	paulus

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > > 
> > > How about:
> > > 
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add it to the extent busy list.  We mark the page as delayed
> > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> > 
> > Brainstorming some more:
> > 
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls why not wait for the references for all
> > DAX pages to go away first?  E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount we set
> > a new flag to prevent new references from showing up, and then
> > simply wait for it to go away.  Instead of a busy wait we can
> > do this through a few hashed waitqueues in dev_pagemap.  And in
> > fact put_zone_device_page already gets called when putting the
> > last page so we can handle the wakeup from there.
> > 
> > In fact if we can't find a page flag for the 'stop new callers'
> > thing we could probably come up with a way to do that through
> > dev_pagemap somehow, but I'm not sure how efficient that would
> > be.
> 
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with an extent busy list in ext4
> relatively easily - we already have such a list currently, similarly to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires a per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
> 
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
> 
> I think we cannot easily prevent new page references to be grabbed as you
> write above since nobody expects stuff like get_page() to fail. But I 
> think that unmapping relevant pages and then preventing them from being faulted
> in again is workable and stops GUP as well. The problem with that is though
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?
> 

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty, especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series; do you think the extent-busy approach would be cleaner?
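
Concretely, the drop-and-reacquire interaction described above looks roughly
like the simplified sketch below. Only the i_daxdma_* helpers, the
follow_page_mask() call and the wake-up from generic_dax_pagefree() reflect
the patches that follow; the two wrapper function names are invented for
illustration.

/* get_user_pages() side: the lock is held exclusively around each lookup */
static struct page *gup_pin_one_dax_page(struct inode *inode,
		struct vm_area_struct *vma, unsigned long addr,
		unsigned int foll_flags, unsigned int *page_mask)
{
	struct page *page;

	i_daxdma_lock(inode);		/* down_write(&inode->i_dax_dmasem) */
	page = follow_page_mask(vma, addr, foll_flags, page_mask);
	i_daxdma_unlock(inode);
	return page;
}

/*
 * Truncate side: shared lock, dropped while sleeping so get_user_pages()
 * can finish. In the patch this runs as the wait_on_devmap_idle() action
 * callback, after prepare_to_wait(), so calling schedule() here is safe.
 */
static int wait_for_dax_page_idle(struct inode *inode, struct page *page)
{
	if (page_ref_count(page) == 1)
		return 0;			/* already idle */

	i_daxdma_unlock_shared(inode);		/* let get_user_pages() progress */
	schedule();				/* woken from generic_dax_pagefree() */
	i_daxdma_lock_shared(inode);
	return 1;				/* lock was dropped: rescan the range */
}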

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages
    
    get_user_pages() pins file backed memory pages for access by dma
    devices. However, it only pins the memory pages not the page-to-file
    offset association. If a file is truncated the pages are mapped out of
    the file and dma may continue indefinitely into a page that is owned by
    a device driver. This breaks coherency of the file vs dma, but the
    assumption is that if userspace wants the file-space truncated it does
    not matter what data is inbound from the device, it is not relevant
    anymore.
    
    The assumptions of the truncate-page-cache model are broken by DAX where
    the target DMA page *is* the filesystem block. Leaving the page pinned
    for DMA, but truncating the file block out of the file, means that the
    filesystem is free to reallocate a block under active DMA to another
    file!
    
    Here are some possible options for fixing this situation ('truncate' and
    'fallocate(punch hole)' are synonymous below):
    
        1/ Fail truncate while any file blocks might be under dma
    
        2/ Block (sleep-wait) truncate while any file blocks might be under
           dma
    
        3/ Remap file blocks to a "lost+found"-like file-inode where
           dma can continue and we might see what inbound data from DMA was
           mapped out of the original file. Blocks in this file could be
           freed back to the filesystem when dma eventually ends.
    
        4/ List the blocks under DMA in the extent busy list and either hold
           off commit of the truncate transaction until DMA ends, or otherwise
           keep the blocks marked busy so the allocator does not reuse them
           until DMA completes.
    
        5/ Disable dax until option 3 or another long term solution has been
           implemented. However, filesystem-dax is still marked experimental
           for concerns like this.
    
    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per filesystem enabling to remap blocks from one inode
    to another.  Option 2 is implemented in this patch for the DAX path with
    the expectation that non-transient users of get_user_pages() (RDMA) are
    disallowed from setting up dax mappings and that the potential delay
    introduced to the truncate path is acceptable compared to the response
    time of the page cache case. This can only be seen as a stop-gap until
    we can solve the problem of safely sequestering unallocated filesystem
    blocks under active dma.
    
    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.
    
    Credit for option 3 goes to Dave Hansen, who proposed something similar
    as an alternative way to solve the problem that MAP_DIRECT was trying to
    solve. Credit for option 4 goes to Christoph Hellwig.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index fd5d385988d1..f2c98f9cb833 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -346,6 +346,19 @@ static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -487,6 +500,97 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static int wait_page(atomic_t *_refcount)
+{
+	struct page *page = container_of(_refcount, struct page, _refcount);
+	struct inode *inode = page->inode;
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	i_daxdma_unlock_shared(inode);
+	schedule();
+	i_daxdma_lock_shared(inode);
+
+	/*
+	 * if we bounced the daxdma_lock then we need to rescan the
+	 * truncate area.
+	 */
+	return 1;
+}
+
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t	indices[PAGEVEC_SIZE];
+	pgoff_t	start, end, index;
+	struct pagevec pvec;
+	unsigned i;
+
+	lockdep_assert_held(&inode->i_dax_dmasem);
+
+	if (lstart < 0 || len < -1)
+		return;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!dax_mapping(mapping))
+		return;
+
+	if (mapping->nrexceptional == 0)
+		return;
+
+	if (len == -1)
+		end = -1;
+	else
+		end = (lstart + len) >> PAGE_SHIFT;
+	start = lstart >> PAGE_SHIFT;
+
+retry:
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (page && wait_on_devmap_idle(&page->_refcount,
+						wait_page,
+						TASK_UNINTERRUPTIBLE) != 0) {
+				/*
+				 * We dropped the dma lock, so we need
+				 * to revalidate that previously seen
+				 * idle pages are still idle.
+				 */
+				goto retry;
+			}
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+	}
+}
+EXPORT_SYMBOL_GPL(dax_wait_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -509,8 +613,10 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 out:
 	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..95408e87a96c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -192,6 +192,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 	inode->i_flctx = NULL;
+	i_daxdma_init(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ea21ebfd1889..6ce1c50519e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline void dax_wait_dma(struct address_space *mapping, loff_t lstart,
+		loff_t len)
+{
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..cd5b4a092d1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore	i_dax_dmasem;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -747,6 +750,59 @@ static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 	down_write_nested(&inode->i_rwsem, subclass);
 }
 
+#ifdef CONFIG_FS_DAX
+static inline void i_daxdma_init(struct inode *inode)
+{
+	init_rwsem(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+	down_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+	up_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+	/*
+	 * The write lock is taken under mmap_sem in the
+	 * get_user_pages() path; the read lock nests in the truncate
+	 * path.
+	 */
+#define DAXDMA_TRUNCATE_CLASS 1
+	down_read_nested(&inode->i_dax_dmasem, DAXDMA_TRUNCATE_CLASS);
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+	up_read(&inode->i_dax_dmasem);
+}
+#else /* CONFIG_FS_DAX */
+static inline void i_daxdma_init(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+}
+#endif /* CONFIG_FS_DAX */
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
 
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
 					 unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..fd7b2a2e2d19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -579,6 +579,41 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+static struct inode *do_dax_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct file *file;
+	struct inode *inode;
+
+	if (!(foll_flags & FOLL_GET))
+		return NULL;
+	if (!vma_is_dax(vma))
+		return NULL;
+	file = vma->vm_file;
+	inode = file_inode(file);
+	if (inode->i_mode == S_IFCHR)
+		return NULL;
+	return inode;
+}
+
+static struct inode *dax_truncate_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct inode *inode = do_dax_lock(vma, foll_flags);
+
+	if (!inode)
+		return NULL;
+	i_daxdma_lock(inode);
+	return inode;
+}
+
+static void dax_truncate_unlock(struct inode *inode)
+{
+	if (!inode)
+		return;
+	i_daxdma_unlock(inode);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -659,6 +694,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	do {
 		struct page *page;
+		struct inode *inode;
 		unsigned int foll_flags = gup_flags;
 		unsigned int page_increm;
 
@@ -693,7 +729,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		inode = dax_truncate_lock(vma, foll_flags);
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_truncate_unlock(inode);
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,

commit 67d952314e9989b3b1945c50488f4a0f760264c3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Tue Oct 24 13:41:22 2017 -0700

    xfs: wire up dax dma waiting
    
    The dax-dma vs truncate collision avoidance involves acquiring the new
    i_dax_dmasem and validating that none of the ranges to be mapped out of
    the file are active for dma. If any are found we wait for page idle
    and retry the scan. The locations where we implement this wait line up
    with where we currently wait for pnfs layout leases to expire.
    
    Since we need both dma to be idle and leases to be broken, and since
    xfs_break_layouts drops locks, we need to retry the dma busy scan until
    we can complete one that finds no busy pages.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..e3ec46c28c60 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,7 +347,7 @@ xfs_file_aio_write_checks(
 		return error;
 
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error < 0)
 		return error;
 
 	/*
@@ -762,7 +762,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -771,10 +771,20 @@ xfs_file_fallocate(
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
+retry:
 	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, offset, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
 	error = xfs_break_layouts(inode, &iolock);
-	if (error)
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
+	}
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4ec5b7f45401..783f15894b7b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -171,7 +171,14 @@ xfs_ilock_attr_map_shared(
  * taken in places where we need to invalidate the page cache in a race
  * free manner (e.g. truncate, hole punch and other extent manipulation
  * functions).
- */
+ *
+ * The XFS_DAXDMA_LOCK_SHARED lock is a CONFIG_FS_DAX special case lock
+ * for synchronizing truncate vs ongoing DMA. The get_user_pages() path
+ * will hold this lock exclusively when incrementing page reference
+ * counts for DMA. Before an extent can be truncated we need to complete
+ * a validate-idle sweep of all pages in the range while holding this
+ * lock in shared mode.
+ */
 void
 xfs_ilock(
 	xfs_inode_t		*ip,
@@ -192,6 +199,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_lock_shared(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -328,6 +338,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_unlock_shared(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..0662edf00529 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -283,10 +283,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAXDMA_LOCK_SHARED	(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAXDMA_LOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -294,7 +296,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAXDMA_LOCK_SHARED, "XFS_DAXDMA_LOCK_SHARED" }
 
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..fd384ea00ede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	int			error;
 
 	/*
@@ -637,18 +637,6 @@ xfs_ioc_space(
 	if (filp->f_mode & FMODE_NOCMTIME)
 		flags |= XFS_PREALLOC_INVISIBLE;
 
-	error = mnt_want_write_file(filp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
-	if (error)
-		goto out_unlock;
-
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -659,10 +647,31 @@ xfs_ioc_space(
 		bf->l_start += XFS_ISIZE(ip);
 		break;
 	default:
-		error = -EINVAL;
+		return -EINVAL;
+	}
+
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+
+retry:
+	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, bf->l_start, bf->l_len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
+	error = xfs_break_layouts(inode, &iolock);
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
 	}
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	iolock |= XFS_MMAPLOCK_EXCL;
+
 	/*
 	 * length of <= 0 for resv/unresv/zero is invalid.  length for
 	 * alloc/free is ignored completely and we have no idea what userspace
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..5f4d46b3cd7f 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -35,18 +35,19 @@ xfs_break_layouts(
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
+	int			error, did_unlock = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock = XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	return error < 0 ? error : did_unlock;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-26 23:51         ` Williams, Dan J
  0 siblings, 0 replies; 143+ messages in thread
From: Williams, Dan J @ 2017-10-26 23:51 UTC (permalink / raw)
  To: hch, jack
  Cc: schwidefsky, darrick.wong, dledford, linux-rdma, linux-fsdevel,
	bfields, linux-mm, heiko.carstens, dave.hansen, linux-xfs,
	linux-kernel, jmoyer, viro, kirill.shutemov, akpm, Hefty, Sean,
	linux-nvdimm, jlayton, mawilcox, mhocko, ross.zwisler,
	gerald.schaefer, jgunthorpe, hal.rosenstock, benh, david, mpe,
	paulus

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > > 
> > > How about:
> > > 
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add it to the extent busy list.  We mark the page as delayed
> > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> > 
> > Brainstorming some more:
> > 
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls why not wait for the references for all
> > DAX pages to go away first?  E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount we set
> > a new flag to prevent new references from showing up, and then
> > simply wait for it to go away.  Instead of a busy wait we can
> > do this through a few hashed waitqueues in dev_pagemap.  And in
> > fact put_zone_device_page already gets called when putting the
> > last page so we can handle the wakeup from there.
> > 
> > In fact if we can't find a page flag for the 'stop new callers'
> > thing we could probably come up with a way to do that through
> > dev_pagemap somehow, but I'm not sure how efficient that would
> > be.
> 
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with an extent busy list in ext4
> relatively easily - we already have such a list currently, similarly to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires a per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
> 
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
> 
> I think we cannot easily prevent new page references to be grabbed as you
> write above since nobody expects stuff like get_page() to fail. But I 
> think that unmapping relevant pages and then preventing them from being faulted
> in again is workable and stops GUP as well. The problem with that is though
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?
> 

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty, especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series; do you think the extent-busy approach would be cleaner?

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages
    
    get_user_pages() pins file backed memory pages for access by dma
    devices. However, it only pins the memory pages not the page-to-file
    offset association. If a file is truncated the pages are mapped out of
    the file and dma may continue indefinitely into a page that is owned by
    a device driver. This breaks coherency of the file vs dma, but the
    assumption is that if userspace wants the file-space truncated it does
    not matter what data is inbound from the device, it is not relevant
    anymore.
    
    The assumptions of the truncate-page-cache model are broken by DAX where
    the target DMA page *is* the filesystem block. Leaving the page pinned
    for DMA, but truncating the file block out of the file, means that the
    filesystem is free to reallocate a block under active DMA to another
    file!
    
    Here are some possible options for fixing this situation ('truncate' and
    'fallocate(punch hole)' are synonymous below):
    
        1/ Fail truncate while any file blocks might be under dma
    
        2/ Block (sleep-wait) truncate while any file blocks might be under
           dma
    
        3/ Remap file blocks to a "lost+found"-like file-inode where
           dma can continue and we might see what inbound data from DMA was
           mapped out of the original file. Blocks in this file could be
           freed back to the filesystem when dma eventually ends.
    
        4/ List the blocks under DMA in the extent busy list and either hold
           off commit of the truncate transaction until DMA ends, or otherwise
           keep the blocks marked busy so the allocator does not reuse them
           until DMA completes.
    
        5/ Disable dax until option 3 or another long term solution has been
           implemented. However, filesystem-dax is still marked experimental
           for concerns like this.
    
    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per-filesystem enabling to remap blocks from one inode
    to another.  Option 2 is implemented in this patch for the DAX path with
    the expectation that non-transient users of get_user_pages() (RDMA) are
    disallowed from setting up dax mappings and that the potential delay
    introduced to the truncate path is acceptable compared to the response
    time of the page cache case. This can only be seen as a stop-gap until
    we can solve the problem of safely sequestering unallocated filesystem
    blocks under active dma.
    
    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.
    
    Credit for option 3 goes to Dave Hansen, who proposed something similar
    as an alternative way to solve the problem that MAP_DIRECT was trying to
    solve. Credit for option 4 goes to Christoph Hellwig.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
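
(As a quick orientation before the diffs, here is a condensed sketch of how
a filesystem hole-punch path is expected to use the new primitives.  The
'foofs' caller and the elided block-freeing step are hypothetical; only the
i_daxdma_*() helpers and dax_wait_dma() come from this patch, and the real
XFS wiring is in the second patch below.)

#include <linux/dax.h>
#include <linux/fs.h>

static int foofs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	/*
	 * get_user_pages() holds i_dax_dmasem exclusively while taking
	 * page references; truncate/hole-punch takes it shared here.
	 */
	i_daxdma_lock_shared(inode);

	/*
	 * Sleep until every page in the range is DMA-idle.  The wait may
	 * drop and retake the shared lock, in which case the range is
	 * rescanned before returning.
	 */
	dax_wait_dma(inode->i_mapping, offset, len);

	/*
	 * ... now unmap, invalidate and free the blocks under the
	 * filesystem's own mmap/io locks ...
	 */

	i_daxdma_unlock_shared(inode);
	return 0;
}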

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4ac359e14777..a5a4b95ffdaf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,6 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
+	wake_up_devmap_idle(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/fs/dax.c b/fs/dax.c
index fd5d385988d1..f2c98f9cb833 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -346,6 +346,19 @@ static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
 	}
 }
 
+static struct page *dma_busy_page(void *entry)
+{
+	unsigned long pfn, end_pfn;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (page_ref_count(page) > 1)
+			return page;
+	}
+	return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -487,6 +500,97 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 	return entry;
 }
 
+static int wait_page(atomic_t *_refcount)
+{
+	struct page *page = container_of(_refcount, struct page, _refcount);
+	struct inode *inode = page->inode;
+
+	if (page_ref_count(page) == 1)
+		return 0;
+
+	i_daxdma_unlock_shared(inode);
+	schedule();
+	i_daxdma_lock_shared(inode);
+
+	/*
+	 * if we bounced the daxdma_lock then we need to rescan the
+	 * truncate area.
+	 */
+	return 1;
+}
+
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t	indices[PAGEVEC_SIZE];
+	pgoff_t	start, end, index;
+	struct pagevec pvec;
+	unsigned i;
+
+	lockdep_assert_held(&inode->i_dax_dmasem);
+
+	if (lstart < 0 || len < -1)
+		return;
+
+	/* in the limited case get_user_pages for dax is disabled */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!dax_mapping(mapping))
+		return;
+
+	if (mapping->nrexceptional == 0)
+		return;
+
+	if (len == -1)
+		end = -1;
+	else
+		end = (lstart + len) >> PAGE_SHIFT;
+	start = lstart >> PAGE_SHIFT;
+
+retry:
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+				min(end - index, (pgoff_t)PAGEVEC_SIZE),
+				indices)) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *pvec_ent = pvec.pages[i];
+			struct page *page = NULL;
+			void *entry;
+
+			index = indices[i];
+			if (index >= end)
+				break;
+
+			if (!radix_tree_exceptional_entry(pvec_ent))
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			entry = get_unlocked_mapping_entry(mapping, index, NULL);
+			if (entry)
+				page = dma_busy_page(entry);
+			put_unlocked_mapping_entry(mapping, index, entry);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			if (page && wait_on_devmap_idle(&page->_refcount,
+						wait_page,
+						TASK_UNINTERRUPTIBLE) != 0) {
+				/*
+				 * We dropped the dma lock, so we need
+				 * to revalidate that previously seen
+				 * idle pages are still idle.
+				 */
+				goto retry;
+			}
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		index++;
+	}
+}
+EXPORT_SYMBOL_GPL(dax_wait_dma);
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
@@ -509,8 +613,10 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 out:
 	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+
 	return ret;
 }
+
 /*
  * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
  * entry to get unlocked before deleting it.
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b53bb23..95408e87a96c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -192,6 +192,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 	inode->i_flctx = NULL;
+	i_daxdma_init(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ea21ebfd1889..6ce1c50519e7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
+void dax_wait_dma(struct address_space *mapping, loff_t lstart, loff_t len);
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
 #else
+static inline void dax_wait_dma(struct address_space *mapping, loff_t lstart,
+		loff_t len)
+{
+}
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..cd5b4a092d1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_FS_DAX
+	struct rw_semaphore	i_dax_dmasem;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -747,6 +750,59 @@ static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 	down_write_nested(&inode->i_rwsem, subclass);
 }
 
+#ifdef CONFIG_FS_DAX
+static inline void i_daxdma_init(struct inode *inode)
+{
+	init_rwsem(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+	down_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+	up_write(&inode->i_dax_dmasem);
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+	/*
+	 * The write lock is taken under mmap_sem in the
+	 * get_user_pages() path; the read lock nests in the truncate
+	 * path.
+	 */
+#define DAXDMA_TRUNCATE_CLASS 1
+	down_read_nested(&inode->i_dax_dmasem, DAXDMA_TRUNCATE_CLASS);
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+	up_read(&inode->i_dax_dmasem);
+}
+#else /* CONFIG_FS_DAX */
+static inline void i_daxdma_init(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_lock_shared(struct inode *inode)
+{
+}
+
+static inline void i_daxdma_unlock_shared(struct inode *inode)
+{
+}
+#endif /* CONFIG_FS_DAX */
+
 void lock_two_nondirectories(struct inode *, struct inode*);
 void unlock_two_nondirectories(struct inode *, struct inode*);
 
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 12b26660d7e9..6186ecdb9df7 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -30,10 +30,12 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+void wake_up_devmap_idle(atomic_t *p);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
+int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -258,4 +260,12 @@ int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
 	return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_devmap_idle(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
+{
+	might_sleep();
+	if (atomic_read(val) == 1)
+		return 0;
+	return out_of_line_wait_on_devmap_idle(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index f8159698aa4d..6ea93149614a 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,11 +162,17 @@ static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
 	return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
-				  void *arg)
+static inline struct wait_bit_queue_entry *to_wait_bit_q(
+		struct wait_queue_entry *wq_entry)
+{
+	return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
 {
 	struct wait_bit_key *key = arg;
-	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
 	atomic_t *val = key->flags;
 
 	if (wait_bit->key.flags != key->flags ||
@@ -176,14 +182,29 @@ static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mo
 	return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_devmap_idle_function(struct wait_queue_entry *wq_entry,
+		unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 1)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
  * the actions of __wait_on_atomic_t() are permitted return codes.  Nonzero
  * return codes halt waiting and return.
  */
 static __sched
-int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
-		       int (*action)(atomic_t *), unsigned mode)
+int __wait_on_atomic_t(struct wait_queue_head *wq_head,
+		struct wait_bit_queue_entry *wbq_entry,
+		int (*action)(atomic_t *), unsigned mode, int target)
 {
 	atomic_t *val;
 	int ret = 0;
@@ -191,10 +212,10 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 	do {
 		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
 		val = wbq_entry->key.flags;
-		if (atomic_read(val) == 0)
+		if (atomic_read(val) == target)
 			break;
 		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
+	} while (!ret && atomic_read(val) != target);
 	finish_wait(wq_head, &wbq_entry->wq_entry);
 	return ret;
 }
@@ -210,16 +231,37 @@ int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_en
 		},							\
 	}
 
+#define DEFINE_WAIT_DEVMAP_IDLE(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_devmap_idle_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
 __sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
 					 unsigned mode)
 {
 	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
 	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
 
-	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 0);
 }
 EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
 
+__sched int out_of_line_wait_on_devmap_idle(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_DEVMAP_IDLE(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode, 1);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_devmap_idle);
+
 /**
  * wake_up_atomic_t - Wake up a waiter on a atomic_t
  * @p: The atomic_t being waited on, a kernel virtual address
@@ -235,6 +277,12 @@ void wake_up_atomic_t(atomic_t *p)
 }
 EXPORT_SYMBOL(wake_up_atomic_t);
 
+void wake_up_devmap_idle(atomic_t *p)
+{
+	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
+}
+EXPORT_SYMBOL(wake_up_devmap_idle);
+
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
 	schedule();
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..fd7b2a2e2d19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -579,6 +579,41 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	return 0;
 }
 
+static struct inode *do_dax_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct file *file;
+	struct inode *inode;
+
+	if (!(foll_flags & FOLL_GET))
+		return NULL;
+	if (!vma_is_dax(vma))
+		return NULL;
+	file = vma->vm_file;
+	inode = file_inode(file);
+	if (inode->i_mode == S_IFCHR)
+		return NULL;
+	return inode;
+}
+
+static struct inode *dax_truncate_lock(struct vm_area_struct *vma,
+		unsigned int foll_flags)
+{
+	struct inode *inode = do_dax_lock(vma, foll_flags);
+
+	if (!inode)
+		return NULL;
+	i_daxdma_lock(inode);
+	return inode;
+}
+
+static void dax_truncate_unlock(struct inode *inode)
+{
+	if (!inode)
+		return;
+	i_daxdma_unlock(inode);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -659,6 +694,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	do {
 		struct page *page;
+		struct inode *inode;
 		unsigned int foll_flags = gup_flags;
 		unsigned int page_increm;
 
@@ -693,7 +729,9 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		inode = dax_truncate_lock(vma, foll_flags);
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
+		dax_truncate_unlock(inode);
 		if (!page) {
 			int ret;
 			ret = faultin_page(tsk, vma, start, &foll_flags,

commit 67d952314e9989b3b1945c50488f4a0f760264c3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Tue Oct 24 13:41:22 2017 -0700

    xfs: wire up dax dma waiting
    
    The dax-dma vs truncate collision avoidance involves acquiring the new
    i_dax_dmasem and validating that no ranges that are to be mapped out of
    the file are active for dma. If any are found we wait for page idle
    and retry the scan. The locations where we implement this wait line up
    with where we currently wait for pnfs layout leases to expire.
    
    Since we need both dma to be idle and leases to be broken, and since
    xfs_break_layouts drops locks, we need to retry the dma busy scan until
    we can complete one that finds no busy pages.
    
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6780743f8ec..e3ec46c28c60 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -347,7 +347,7 @@ xfs_file_aio_write_checks(
 		return error;
 
 	error = xfs_break_layouts(inode, iolock);
-	if (error)
+	if (error < 0)
 		return error;
 
 	/*
@@ -762,7 +762,7 @@ xfs_file_fallocate(
 	struct xfs_inode	*ip = XFS_I(inode);
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	loff_t			new_size = 0;
 	bool			do_file_insert = 0;
 
@@ -771,10 +771,20 @@ xfs_file_fallocate(
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
 
+retry:
 	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, offset, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
 	error = xfs_break_layouts(inode, &iolock);
-	if (error)
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
+	}
 
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4ec5b7f45401..783f15894b7b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -171,7 +171,14 @@ xfs_ilock_attr_map_shared(
  * taken in places where we need to invalidate the page cache in a race
  * free manner (e.g. truncate, hole punch and other extent manipulation
  * functions).
- */
+ *
+ * The XFS_DAXDMA_LOCK_SHARED lock is a CONFIG_FS_DAX special case lock
+ * for synchronizing truncate vs ongoing DMA. The get_user_pages() path
+ * will hold this lock exclusively when incrementing page reference
+ * counts for DMA. Before an extent can be truncated we need to complete
+ * a validate-idle sweep of all pages in the range while holding this
+ * lock in shared mode.
+ */
 void
 xfs_ilock(
 	xfs_inode_t		*ip,
@@ -192,6 +199,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_lock_shared(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -328,6 +338,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAXDMA_LOCK_SHARED)
+		i_daxdma_unlock_shared(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..0662edf00529 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -283,10 +283,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAXDMA_LOCK_SHARED	(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAXDMA_LOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -294,7 +296,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAXDMA_LOCK_SHARED, "XFS_DAXDMA_LOCK_SHARED" }
 
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..fd384ea00ede 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -612,7 +612,7 @@ xfs_ioc_space(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct iattr		iattr;
 	enum xfs_prealloc_flags	flags = 0;
-	uint			iolock = XFS_IOLOCK_EXCL;
+	uint			iolock = XFS_DAXDMA_LOCK_SHARED;
 	int			error;
 
 	/*
@@ -637,18 +637,6 @@ xfs_ioc_space(
 	if (filp->f_mode & FMODE_NOCMTIME)
 		flags |= XFS_PREALLOC_INVISIBLE;
 
-	error = mnt_want_write_file(filp);
-	if (error)
-		return error;
-
-	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock);
-	if (error)
-		goto out_unlock;
-
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-	iolock |= XFS_MMAPLOCK_EXCL;
-
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -659,10 +647,31 @@ xfs_ioc_space(
 		bf->l_start += XFS_ISIZE(ip);
 		break;
 	default:
-		error = -EINVAL;
+		return -EINVAL;
+	}
+
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+
+retry:
+	xfs_ilock(ip, iolock);
+	dax_wait_dma(inode->i_mapping, bf->l_start, bf->l_len);
+
+	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	iolock |= XFS_IOLOCK_EXCL;
+	error = xfs_break_layouts(inode, &iolock);
+	if (error < 0)
 		goto out_unlock;
+	else if (error > 0 && IS_ENABLED(CONFIG_FS_DAX)) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_DAXDMA_LOCK_SHARED;
+		goto retry;
 	}
 
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	iolock |= XFS_MMAPLOCK_EXCL;
+
 	/*
 	 * length of <= 0 for resv/unresv/zero is invalid.  length for
 	 * alloc/free is ignored completely and we have no idea what userspace
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..5f4d46b3cd7f 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -35,18 +35,19 @@ xfs_break_layouts(
 	uint			*iolock)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
+	int			error, did_unlock = 0;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
 		xfs_iunlock(ip, *iolock);
+		did_unlock = 1;
 		error = break_layout(inode, true);
 		*iolock = XFS_IOLOCK_EXCL;
 		xfs_ilock(ip, *iolock);
 	}
 
-	return error;
+	return error < 0 ? error : did_unlock;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 10:58       ` Jan Kara
  (?)
@ 2017-10-27  6:45         ` Christoph Hellwig
  -1 siblings, 0 replies; 143+ messages in thread
From: Christoph Hellwig @ 2017-10-27  6:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dan Williams, akpm, Michal Hocko,
	Benjamin Herrenschmidt, Dave Hansen, Dave Chinner,
	J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty,
	Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman,
	Jeff Moyer, Jason Gunthorpe, Doug Ledford, Ross Zwisler,
	Hal Rosenstock, Heiko Carstens, linux-nvdimm, Alexander Viro,
	Gerald Schaefer, Darrick J. Wong, linux-kernel, linux-xfs,
	Martin Schwidefsky, linux-fsdevel, Kirill A. Shutemov

On Thu, Oct 26, 2017 at 12:58:50PM +0200, Jan Kara wrote:
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.

The point is that we need to prohibit long term elevated page counts
with DAX anyway - we can't just let people grab allocated blocks forever
while ignoring file system operations.  For stage 1 we'll just need to
fail those, and in the long run they will have to use a mechanism
similar to FL_LAYOUT locks to deal with file system allocation changes.
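
For illustration, failing such pins could look roughly like the check
below, called before a driver takes a long-lived reference on user pages
(the helper name, call site and errno are assumptions, not something
proposed in this thread):

#include <linux/fs.h>
#include <linux/mm.h>

static int reject_longterm_fsdax_pin(struct vm_area_struct *vma)
{
	if (!vma_is_dax(vma))
		return 0;
	/* device-dax (a char device inode) is not subject to truncate */
	if (S_ISCHR(file_inode(vma->vm_file)->i_mode))
		return 0;
	/* filesystem-dax: refuse indefinite page pins for now */
	return -EOPNOTSUPP;
}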


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 23:51         ` Williams, Dan J
                             ` (2 preceding siblings ...)
  (?)
@ 2017-10-27  6:48           ` Dave Chinner
  -1 siblings, 0 replies; 143+ messages in thread
From: Dave Chinner @ 2017-10-27  6:48 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: mhocko, jack, benh, dave.hansen, heiko.carstens, bfields,
	linux-mm, paulus, Hefty, Sean, jlayton, mawilcox, linux-rdma,
	mpe, dledford, hch, jgunthorpe, hal.rosenstock, schwidefsky,
	viro, gerald.schaefer, linux-nvdimm, darrick.wong, linux-kernel,
	linux-xfs, linux-fsdevel, akpm, kirill.shutemov

On Thu, Oct 26, 2017 at 11:51:04PM +0000, Williams, Dan J wrote:
> On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > > I'd like to brainstorm how we can do something better.
> > > > 
> > > > How about:
> > > > 
> > > > If we hit a page with an elevated refcount in truncate / hole puch
> > > > etc for a DAX file system we do not free the blocks in the file system,
> > > > but add it to the extent busy list.  We mark the page as delayed
> > > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > > call back into the file system to remove it from the busy list.
> > > 
> > > Brainstorming some more:
> > > 
> > > Given that on a DAX file there shouldn't be any long-term page
> > > references after we unmap it from the page table and don't allow
> > > get_user_pages calls why not wait for the references for all
> > > DAX pages to go away first?  E.g. if we find a DAX page in
> > > truncate_inode_pages_range that has an elevated refcount we set
> > > a new flag to prevent new references from showing up, and then
> > > simply wait for it to go away.  Instead of a busy way we can
> > > do this through a few hashed waitqueued in dev_pagemap.  And in
> > > fact put_zone_device_page already gets called when putting the
> > > last page so we can handle the wakeup from there.
> > > 
> > > In fact if we can't find a page flag for the stop new callers
> > > things we could probably come up with a way to do that through
> > > dev_pagemap somehow, but I'm not sure how efficient that would
> > > be.
> > 
> > We were talking about this yesterday with Dan so some more brainstorming
> > from us. We can implement the solution with extent busy list in ext4
> > relatively easily - we already have such list currently similarly to XFS.
> > There would be some modifications needed but nothing too complex. The
> > biggest downside of this solution I see is that it requires per-filesystem
> > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> > may have problems and ext2 definitely will need some modifications.
> > Invisible used blocks may be surprising to users at times although given
> > page refs should be relatively short term, that should not be a big issue.
> > But are we guaranteed page refs are short term? E.g. if someone creates
> > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> > can be rather long-term similarly as in RDMA case. Also freeing of blocks
> > on page reference drop is another async entry point into the filesystem
> > which could unpleasantly surprise us but I guess workqueues would solve
> > that reasonably fine.
> > 
> > WRT waiting for page refs to be dropped before proceeding with truncate (or
> > punch hole for that matter - that case is even nastier since we don't have
> > i_size to guard us). What I like about this solution is that it is very
> > visible there's something unusual going on with the file being truncated /
> > punched and so problems are easier to diagnose / fix from the admin side.
> > So far we have guarded hole punching from concurrent faults (and
> > get_user_pages() does fault once you do unmap_mapping_range()) with
> > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> > obvious case Dan came up with is when GUP obtains ref to page A, then hole
> > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> > dropped, and then GUP blocks on trying to fault in another page.
> > 
> > I think we cannot easily prevent new page references to be grabbed as you
> > write above since nobody expects stuff like get_page() to fail. But I 
> > think that unmapping relevant pages and then preventing them to be faulted
> > in again is workable and stops GUP as well. The problem with that is though
> > what to do with page faults to such pages - you cannot just fail them for
> > hole punch, and you cannot easily allocate new blocks either. So we are
> > back at a situation where we need to detach blocks from the inode and then
> > wait for page refs to be dropped - so some form of busy extents. Am I
> > missing something?
> > 
> 
> No, that's a good summary of what we talked about. However, I did go
> back and give the new lock approach a try and was able to get my test
> to pass. The new locking is not pretty especially since you need to
> drop and reacquire the lock so that get_user_pages() can finish
> grabbing all the pages it needs. Here are the two primary patches in
> the series, do you think the extent-busy approach would be cleaner?

The XFS_DAXDMA.... 

$DEITY that patch is so ugly I can't even bring myself to type it.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-27  6:48           ` Dave Chinner
  (?)
@ 2017-10-27 11:42             ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-27 11:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: mhocko, jack, benh, dave.hansen, heiko.carstens, bfields,
	linux-mm, paulus, Hefty, Sean, hch, mawilcox, linux-rdma, mpe,
	jgunthorpe, dledford, hal.rosenstock, linux-nvdimm, viro,
	jlayton, gerald.schaefer, darrick.wong, linux-kernel, linux-xfs,
	schwidefsky, linux-fsdevel, akpm, kirill.shutemov

[replying from my phone, please forgive formatting]

On Friday, October 27, 2017, Dave Chinner <david@fromorbit.com> wrote:


> > Here are the two primary patches in
> > the series, do you think the extent-busy approach would be cleaner?
>
> The XFS_DAXDMA....
>
> $DEITY that patch is so ugly I can't even bring myself to type it.


Right, and so is the problem it's trying to solve. So where do you want to
go from here?

I could go back to the FL_ALLOCATED approach, but use page idle callbacks
instead of polling for the lease end notification. Or do we want to try
busy extents? My concern with busy extents is that it requires more per-fs
code.
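
For reference, a minimal sketch of the page-count polling the lease-based
approach relies on - illustrative only; the helper name and the assumption
that an idle DAX page sits at a reference count of one are not from the
series:

/*
 * Illustrative sketch, not from the series: poll until the page's
 * reference count drops back to its assumed idle baseline of one.
 * A page idle callback would replace this loop with a wakeup issued
 * by whoever drops the final reference.
 */
static void dax_poll_page_idle(struct page *page)
{
	while (page_count(page) > 1)
		schedule_timeout_interruptible(HZ / 10);
}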

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-27 11:42             ` Dan Williams
  (?)
@ 2017-10-29 21:52               ` Dave Chinner
  -1 siblings, 0 replies; 143+ messages in thread
From: Dave Chinner @ 2017-10-29 21:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: mhocko, jack, benh, dave.hansen, heiko.carstens, bfields,
	linux-mm, paulus, Hefty, Sean, jlayton, mawilcox, linux-rdma,
	mpe, dledford, hch, jgunthorpe, hal.rosenstock, schwidefsky,
	viro, gerald.schaefer, linux-nvdimm, darrick.wong, linux-kernel,
	linux-xfs, linux-fsdevel, akpm, kirill.shutemov

On Fri, Oct 27, 2017 at 01:42:16PM +0200, Dan Williams wrote:
> [replying from my phone, please forgive formatting]
> 
> On Friday, October 27, 2017, Dave Chinner <david@fromorbit.com> wrote:
> 
> 
> > > Here are the two primary patches in
> > > the series, do you think the extent-busy approach would be cleaner?
> >
> > The XFS_DAXDMA....
> >
> > $DEITY that patch is so ugly I can't even bring myself to type it.
> 
> 
> Right, and so is the problem it's trying to solve. So where do you want to
> go from here?
> 
> I could go back to the FL_ALLOCATED approach, but use page idle callbacks
> instead of polling for the lease end notification. Or do we want to try
> busy extents? My concern with busy extents is that it requires more per-fs
> code.

I don't care if it takes more per-fs code to solve the problem -
dumping butt-ugly, nasty locking crap into filesystems that
filesystem developers are completely unable to test is about the
worst possible solution you can come up with.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-26 10:58       ` Jan Kara
  (?)
@ 2017-10-29 23:46         ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-29 23:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Michal Hocko, Benjamin Herrenschmidt,
	Dave Hansen, Heiko Carstens, J. Bruce Fields, linux-mm,
	Paul Mackerras, Sean Hefty, Jeff Layton, Matthew Wilcox,
	linux-rdma, Michael Ellerman, Jason Gunthorpe, Doug Ledford,
	Hal Rosenstock, Dave Chinner, linux-fsdevel, Alexander Viro,
	Gerald Schaefer, linux-nvdimm, Linux Kernel Mailing List,
	linux-xfs, Martin Schwidefsky, Andrew Morton, Darrick J. Wong,
	Kirill A. Shutemov

On Thu, Oct 26, 2017 at 3:58 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
>> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
>> > I'd like to brainstorm how we can do something better.
>> >
>> > How about:
>> >
>> > If we hit a page with an elevated refcount in truncate / hole punch
>> > etc for a DAX file system we do not free the blocks in the file system,
>> > but add it to the extent busy list.  We mark the page as delayed
>> > free (e.g. page flag?) so that when it finally hits refcount zero we
>> > call back into the file system to remove it from the busy list.
>>
>> Brainstorming some more:
>>
>> Given that on a DAX file there shouldn't be any long-term page
>> references after we unmap it from the page table and don't allow
>> get_user_pages calls why not wait for the references for all
>> DAX pages to go away first?  E.g. if we find a DAX page in
>> truncate_inode_pages_range that has an elevated refcount we set
>> a new flag to prevent new references from showing up, and then
>> simply wait for it to go away.  Instead of a busy wait we can
>> do this through a few hashed waitqueues in dev_pagemap.  And in
>> fact put_zone_device_page already gets called when putting the
>> last page so we can handle the wakeup from there.
>>
>> In fact if we can't find a page flag for the 'stop new callers'
>> thing we could probably come up with a way to do that through
>> dev_pagemap somehow, but I'm not sure how efficient that would
>> be.
>
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with extent busy list in ext4
> relatively easily - we already have such list currently similarly to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates
> v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> can be rather long-term similarly as in RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
>
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
>
> I think we cannot easily prevent new page references from being grabbed as you
> write above since nobody expects stuff like get_page() to fail. But I
> think that unmapping relevant pages and then preventing them from being faulted
> in again is workable and stops GUP as well. The problem with that is though
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?

Coming back to this since Dave has made clear that new locking to
coordinate get_user_pages() is a no-go.

We can unmap to force new get_user_pages() attempts to block on the
per-fs mmap lock, but if punch-hole finds any elevated pages it needs
to drop the mmap lock and wait. We need this lock dropped to get
around the problem that the driver will not start to drop page
references until it has elevated the page references on all the pages
in the I/O. If we need to drop the mmap lock that makes it impossible
to coordinate this unlock/retry loop within truncate_inode_pages_range
which would otherwise be the natural place to land this code.

Would it be palatable to unmap and drain dma in any path that needs to
detach blocks from an inode? Something like the following, which builds
on what dax_wait_dma() tried to achieve but does not introduce a new lock
for the fs to manage:

retry:
    per_fs_mmap_lock(inode);
    /* new page references cannot be established */
    unmap_mapping_range(mapping, start, end);
    if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
        /* new page references can happen, so we need to start over */
        per_fs_mmap_unlock(inode);
        wait_for_page_idle(dax_page);
        goto retry;
    }
    truncate_inode_pages_range(mapping, start, end);
    per_fs_mmap_unlock(inode);

Given how far away taking the mmap lock occurs from where we know we
are actually performing a punch-hole operation this may lead to
unnecessary unmapping and dma flushing.

As far as I can see the extent busy mechanism does not simplify the
solution. If we have code to wait for the pages to go idle, we might as
well have the truncate/punch-hole wait on that event in the first
instance.
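
To make the dax_dma_busy_page() check above concrete, a minimal sketch
under stated assumptions - dax_get_page() is an assumed helper for
resolving a mapping offset to its backing struct page, and an idle DAX
page is assumed to sit at a reference count of one:

/*
 * Minimal sketch, not from the series: scan [start, end) of the mapping
 * and return the first DAX page whose reference count is still elevated,
 * i.e. one that a get_user_pages() caller / DMA user is still holding.
 * dax_get_page() is an assumed helper, not an existing kernel API.
 */
struct page *dax_dma_busy_page(struct address_space *mapping,
			       pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index < end; index++) {
		struct page *page = dax_get_page(mapping, index);

		if (page && page_count(page) > 1)
			return page;
	}
	return NULL;
}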

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-29 23:46         ` Dan Williams
  (?)
  (?)
@ 2017-10-30  2:00           ` Dave Chinner
  -1 siblings, 0 replies; 143+ messages in thread
From: Dave Chinner @ 2017-10-30  2:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Jeff Layton, Sean Hefty, Matthew Wilcox, linux-rdma,
	Michael Ellerman, Christoph Hellwig, Jason Gunthorpe,
	Doug Ledford, Hal Rosenstock, Martin Schwidefsky, Alexander Viro,
	Gerald Schaefer, linux-nvdimm, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Andrew Morton, Darrick J. Wong,
	Kirill A. Shutemov

On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
> On Thu, Oct 26, 2017 at 3:58 AM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> >> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> >> > I'd like to brainstorm how we can do something better.
> >> >
> >> > How about:
> >> >
> >> > If we hit a page with an elevated refcount in truncate / hole punch
> >> > etc for a DAX file system we do not free the blocks in the file system,
> >> > but add it to the extent busy list.  We mark the page as delayed
> >> > free (e.g. page flag?) so that when it finally hits refcount zero we
> >> > call back into the file system to remove it from the busy list.
> >>
> >> Brainstorming some more:
> >>
> >> Given that on a DAX file there shouldn't be any long-term page
> >> references after we unmap it from the page table and don't allow
> >> get_user_pages calls why not wait for the references for all
> >> DAX pages to go away first?  E.g. if we find a DAX page in
> >> truncate_inode_pages_range that has an elevated refcount we set
> >> a new flag to prevent new references from showing up, and then
> >> simply wait for it to go away.  Instead of a busy wait we can
> >> do this through a few hashed waitqueues in dev_pagemap.  And in
> >> fact put_zone_device_page already gets called when putting the
> >> last page so we can handle the wakeup from there.
> >>
> >> In fact if we can't find a page flag for the 'stop new callers'
> >> thing we could probably come up with a way to do that through
> >> dev_pagemap somehow, but I'm not sure how efficient that would
> >> be.
> >
> > We were talking about this yesterday with Dan so some more brainstorming
> > from us. We can implement the solution with extent busy list in ext4
> > relatively easily - we already have such list currently similarly to XFS.
> > There would be some modifications needed but nothing too complex. The
> > biggest downside of this solution I see is that it requires per-filesystem
> > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> > may have problems and ext2 definitely will need some modifications.
> > Invisible used blocks may be surprising to users at times although given
> > page refs should be relatively short term, that should not be a big issue.
> > But are we guaranteed page refs are short term? E.g. if someone creates
> > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> > can be rather long-term similarly as in RDMA case. Also freeing of blocks
> > on page reference drop is another async entry point into the filesystem
> > which could unpleasantly surprise us but I guess workqueues would solve
> > that reasonably fine.
> >
> > WRT waiting for page refs to be dropped before proceeding with truncate (or
> > punch hole for that matter - that case is even nastier since we don't have
> > i_size to guard us). What I like about this solution is that it is very
> > visible there's something unusual going on with the file being truncated /
> > punched and so problems are easier to diagnose / fix from the admin side.
> > So far we have guarded hole punching from concurrent faults (and
> > get_user_pages() does fault once you do unmap_mapping_range()) with
> > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> > obvious case Dan came up with is when GUP obtains ref to page A, then hole
> > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> > dropped, and then GUP blocks on trying to fault in another page.
> >
> > I think we cannot easily prevent new page references from being grabbed as you
> > write above since nobody expects stuff like get_page() to fail. But I
> > think that unmapping relevant pages and then preventing them from being faulted
> > in again is workable and stops GUP as well. The problem with that is though
> > what to do with page faults to such pages - you cannot just fail them for
> > hole punch, and you cannot easily allocate new blocks either. So we are
> > back at a situation where we need to detach blocks from the inode and then
> > wait for page refs to be dropped - so some form of busy extents. Am I
> > missing something?
> 
> Coming back to this since Dave has made clear that new locking to
> coordinate get_user_pages() is a no-go.
> 
> We can unmap to force new get_user_pages() attempts to block on the
> per-fs mmap lock, but if punch-hole finds any elevated pages it needs
> to drop the mmap lock and wait. We need this lock dropped to get
> around the problem that the driver will not start to drop page
> references until it has elevated the page references on all the pages
> in the I/O. If we need to drop the mmap lock that makes it impossible
> to coordinate this unlock/retry loop within truncate_inode_pages_range
> which would otherwise be the natural place to land this code.
> 
> Would it be palatable to unmap and drain dma in any path that needs to
> detach blocks from an inode? Something like the following, which builds
> on what dax_wait_dma() tried to achieve but does not introduce a new lock
> for the fs to manage:
> 
> retry:
>     per_fs_mmap_lock(inode);
>     /* new page references cannot be established */
>     unmap_mapping_range(mapping, start, end);
>     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>         /* new page references can happen, so we need to start over */
>         per_fs_mmap_unlock(inode);
>         wait_for_page_idle(dax_page);
>         goto retry;
>     }
>     truncate_inode_pages_range(mapping, start, end);
>     per_fs_mmap_unlock(inode);

These retry loops you keep proposing are just bloody horrible.  They
are basically just a method for blocking an operation until whatever
condition is preventing the invalidation goes away. IMO, that's an
ugly solution no matter how much lipstick you dress it up with.

i.e. the blocking loops mean the user process is going to be blocked
for arbitrary lengths of time. That's not a solution, it's just
passing the buck - now the userspace developers need to work around
truncate/hole punch being randomly blocked for arbitrary lengths of
time.

The whole point of pushing this into the busy extent list is that it
doesn't require blocking operations. i.e. the re-use of the underlying
storage is simply delayed until notification that it is safe to
re-use comes along, but the extent removal operation doesn't get
blocked.

That's how we treat extents that require discard operations after
they have been freed - they remain in the busy list until the
discard IO completion signals "all done" and clears the busy extent.
Here we need to hold off clearing the extent until we get the "all
done" from the dax code.

e.g. what needs to happen when trying to do the invalidation is
something like this (assuming invalidate_inode_pages2_range() will
actually fail on pages under DMA):

	flags = 0;
	if (IS_DAX()) {
		error = invalidate_inode_pages2_range()
		if (error == -EBUSY && dax_dma_busy_page())
			flags = EXTENT_BUSY_DAX;
		else
			truncate_pagecache(); /* blocking */
	} else {
		truncate_pagecache();
	}

that EXTENT_BUSY_DAX flag needs to be carried all the way through to
the xfs_free_extent -> xfs_extent_busy_insert(). That's probably the
most complex part of the patch.

This flag then prevents xfs_extent_busy_reuse() from allowing reuse
of the extent.

And in xfs_extent_busy_clear(), they need to be treated sort of like
discarded extents. On transaction commit callback, we need to check
if there are still busy daxdma pages over the extent range, and if
there are we leave it in the busy list, otherwise it can be cleared.
For everything that is left in the busy list, the dax dma code will
need to call back into the filesystem when that page is released and
when the extent no longer has any dax dma busy pages left over it, it
can be cleared from the list.

Once we have the dax code calling back into the filesystem when the
problematic daxdma pages are released, everything else should be
relatively straightforward...
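
As a rough illustration of that last step - EXTENT_BUSY_DAX is the flag
named above, xfs_dax_extent_busy_page() is an assumed helper, and the
field names simply follow the existing struct xfs_extent_busy; this is a
sketch of the idea, not code from the series:

/*
 * Rough sketch only: an extent flagged EXTENT_BUSY_DAX stays on the
 * busy list while any page backing it still has DMA references; the
 * dax callback retries the clear once the last reference is dropped.
 */
static bool xfs_extent_busy_dax_clearable(struct xfs_mount *mp,
					  struct xfs_extent_busy *busyp)
{
	if (!(busyp->flags & EXTENT_BUSY_DAX))
		return true;	/* normal busy extent rules apply */

	/* still busy if any page over the extent has DMA references */
	return !xfs_dax_extent_busy_page(mp, busyp->agno, busyp->bno,
					 busyp->length);
}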

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-30  2:00           ` Dave Chinner
@ 2017-10-30  8:38             ` Jan Kara
  -1 siblings, 0 replies; 143+ messages in thread
From: Jan Kara @ 2017-10-30  8:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Jan Kara, Christoph Hellwig, Michal Hocko,
	Benjamin Herrenschmidt, Dave Hansen, Heiko Carstens,
	J. Bruce Fields, linux-mm, Paul Mackerras, Sean Hefty,
	Jeff Layton, Matthew Wilcox, linux-rdma, Michael Ellerman,
	Jason Gunthorpe, Doug Ledford, Hal Rosenstock, linux-fsdevel,
	Alexander Viro, Gerald Schaefer, linux-nvdimm,
	Linux Kernel Mailing List, linux-xfs, Martin Schwidefsky,
	Andrew Morton, Darrick J. Wong, Kirill A. Shutemov

Hi,

On Mon 30-10-17 13:00:23, Dave Chinner wrote:
> On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
> > Coming back to this since Dave has made clear that new locking to
> > coordinate get_user_pages() is a no-go.
> > 
> > We can unmap to force new get_user_pages() attempts to block on the
> > per-fs mmap lock, but if punch-hole finds any elevated pages it needs
> > to drop the mmap lock and wait. We need this lock dropped to get
> > around the problem that the driver will not start to drop page
> > references until it has elevated the page references on all the pages
> > in the I/O. If we need to drop the mmap lock that makes it impossible
> > to coordinate this unlock/retry loop within truncate_inode_pages_range
> > which would otherwise be the natural place to land this code.
> > 
> > Would it be palatable to unmap and drain dma in any path that needs to
> > detach blocks from an inode? Something like the following that builds
> > on dax_wait_dma() tried to achieve, but does not introduce a new lock
> > for the fs to manage:
> > 
> > retry:
> >     per_fs_mmap_lock(inode);
> >     unmap_mapping_range(mapping, start, end); /* new page references
> > cannot be established */
> >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
> >         per_fs_mmap_unlock(inode); /* new page references can happen,
> > so we need to start over */
> >         wait_for_page_idle(dax_page);
> >         goto retry;
> >     }
> >     truncate_inode_pages_range(mapping, start, end);
> >     per_fs_mmap_unlock(inode);
> 
> These retry loops you keep proposing are just bloody horrible.  They
> are basically just a method for blocking an operation until whatever
> condition is preventing the invalidation goes away. IMO, that's an
> ugly solution no matter how much lipstick you dress it up with.
> 
> i.e. the blocking loops mean the user process is going to be blocked
> for arbitrary lengths of time. That's not a solution, it's just
> passing the buck - now the userspace developers need to work around
> truncate/hole punch being randomly blocked for arbitrary lengths of
> time.

So I see a substantial difference between how you and Christoph think this
should be handled. Christoph writes in [1]:

The point is that we need to prohibit long term elevated page counts
with DAX anyway - we can't just let people grab allocated blocks forever
while ignoring file system operations.  For stage 1 we'll just need to
fail those, and in the long run they will have to use a mechanism
similar to FL_LAYOUT locks to deal with file system allocation changes.

So Christoph wants to block truncate until references are released, and
forbid long term references until the userspace acquiring them supports some
kind of lease-breaking. OTOH you suggest truncate should just proceed,
leaving blocks allocated until references are released. We cannot have
both... I'm leaning more towards the approach Christoph suggests as it puts
the burden on the party causing it - the application holding long term
references - and applications needing this should be sufficiently rare that
we don't have to devise a general mechanism in the kernel for this.

If the solution Christoph suggests is acceptable to you, I think we should
first write a patch to forbid acquiring long term references to DAX blocks.
On top of that we can implement a mechanism to block truncate while there
are short term references pending (and for that, retry loops would be IMHO
acceptable). And then we can work on a mechanism to notify userspace that
it needs to drop references to blocks that are going to be truncated, so
that we can re-enable taking of long term references.
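
For that first step, the gup-side check could be as small as something like
this - treat vma_is_fsdax() as "is this a DAX-backed regular-file mapping"
and 'longterm' as however the caller declares a long-lived pin, both of
which are sketch-level assumptions here:

	/* in the get_user_pages() slow path, per vma: */
	if (longterm && vma_is_fsdax(vma))
		return -EOPNOTSUPP;	/* must use a lease-aware interface instead */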

								Honza

[1]
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1522887.html

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-30  8:38             ` Jan Kara
@ 2017-10-30 11:20               ` Dave Chinner
  -1 siblings, 0 replies; 143+ messages in thread
From: Dave Chinner @ 2017-10-30 11:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Jeff Layton, Sean Hefty, Matthew Wilcox, linux-rdma,
	Michael Ellerman, Christoph Hellwig, Jason Gunthorpe,
	Doug Ledford, Hal Rosenstock, Martin Schwidefsky, Alexander Viro,
	Gerald Schaefer, linux-nvdimm, Linux Kernel Mailing List,
	linux-xfs, linux-fsdevel, Andrew Morton, Darrick J. Wong,
	Kirill A. Shutemov

On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote:
> Hi,
> 
> On Mon 30-10-17 13:00:23, Dave Chinner wrote:
> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
> > > Coming back to this since Dave has made clear that new locking to
> > > coordinate get_user_pages() is a no-go.
> > > 
> > > We can unmap to force new get_user_pages() attempts to block on the
> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs
> > > to drop the mmap lock and wait. We need this lock dropped to get
> > > around the problem that the driver will not start to drop page
> > > references until it has elevated the page references on all the pages
> > > in the I/O. If we need to drop the mmap lock that makes it impossible
> > > to coordinate this unlock/retry loop within truncate_inode_pages_range
> > > which would otherwise be the natural place to land this code.
> > > 
> > > Would it be palatable to unmap and drain dma in any path that needs to
> > > detach blocks from an inode? Something like the following that builds
> > > on dax_wait_dma() tried to achieve, but does not introduce a new lock
> > > for the fs to manage:
> > > 
> > > retry:
> > >     per_fs_mmap_lock(inode);
> > >     unmap_mapping_range(mapping, start, end); /* new page references
> > > cannot be established */
> > >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
> > >         per_fs_mmap_unlock(inode); /* new page references can happen,
> > > so we need to start over */
> > >         wait_for_page_idle(dax_page);
> > >         goto retry;
> > >     }
> > >     truncate_inode_pages_range(mapping, start, end);
> > >     per_fs_mmap_unlock(inode);
> > 
> > These retry loops you keep proposing are just bloody horrible.  They
> > are basically just a method for blocking an operation until whatever
> > condition is preventing the invalidation goes away. IMO, that's an
> > ugly solution no matter how much lipstick you dress it up with.
> > 
> > i.e. the blocking loops mean the user process is going to be blocked
> > for arbitrary lengths of time. That's not a solution, it's just
> > passing the buck - now the userspace developers need to work around
> > truncate/hole punch being randomly blocked for arbitrary lengths of
> > time.
> 
> So I see substantial difference between how you and Christoph think this
> should be handled. Christoph writes in [1]:
> 
> The point is that we need to prohibit long term elevated page counts
> with DAX anyway - we can't just let people grab allocated blocks forever
> while ignoring file system operations.  For stage 1 we'll just need to
> fail those, and in the long run they will have to use a mechanism
> similar to FL_LAYOUT locks to deal with file system allocation changes.
> 
> So Christoph wants to block truncate until references are released, forbid
> long term references until userspace acquiring them supports some kind of
> lease-breaking. OTOH you suggest truncate should just proceed leaving
> blocks allocated until references are released.

I don't see what I'm suggesting as a solution to long term elevated
page counts. Just something that can park extents until layout
leases are broken and references released. That's a few tens of
seconds at most.

> We cannot have both... I'm leaning more towards the approach
> Christoph suggests as it puts the burden on the party which is
> causing it - the application having long term references - and
> applications needing this should be sufficiently rare that we
> don't have to devise a general mechanism in the kernel for this.

I have no problems with blocking truncate forever if that's the
desired solution for an elevated page count due to a DMA reference
to a page. But that has absolutely nothing to do with the filesystem
- it's a page reference vs mapping invalidation problem, not
a filesystem/inode problem.

Perhaps pages with active DAX DMA mapping references need a page
flag to indicate that invalidation must block on the page similar to
the writeback flag...
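
Roughly a DAX analogue of wait_on_page_writeback() - PG_dax_dma,
PageDaxDmaBusy() and wait_on_page_dax_dma() below are hypothetical names,
only wait_on_page_bit() is an existing primitive:

	/* hypothetical: block until the DAX DMA reference is dropped */
	static inline void wait_on_page_dax_dma(struct page *page)
	{
		if (PageDaxDmaBusy(page))
			wait_on_page_bit(page, PG_dax_dma);
	}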

> If the solution Christoph suggests is acceptable to you, I think
> we should first write a patch to forbid acquiring long term
> references to DAX blocks.  On top of that we can implement
> mechanism to block truncate while there are short term references
> pending (and for that retry loops would be IMHO acceptable).

The problem with retry loops is that they are making a mess of an
already complex set of locking constraints on the inode IO path. It's
rapidly descending into an unmaintainable mess - falling off the
locking cliff only makes the code harder to maintain - please look
for solutions that don't require new locks or lock retry loops.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
  2017-10-30 11:20               ` Dave Chinner
@ 2017-10-30 17:51                 ` Dan Williams
  -1 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-30 17:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Michal Hocko, Jan Kara, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Jeff Layton, Christoph Hellwig, Matthew Wilcox, linux-rdma,
	Michael Ellerman, Jason Gunthorpe, Doug Ledford, Sean Hefty,
	Hal Rosenstock, linux-fsdevel, Alexander Viro, Gerald Schaefer,
	linux-nvdimm, Linux Kernel Mailing List, linux-xfs,
	Martin Schwidefsky, Andrew Morton, Darrick J. Wong,
	Kirill A. Shutemov

On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote:
>> Hi,
>>
>> On Mon 30-10-17 13:00:23, Dave Chinner wrote:
>> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
>> > > Coming back to this since Dave has made clear that new locking to
>> > > coordinate get_user_pages() is a no-go.
>> > >
>> > > We can unmap to force new get_user_pages() attempts to block on the
>> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs
>> > > to drop the mmap lock and wait. We need this lock dropped to get
>> > > around the problem that the driver will not start to drop page
>> > > references until it has elevated the page references on all the pages
>> > > in the I/O. If we need to drop the mmap lock that makes it impossible
>> > > to coordinate this unlock/retry loop within truncate_inode_pages_range
>> > > which would otherwise be the natural place to land this code.
>> > >
>> > > Would it be palatable to unmap and drain dma in any path that needs to
>> > > detach blocks from an inode? Something like the following that builds
>> > > on dax_wait_dma() tried to achieve, but does not introduce a new lock
>> > > for the fs to manage:
>> > >
>> > > retry:
>> > >     per_fs_mmap_lock(inode);
>> > >     unmap_mapping_range(mapping, start, end); /* new page references
>> > > cannot be established */
>> > >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>> > >         per_fs_mmap_unlock(inode); /* new page references can happen,
>> > > so we need to start over */
>> > >         wait_for_page_idle(dax_page);
>> > >         goto retry;
>> > >     }
>> > >     truncate_inode_pages_range(mapping, start, end);
>> > >     per_fs_mmap_unlock(inode);
>> >
>> > These retry loops you keep proposing are just bloody horrible.  They
>> > are basically just a method for blocking an operation until whatever
>> > condition is preventing the invalidation goes away. IMO, that's an
>> > ugly solution no matter how much lipstick you dress it up with.
>> >
>> > i.e. the blocking loops mean the user process is going to be blocked
>> > for arbitrary lengths of time. That's not a solution, it's just
>> > passing the buck - now the userspace developers need to work around
>> > truncate/hole punch being randomly blocked for arbitrary lengths of
>> > time.
>>
>> So I see substantial difference between how you and Christoph think this
>> should be handled. Christoph writes in [1]:
>>
>> The point is that we need to prohibit long term elevated page counts
>> with DAX anyway - we can't just let people grab allocated blocks forever
>> while ignoring file system operations.  For stage 1 we'll just need to
>> fail those, and in the long run they will have to use a mechanism
>> similar to FL_LAYOUT locks to deal with file system allocation changes.
>>
>> So Christoph wants to block truncate until references are released, forbid
>> long term references until userspace acquiring them supports some kind of
>> lease-breaking. OTOH you suggest truncate should just proceed leaving
>> blocks allocated until references are released.
>
> I don't see what I'm suggesting is a solution to long term elevated
> page counts. Just something that can park extents until layout
> leases are broken and references released. That's a few tens of
> seconds at most.
>
>> We cannot have both... I'm leaning more towards the approach
>> Christoph suggests as it puts the burden on the party which is
>> causing it - the application having long term references - and
>> applications needing this should be sufficiently rare that we
>> don't have to devise a general mechanism in the kernel for this.
>
> I have no problems with blocking truncate forever if that's the
> desired solution for an elevated page count due to a DMA reference
> to a page. But that has absolutely nothing to do with the filesystem
> though - it's a page reference vs mapping invalidation problem, not
> a filesystem/inode problem.
>
> Perhaps pages with active DAX DMA mapping references need a page
> flag to indicate that invalidation must block on the page similar to
> the writeback flag...

We effectively already have this flag since pages where
is_zone_device_page() == true can only have their reference count
elevated by get_user_pages().

More importantly, we cannot block invalidation on an elevated page
count, because that count may never drop until references have been
acquired on all pages in the I/O. I.e. iov_iter_get_pages() grabs a
range of pages, potentially across multiple vmas, and does not drop any
references in the range until all pages have had their count elevated.
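
Simplified, the driver-side ordering looks like this (pin_one_user_page()
and do_dma() are stand-ins for whatever the driver actually calls):

	/* every page in the I/O is pinned before any DMA starts */
	for (i = 0; i < npages; i++)
		pin_one_user_page(iter, &pages[i]);	/* may fault, takes the fs mmap lock */
	do_dma(pages, npages);
	/* only now do the references start to drop */
	for (i = 0; i < npages; i++)
		put_page(pages[i]);

so if invalidation sits on the mmap lock waiting for pages[0] to go idle
while the driver is still trying to pin pages[1] (which needs that same
lock to fault), neither side can make progress.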

>> If the solution Christoph suggests is acceptable to you, I think
>> we should first write a patch to forbid acquiring long term
>> references to DAX blocks.  On top of that we can implement
>> mechanism to block truncate while there are short term references
>> pending (and for that retry loops would be IMHO acceptable).
>
> The problem with retry loops is that they are making a mess of an
> already complex set of locking constraints on the inode IO path. It's
> rapidly descending into an unmaintainable mess - falling off the
> locking cliff only makes the code harder to maintain - please look
> for solutions that don't require new locks or lock retry loops.

I was hoping to make the retry loop no worse than the one we already
perform for xfs_break_layouts(), and then the approach can be easily
shared between ext4 and xfs.

However, before we get there we need quite a bit of rework (requiring
struct page for dax, using pfns in the dax radix, disabling long-held page
reference counts for DAX, i.e. RDMA / V4L2...). I'll submit those
preparation steps first and then we can circle back to the "how to
wait for DAX-DMA to end" problem.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
@ 2017-10-30 17:51                 ` Dan Williams
  0 siblings, 0 replies; 143+ messages in thread
From: Dan Williams @ 2017-10-30 17:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Michal Hocko, Benjamin Herrenschmidt, Dave Hansen,
	Heiko Carstens, J. Bruce Fields, linux-mm, Paul Mackerras,
	Jeff Layton, Sean Hefty, Matthew Wilcox, linux-rdma,
	Michael Ellerman, Christoph Hellwig, Jason Gunthorpe,
	Doug Ledford, Hal Rosenstock, Martin Schwidefsky, Alexander Viro,
	Gerald Schaefer, linux-nvdimm, Linux

On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote:
>> Hi,
>>
>> On Mon 30-10-17 13:00:23, Dave Chinner wrote:
>> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
>> > > Coming back to this since Dave has made clear that new locking to
>> > > coordinate get_user_pages() is a no-go.
>> > >
>> > > We can unmap to force new get_user_pages() attempts to block on the
>> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs
>> > > to drop the mmap lock and wait. We need this lock dropped to get
>> > > around the problem that the driver will not start to drop page
>> > > references until it has elevated the page references on all the pages
>> > > in the I/O. If we need to drop the mmap lock that makes it impossible
>> > > to coordinate this unlock/retry loop within truncate_inode_pages_range
>> > > which would otherwise be the natural place to land this code.
>> > >
>> > > Would it be palatable to unmap and drain dma in any path that needs to
>> > > detach blocks from an inode? Something like the following that builds
>> > > on dax_wait_dma() tried to achieve, but does not introduce a new lock
>> > > for the fs to manage:
>> > >
>> > > retry:
>> > >     per_fs_mmap_lock(inode);
>> > >     unmap_mapping_range(mapping, start, end); /* new page references
>> > > cannot be established */
>> > >     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>> > >         per_fs_mmap_unlock(inode); /* new page references can happen,
>> > > so we need to start over */
>> > >         wait_for_page_idle(dax_page);
>> > >         goto retry;
>> > >     }
>> > >     truncate_inode_pages_range(mapping, start, end);
>> > >     per_fs_mmap_unlock(inode);
>> >
>> > These retry loops you keep proposing are just bloody horrible.  They
>> > are basically just a method for blocking an operation until whatever
>> > condition is preventing the invalidation goes away. IMO, that's an
>> > ugly solution no matter how much lipstick you dress it up with.
>> >
>> > i.e. the blocking loops mean the user process is going to be blocked
>> > for arbitrary lengths of time. That's not a solution, it's just
>> > passing the buck - now the userspace developers need to work around
>> > truncate/hole punch being randomly blocked for arbitrary lengths of
>> > time.
>>
>> So I see substantial difference between how you and Christoph think this
>> should be handled. Christoph writes in [1]:
>>
>> The point is that we need to prohibit long term elevated page counts
>> with DAX anyway - we can't just let people grab allocated blocks forever
>> while ignoring file system operations.  For stage 1 we'll just need to
>> fail those, and in the long run they will have to use a mechanism
>> similar to FL_LAYOUT locks to deal with file system allocation changes.
>>
>> So Christoph wants to block truncate until references are released, forbid
>> long term references until userspace acquiring them supports some kind of
>> lease-breaking. OTOH you suggest truncate should just proceed leaving
>> blocks allocated until references are released.
>
> I don't see what I'm suggesting is a solution to long term elevated
> page counts. Just something that can park extents until layout
> leases are broken and references released. That's a few tens of
> seconds at most.
>
>> We cannot have both... I'm leaning more towards the approach
>> Christoph suggests as it puts the burned to the place which is
>> causing it - the application having long term references - and
>> applications needing this should be sufficiently rare that we
>> don't have to devise a general mechanism in the kernel for this.
>
> I have no problems with blocking truncate forever if that's the
> desired solution for an elevated page count due to a DMA reference
> to a page. But that has absolutely nothing to do with the filesystem
> though - it's a page reference vs mapping invalidation problem, not
> a filesystem/inode problem.
>
> Perhaps pages with active DAX DMA mapping references need a page
> flag to indicate that invalidation must block on the page similar to
> the writeback flag...

We effectively already have this flag since pages where
is_zone_device_page() == true can only have their reference count
elevated by get_user_pages().

More importantly we can not block invalidation on an elevated page
count because that page count may never drop until all references have
been acquired. I.e. iov_iter_get_pages() grabs a range of pages
potentially across multiple vmas and does not drop any references in
the range until all pages have had their count elevated.

>> If the solution Christoph suggests is acceptable to you, I think
>> we should first write a patch to forbid acquiring long term
>> references to DAX blocks.  On top of that we can implement a
>> mechanism to block truncate while there are short term references
>> pending (and for that retry loops would be IMHO acceptable).
>
> The problem with retry loops is that they are making a mess of an
> already complex set of locking constraints on the inode IO path. It's
> rapidly descending into an unmaintainable mess - falling off the
> locking cliff only makes the code harder to maintain - please look
> for solutions that don't require new locks or lock retry loops.

I was hoping to make the retry loop no worse than the one we already
perform for xfs_break_layouts(), and then the approach can be easily
shared between ext4 and xfs.
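
(Roughly the shape of that existing loop - drop the IOLOCK, block until
the layout lease is broken, retake and retry - and how a DAX-DMA wait
could mirror it; wait_for_page_idle() and dax_dma_busy_page() are the
same hypothetical helpers used in the pseudo-code above:)

    while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
            xfs_iunlock(ip, *iolock);
            error = break_layout(inode, true); /* sleeps until broken */
            *iolock = XFS_IOLOCK_EXCL;
            xfs_ilock(ip, *iolock);
    }

    /* a DAX-DMA wait could take the same shape: */
    while ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
            xfs_iunlock(ip, *iolock);
            wait_for_page_idle(dax_page);
            xfs_ilock(ip, *iolock);
    }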

However, before we get there, we need quite a bit of rework (require
struct page for dax, store pfns in the dax radix, disable long-held page
reference counts against DAX mappings, i.e. RDMA / V4L2...). I'll submit those
preparation steps first and then we can circle back to the "how to
wait for DAX-DMA to end" problem.


^ permalink raw reply	[flat|nested] 143+ messages in thread

end of thread, other threads:[~2017-10-30 17:51 UTC | newest]

Thread overview: 143+ messages
2017-10-20  2:38 [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Dan Williams
2017-10-20  2:39 ` [PATCH v3 01/13] dax: quiet bdev_dax_supported() Dan Williams
2017-10-20  2:39 ` [PATCH v3 02/13] dax: require 'struct page' for filesystem dax Dan Williams
2017-10-20  7:57   ` Christoph Hellwig
2017-10-20 15:23     ` Dan Williams
2017-10-20 16:29       ` Christoph Hellwig
2017-10-20 22:29         ` Dan Williams
2017-10-21  3:20           ` Matthew Wilcox
2017-10-21  4:16             ` Dan Williams
2017-10-21  8:15               ` Christoph Hellwig
2017-10-23  5:18         ` Martin Schwidefsky
2017-10-23  8:55           ` Dan Williams
2017-10-23 10:44             ` Martin Schwidefsky
2017-10-23 11:20               ` Dan Williams
2017-10-20  2:39 ` [PATCH v3 03/13] dax: stop using VM_MIXEDMAP for dax Dan Williams
2017-10-20  2:39 ` [PATCH v3 04/13] dax: stop using VM_HUGEPAGE " Dan Williams
2017-10-20  2:39 ` [PATCH v3 05/13] dax: stop requiring a live device for dax_flush() Dan Williams
2017-10-20  2:39 ` [PATCH v3 06/13] dax: store pfns in the radix Dan Williams
2017-10-20  2:39 ` [PATCH v3 07/13] dax: warn if dma collides with truncate Dan Williams
2017-10-20  2:39 ` [PATCH v3 08/13] tools/testing/nvdimm: add 'bio_delay' mechanism Dan Williams
2017-10-20  2:39 ` [PATCH v3 09/13] IB/core: disable memory registration of fileystem-dax vmas Dan Williams
2017-10-20  2:39 ` [PATCH v3 10/13] mm: disable get_user_pages_fast() for dax Dan Williams
2017-10-20  2:39 ` [PATCH v3 11/13] fs: use smp_load_acquire in break_{layout,lease} Dan Williams
2017-10-20 12:39   ` Jeffrey Layton
2017-10-20  2:40 ` [PATCH v3 12/13] dax: handle truncate of dma-busy pages Dan Williams
2017-10-20 13:05   ` Jeff Layton
2017-10-20 15:42     ` Dan Williams
2017-10-20 16:32       ` Christoph Hellwig
2017-10-20 17:27         ` Dan Williams
2017-10-20 20:36           ` Brian Foster
2017-10-21  8:11           ` Christoph Hellwig
2017-10-20  2:40 ` [PATCH v3 13/13] xfs: wire up FL_ALLOCATED support Dan Williams
2017-10-20  7:47 ` [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support Christoph Hellwig
2017-10-20  9:31   ` Christoph Hellwig
2017-10-26 10:58     ` Jan Kara
2017-10-26 23:51       ` Williams, Dan J
2017-10-27  6:48         ` Dave Chinner
2017-10-27 11:42           ` Dan Williams
2017-10-29 21:52             ` Dave Chinner
2017-10-27  6:45       ` Christoph Hellwig
2017-10-29 23:46       ` Dan Williams
2017-10-30  2:00         ` Dave Chinner
2017-10-30  8:38           ` Jan Kara
2017-10-30 11:20             ` Dave Chinner
2017-10-30 17:51               ` Dan Williams
