* [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush
From: Dan Williams @ 2017-10-06 22:35 UTC
  To: linux-nvdimm
  Cc: Jan Kara, Dave Chinner, J. Bruce Fields, linux-mm, Sean Hefty,
	Christoph Hellwig, Marek Szyprowski, Ashok Raj, Darrick J. Wong,
	linux-rdma, Joerg Roedel, Doug Ledford, Linus Torvalds,
	Hal Rosenstock, Arnd Bergmann, Alexander Viro, Andy Lutomirski,
	Jeff Layton, Greg Kroah-Hartman, linux-xfs, linux-api,
	linux-fsdevel, Andrew Morton, David Woodhouse, Robin Murphy

Changes since v6 [1]:
* Abandon the concept of immutable files and rework the implementation
  to reuse the same FL_LAYOUT file lease mechanism that coordinates pnfsd
  layouts with local filesystem changes. This establishes an interface
  where the kernel is always in control of the block-map and is free to
  invalidate MAP_DIRECT mappings when a lease breaker arrives. (Christoph)

* Introduce a new ->mmap_validate() file operation, since we need both
  the original @flags and @fd passed to mmap(2) to set up a MAP_DIRECT
  mapping.

* Introduce a ->lease_direct() vm operation to allow the RDMA core to
  safely register memory against DAX and tear down the mapping when the
  lease is broken. This can be reused by any subsystem that follows a
  memory-registration semantic (a hypothetical sketch follows the
  reference below).

[1]: https://lkml.org/lkml/2017/8/23/754
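
To make that memory-registration pattern concrete, here is a purely
hypothetical consumer-side sketch; neither the ->lease_direct()
signature nor any of these names are confirmed by this posting, it only
illustrates "take a lease, get a callback when it breaks":

    /* hypothetical: assumes ->lease_direct(vma, break_fn, owner) */
    static void example_lease_break(void *owner)
    {
        /* quiesce DMA and unregister memory before the kernel
         * revokes access to the mapping */
    }

    static struct lease_direct *example_register(struct vm_area_struct *vma,
            void *owner)
    {
        if (!vma->vm_ops || !vma->vm_ops->lease_direct)
            return NULL;
        return vma->vm_ops->lease_direct(vma, example_lease_break, owner);
    }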

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the file's block-map, or
otherwise dirty its block-map metadata, without notification. It
supports a "flush from userspace" model where persistent memory
applications can bypass the overhead of ongoing write coordination with
the filesystem, and it provides safety for RDMA operations involving
DAX mappings.
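
As a rough illustration of the intended usage model (a sketch only: the
MAP_DIRECT value below is hypothetical since the flag is defined by this
series rather than any released uapi header, and the clwb-based flush
assumes an x86 persistent-memory platform):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>

    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x3  /* from patch 1 of this series */
    #endif
    #ifndef MAP_DIRECT
    #define MAP_DIRECT 0x8           /* hypothetical value, for illustration */
    #endif

    static void flush_range(void *addr, size_t len)
    {
        uintptr_t p;

        /* write back CPU caches so stores are durable in pmem */
        for (p = (uintptr_t)addr & ~63UL; p < (uintptr_t)addr + len; p += 64)
            asm volatile("clwb %0" : "+m" (*(volatile char *)p));
        asm volatile("sfence" ::: "memory");
    }

    int write_persistent(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_RDWR);
        void *addr;

        if (fd < 0)
            return -1;
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
        if (addr == MAP_FAILED)
            return -1;  /* no kernel/fs support: fall back to msync() path */
        memcpy(addr, buf, len);
        flush_range(addr, len);  /* durable with no fsync()/msync() */
        munmap(addr, len);
        close(fd);
        return 0;
    }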

The kernel always retains the ability to revoke access and convert the
file back to normal operation by performing a "lease break". As with
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only delay it via the
/proc/sys/fs/lease-break-time setting.
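
A lease holder is expected to quiesce and unmap within that window. A
minimal sketch, assuming the notification arrives as SIGIO via the
fasync registration mentioned in patch 2 (the exact signal and delivery
details are illustrative, not a confirmed ABI):

    #include <signal.h>
    #include <sys/mman.h>

    static volatile sig_atomic_t lease_broken;

    static void lease_break_handler(int sig)
    {
        /* kernel started the /proc/sys/fs/lease-break-time countdown */
        lease_broken = 1;
    }

    static void install_lease_break_handler(void)
    {
        struct sigaction sa = { .sa_handler = lease_break_handler };

        sigaction(SIGIO, &sa, NULL);
        /* the workload then polls lease_broken, and must quiesce
         * writes / RDMA and munmap() before the timeout expires */
    }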

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory, with no ongoing coordination
with the filesystem via fsync/msync syscalls.

---

Dan Williams (12):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: introduce i_mapdcount
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      xfs: wire up MAP_DIRECT
      dma-mapping: introduce dma_has_iommu()
      fs, mapdirect: introduce ->lease_direct()
      xfs: wire up ->lease_direct()
      device-dax: wire up ->lease_direct()
      IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
      tools/testing/nvdimm: enable rdma unit tests


 arch/alpha/include/uapi/asm/mman.h           |    1 
 arch/mips/include/uapi/asm/mman.h            |    1 
 arch/mips/kernel/vdso.c                      |    2 
 arch/parisc/include/uapi/asm/mman.h          |    1 
 arch/tile/mm/elf.c                           |    3 
 arch/x86/mm/mpx.c                            |    3 
 arch/xtensa/include/uapi/asm/mman.h          |    1 
 drivers/base/dma-mapping.c                   |   10 +
 drivers/dax/device.c                         |    4 
 drivers/infiniband/core/umem.c               |   90 ++++++-
 drivers/iommu/amd_iommu.c                    |    6 
 drivers/iommu/intel-iommu.c                  |    6 
 fs/Kconfig                                   |    4 
 fs/Makefile                                  |    1 
 fs/aio.c                                     |    2 
 fs/mapdirect.c                               |  349 ++++++++++++++++++++++++++
 fs/xfs/Kconfig                               |    4 
 fs/xfs/Makefile                              |    1 
 fs/xfs/xfs_file.c                            |  130 ++++++++++
 fs/xfs/xfs_iomap.c                           |    9 +
 fs/xfs/xfs_layout.c                          |   42 +++
 fs/xfs/xfs_layout.h                          |   13 +
 fs/xfs/xfs_pnfs.c                            |   30 --
 fs/xfs/xfs_pnfs.h                            |   10 -
 include/linux/dma-mapping.h                  |    3 
 include/linux/fs.h                           |   33 ++
 include/linux/mapdirect.h                    |   68 +++++
 include/linux/mm.h                           |   15 +
 include/linux/mman.h                         |   42 +++
 include/rdma/ib_umem.h                       |    8 +
 include/uapi/asm-generic/mman-common.h       |    1 
 include/uapi/asm-generic/mman.h              |    1 
 ipc/shm.c                                    |    3 
 mm/internal.h                                |    2 
 mm/mmap.c                                    |   28 ++
 mm/nommu.c                                   |    5 
 mm/util.c                                    |    7 -
 tools/include/uapi/asm-generic/mman-common.h |    1 
 tools/testing/nvdimm/Kbuild                  |   31 ++
 tools/testing/nvdimm/config_check.c          |    2 
 tools/testing/nvdimm/test/iomap.c            |    6 
 41 files changed, 906 insertions(+), 73 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h

* [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
From: Dan Williams @ 2017-10-06 22:35 UTC
  To: linux-nvdimm
  Cc: Jan Kara, Arnd Bergmann, linux-rdma, linux-api, linux-xfs,
	linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton,
	Linus Torvalds, Christoph Hellwig

The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
that lack support for it. Define a new MAP_SHARED_VALIDATE flag pattern
that is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs, Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
		    | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
		    | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

Similar to ->fallocate(), we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end, arrange for the original mmap flags
to be passed to a new ->mmap_validate() op in 'struct file_operations'.
By default all existing flags are implicitly supported, but new flags
require MAP_SHARED_VALIDATE and a per-instance opt-in.
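
A practical consequence is that userspace can probe for the new
semantics at runtime, since legacy kernels reject the 0x3 map type. A
minimal sketch (MAP_SHARED_VALIDATE is defined locally for builds
against pre-series headers; the value comes from this patch):

    #include <sys/mman.h>
    #include <errno.h>

    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x3
    #endif

    /* 1: kernel validates mmap flags, 0: legacy kernel, -1: other error */
    int probe_map_shared_validate(int fd, size_t len)
    {
        void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED_VALIDATE, fd, 0);

        if (addr != MAP_FAILED) {
            /* all-legacy flags downgrade to MAP_SHARED and succeed */
            munmap(addr, len);
            return 1;
        }
        /* pre-series kernels fail the MAP_TYPE switch with EINVAL */
        return errno == EINVAL ? 0 : -1;
    }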

Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h           |    1 +
 arch/mips/include/uapi/asm/mman.h            |    1 +
 arch/mips/kernel/vdso.c                      |    2 +
 arch/parisc/include/uapi/asm/mman.h          |    1 +
 arch/tile/mm/elf.c                           |    3 +-
 arch/xtensa/include/uapi/asm/mman.h          |    1 +
 include/linux/fs.h                           |    2 +
 include/linux/mm.h                           |    2 +
 include/linux/mman.h                         |   39 ++++++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h       |    1 +
 mm/mmap.c                                    |   21 ++++++++++++--
 tools/include/uapi/asm-generic/mman-common.h |    1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED	0x100		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL);
+			   0, NULL, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x03		/* Mask for type of mapping */
 #define MAP_FIXED	0x04		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
 #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 889901824400..5ffcbe76aef9 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		unsigned long addr = MEM_USER_INTRPT;
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
-				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL);
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
+				   NULL, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..ec597900eec7 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -37,6 +37,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..51538958f7f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,8 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*mmap_validate) (struct file *, struct vm_area_struct *,
+			unsigned long);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..5c4c98e4adc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, unsigned long map_flags);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..94b63b4d71ff 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,45 @@
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
 
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+		| MAP_PRIVATE \
+		| MAP_FIXED \
+		| MAP_ANONYMOUS \
+		| MAP_DENYWRITE \
+		| MAP_EXECUTABLE \
+		| MAP_UNINITIALIZED \
+		| MAP_GROWSDOWN \
+		| MAP_LOCKED \
+		| MAP_NORESERVE \
+		| MAP_POPULATE \
+		| MAP_NONBLOCK \
+		| MAP_STACK \
+		| MAP_HUGETLB \
+		| MAP_32BIT \
+		| MAP_HUGE_2MB \
+		| MAP_HUGE_1GB)
+
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..ac55d1c0ec0f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..a1bcaa9eff42 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		struct inode *inode = file_inode(file);
 
 		switch (flags & MAP_TYPE) {
+		case (MAP_SHARED_VALIDATE):
+			if ((flags & ~LEGACY_MAP_MASK) == 0) {
+				/*
+				 * If all legacy mmap flags, downgrade
+				 * to MAP_SHARED, i.e. invoke ->mmap()
+				 * instead of ->mmap_validate()
+				 */
+				flags &= ~MAP_TYPE;
+				flags |= MAP_SHARED;
+			} else if (!file->f_op->mmap_validate)
+				return -EOPNOTSUPP;
+			/* fall through */
 		case MAP_SHARED:
 			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
 				return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long map_flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * new file must not have been exposed to user-space, yet.
 		 */
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+			error = file->f_op->mmap_validate(file, vma, map_flags);
+		else
+			error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
 
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..202bc4277fb5 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw
  Cc: Jan Kara, Arnd Bergmann, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Linus Torvalds, Christoph Hellwig

The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that  could not be supported by all
archs Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
		    | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
		    | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis.  Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
Cc: Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Suggested-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Suggested-by: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/alpha/include/uapi/asm/mman.h           |    1 +
 arch/mips/include/uapi/asm/mman.h            |    1 +
 arch/mips/kernel/vdso.c                      |    2 +
 arch/parisc/include/uapi/asm/mman.h          |    1 +
 arch/tile/mm/elf.c                           |    3 +-
 arch/xtensa/include/uapi/asm/mman.h          |    1 +
 include/linux/fs.h                           |    2 +
 include/linux/mm.h                           |    2 +
 include/linux/mman.h                         |   39 ++++++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h       |    1 +
 mm/mmap.c                                    |   21 ++++++++++++--
 tools/include/uapi/asm-generic/mman-common.h |    1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED	0x100		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL);
+			   0, NULL, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x03		/* Mask for type of mapping */
 #define MAP_FIXED	0x04		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
 #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 889901824400..5ffcbe76aef9 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		unsigned long addr = MEM_USER_INTRPT;
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
-				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL);
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
+				   NULL, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..ec597900eec7 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -37,6 +37,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..51538958f7f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,8 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*mmap_validate) (struct file *, struct vm_area_struct *,
+			unsigned long);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..5c4c98e4adc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, unsigned long map_flags);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..94b63b4d71ff 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,45 @@
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
 
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+		| MAP_PRIVATE \
+		| MAP_FIXED \
+		| MAP_ANONYMOUS \
+		| MAP_DENYWRITE \
+		| MAP_EXECUTABLE \
+		| MAP_UNINITIALIZED \
+		| MAP_GROWSDOWN \
+		| MAP_LOCKED \
+		| MAP_NORESERVE \
+		| MAP_POPULATE \
+		| MAP_NONBLOCK \
+		| MAP_STACK \
+		| MAP_HUGETLB \
+		| MAP_32BIT \
+		| MAP_HUGE_2MB \
+		| MAP_HUGE_1GB)
+
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..ac55d1c0ec0f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..a1bcaa9eff42 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		struct inode *inode = file_inode(file);
 
 		switch (flags & MAP_TYPE) {
+		case (MAP_SHARED_VALIDATE):
+			if ((flags & ~LEGACY_MAP_MASK) == 0) {
+				/*
+				 * If all legacy mmap flags, downgrade
+				 * to MAP_SHARED, i.e. invoke ->mmap()
+				 * instead of ->mmap_validate()
+				 */
+				flags &= ~MAP_TYPE;
+				flags |= MAP_SHARED;
+			} else if (!file->f_op->mmap_validate)
+				return -EOPNOTSUPP;
+			/* fall through */
 		case MAP_SHARED:
 			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
 				return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long map_flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * new file must not have been exposed to user-space, yet.
 		 */
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+			error = file->f_op->mmap_validate(file, vma, map_flags);
+		else
+			error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
 
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..202bc4277fb5 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Arnd Bergmann, linux-rdma, linux-api, linux-xfs,
	linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton,
	Linus Torvalds, Christoph Hellwig

The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that  could not be supported by all
archs Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
		    | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
		    | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis.  Towards that end arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.

Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h           |    1 +
 arch/mips/include/uapi/asm/mman.h            |    1 +
 arch/mips/kernel/vdso.c                      |    2 +
 arch/parisc/include/uapi/asm/mman.h          |    1 +
 arch/tile/mm/elf.c                           |    3 +-
 arch/xtensa/include/uapi/asm/mman.h          |    1 +
 include/linux/fs.h                           |    2 +
 include/linux/mm.h                           |    2 +
 include/linux/mman.h                         |   39 ++++++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h       |    1 +
 mm/mmap.c                                    |   21 ++++++++++++--
 tools/include/uapi/asm-generic/mman-common.h |    1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED	0x100		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with OSF/1 defines */
 #define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL);
+			   0, NULL, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
 #define MAP_TYPE	0x03		/* Mask for type of mapping */
 #define MAP_FIXED	0x04		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 #define MAP_DENYWRITE	0x0800		/* ETXTBSY */
 #define MAP_EXECUTABLE	0x1000		/* mark it as an executable */
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 889901824400..5ffcbe76aef9 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		unsigned long addr = MEM_USER_INTRPT;
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
-				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL);
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
+				   NULL, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..ec597900eec7 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -37,6 +37,7 @@
 #define MAP_PRIVATE	0x002		/* Changes are private */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /* not used by linux, but here to make sure we don't clash with ABI defines */
 #define MAP_RENAME	0x020		/* Assign page to file */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e73742e73..51538958f7f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,8 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*mmap_validate) (struct file *, struct vm_area_struct *,
+			unsigned long);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8c10d336e42..5c4c98e4adc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, unsigned long map_flags);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..94b63b4d71ff 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,45 @@
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
 
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+		| MAP_PRIVATE \
+		| MAP_FIXED \
+		| MAP_ANONYMOUS \
+		| MAP_DENYWRITE \
+		| MAP_EXECUTABLE \
+		| MAP_UNINITIALIZED \
+		| MAP_GROWSDOWN \
+		| MAP_LOCKED \
+		| MAP_NORESERVE \
+		| MAP_POPULATE \
+		| MAP_NONBLOCK \
+		| MAP_STACK \
+		| MAP_HUGETLB \
+		| MAP_32BIT \
+		| MAP_HUGE_2MB \
+		| MAP_HUGE_1GB)
+
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..ac55d1c0ec0f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..a1bcaa9eff42 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		struct inode *inode = file_inode(file);
 
 		switch (flags & MAP_TYPE) {
+		case (MAP_SHARED_VALIDATE):
+			if ((flags & ~LEGACY_MAP_MASK) == 0) {
+				/*
+				 * If only legacy mmap flags are set,
+				 * downgrade to MAP_SHARED, i.e. invoke
+				 * ->mmap() instead of ->mmap_validate()
+				 */
+				flags &= ~MAP_TYPE;
+				flags |= MAP_SHARED;
+			} else if (!file->f_op->mmap_validate)
+				return -EOPNOTSUPP;
+			/* fall through */
 		case MAP_SHARED:
 			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
 				return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long map_flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * new file must not have been exposed to user-space, yet.
 		 */
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+			error = file->f_op->mmap_validate(file, vma, map_flags);
+		else
+			error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
 
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 8c27db0c5c08..202bc4277fb5 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
 #else
 # define MAP_UNINITIALIZED 0x0		/* Don't support this flag */
 #endif
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 
 /*
  * Flags for mlock


^ permalink raw reply related	[flat|nested] 158+ messages in thread
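
To illustrate the uapi semantics above from the userspace side: with
only legacy flags, MAP_SHARED_VALIDATE silently downgrades to
MAP_SHARED; an unknown flag fails with EOPNOTSUPP on filesystems that
lack ->mmap_validate(). A minimal sketch, assuming a hypothetical
MAP_NEWFLAG bit that is not defined by this series:

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x3
#endif
#define MAP_NEWFLAG 0x80000	/* hypothetical extension flag */

static void *map_with_new_flag(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_SHARED_VALIDATE | MAP_NEWFLAG, fd, 0);

	/* the fs has no ->mmap_validate() to vet the extra flag */
	if (p == MAP_FAILED && errno == EOPNOTSUPP)
		fprintf(stderr, "MAP_NEWFLAG not supported\n");
	return p;
}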

* [PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate()
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner,
	linux-xfs, linux-mm, linux-fsdevel, Andrew Morton,
	Christoph Hellwig

The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to set up
an fasync_struct for signalling lease break events to the lease holder.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/mips/kernel/vdso.c |    2 +-
 arch/tile/mm/elf.c      |    2 +-
 arch/x86/mm/mpx.c       |    3 ++-
 fs/aio.c                |    2 +-
 include/linux/fs.h      |    2 +-
 include/linux/mm.h      |    9 +++++----
 ipc/shm.c               |    3 ++-
 mm/internal.h           |    2 +-
 mm/mmap.c               |   13 +++++++------
 mm/nommu.c              |    5 +++--
 mm/util.c               |    7 ++++---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL, 0);
+			   0, NULL, 0, -1);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
 				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-				   NULL, 0);
+				   NULL, 0, -1);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
 	down_write(&mm->mmap_sem);
 	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+			NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (populate)
 		mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused, NULL);
+				       MAP_SHARED, 0, &unused, NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 51538958f7f5..c2b9bf3dc4e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*mmap_validate) (struct file *, struct vm_area_struct *,
-			unsigned long);
+			unsigned long, int);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c4c98e4adc9..0afa19feb755 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf, unsigned long map_flags);
+	struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf);
+	struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf)
+	struct list_head *uf, int fd)
 {
-	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
+	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate,
+			uf, fd);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index 1e2b1692ba2c..585e05eef40a 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1399,7 +1399,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate,
+			NULL, -1);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/internal.h b/mm/internal.h
index 1df011f62480..70ed7b06dd85 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -466,7 +466,7 @@ extern u32 hwpoison_filter_enable;
 
 extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
-        unsigned long, unsigned long);
+        unsigned long, unsigned long, int);
 
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
diff --git a/mm/mmap.c b/mm/mmap.c
index a1bcaa9eff42..c2cb6334a7a9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1322,7 +1322,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
 			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf, int fd)
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
@@ -1477,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags, fd);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1527,7 +1527,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd);
 out_fput:
 	if (file)
 		fput(file);
@@ -1614,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf, unsigned long map_flags)
+		struct list_head *uf, unsigned long map_flags, int fd)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1700,7 +1700,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 */
 		vma->vm_file = get_file(file);
 		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
-			error = file->f_op->mmap_validate(file, vma, map_flags);
+			error = file->f_op->mmap_validate(file, vma,
+					map_flags, fd);
 		else
 			error = call_mmap(file, vma);
 		if (error)
@@ -2842,7 +2843,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap_pgoff(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, pgoff, &populate, NULL, -1);
 	fput(file);
 out:
 	up_write(&mm->mmap_sem);
diff --git a/mm/nommu.c b/mm/nommu.c
index 17c00d93de2e..952d205d3b66 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1206,7 +1206,8 @@ unsigned long do_mmap(struct file *file,
 			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf,
+			int fd)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
@@ -1439,7 +1440,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd);
 
 	if (file)
 		fput(file);
diff --git a/mm/util.c b/mm/util.c
index 34e57fae959d..dcf48d929185 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -319,7 +319,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
-	unsigned long flag, unsigned long pgoff)
+	unsigned long flag, unsigned long pgoff, int fd)
 {
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
@@ -331,7 +331,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
-				    &populate, &uf);
+				    &populate, &uf, fd);
 		up_write(&mm->mmap_sem);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
@@ -349,7 +349,8 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(offset)))
 		return -EINVAL;
 
-	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
+	return vm_mmap_pgoff(file, addr, len, prot, flag,
+			offset >> PAGE_SHIFT, -1);
 }
 EXPORT_SYMBOL(vm_mmap);
 


^ permalink raw reply related	[flat|nested] 158+ messages in thread
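
To make the plumbing concrete: fasync_helper() is keyed by the
descriptor number, so the struct file alone is not enough to arm SIGIO
delivery. A hedged sketch of a filesystem-side consumer (the lease
attachment is elided and the function name is a placeholder; the real
consumer is wired up in the xfs patches of this series):

static int example_mmap_validate(struct file *file,
		struct vm_area_struct *vma, unsigned long map_flags, int fd)
{
	struct fasync_struct *fa = NULL;
	int rc;

	/* needs the fd, not just the struct file */
	rc = fasync_helper(fd, file, 1, &fa);
	if (rc < 0)
		return rc;

	/* ... attach @fa to an FL_LAYOUT lease so breaks raise SIGIO ... */

	return generic_file_mmap(file, vma);
}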

* [PATCH v7 03/12] fs: introduce i_mapdcount
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

When ->iomap_begin() sees a non-zero i_mapdcount and determines that
the block map of the file needs to be modified to satisfy the I/O
request, it will instead return an error. This is needed for MAP_DIRECT
where, due to locking constraints, we can't rely on xfs_break_layouts()
to protect against allocating write-faults, either from the process that
set up the MAP_DIRECT mapping or from other processes that have the file
mapped. xfs_break_layouts() requires XFS_IOLOCK, which is problematic to
mix with the XFS_MMAPLOCK in the fault path.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_iomap.c |    9 +++++++++
 include/linux/fs.h |   31 +++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index a1909bc064e9..6816f8ebbdcf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1053,6 +1053,15 @@ xfs_file_iomap_begin(
 			goto out_unlock;
 		}
 		/*
+		 * If a file has MAP_DIRECT mappings, disable block map
+		 * updates. This should only affect mmap write faults, as
+		 * other paths are protected by an FL_LAYOUT lease.
+		 */
+		if (i_mapdcount_read(inode)) {
+			error = -ETXTBSY;
+			goto out_unlock;
+		}
+		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
 		 * with the work writeback does. This is a completely arbitrary
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2b9bf3dc4e9..f83871b188ff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -642,6 +642,9 @@ struct inode {
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
+#ifdef CONFIG_FS_DAX
+	atomic_t		i_mapdcount;	/* count of MAP_DIRECT vmas */
+#endif
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
@@ -2784,6 +2787,34 @@ static inline void i_readcount_inc(struct inode *inode)
 	return;
 }
 #endif
+
+#ifdef CONFIG_FS_DAX
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+	BUG_ON(!atomic_read(&inode->i_mapdcount));
+	atomic_dec(&inode->i_mapdcount);
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+	atomic_inc(&inode->i_mapdcount);
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+	return atomic_read(&inode->i_mapdcount);
+}
+#else
+static inline void i_mapdcount_dec(struct inode *inode)
+{
+}
+static inline void i_mapdcount_inc(struct inode *inode)
+{
+}
+static inline int i_mapdcount_read(struct inode *inode)
+{
+	return 0;
+}
+#endif
+
 extern int do_pipe_flags(int *, int);
 
 #define __kernel_read_file_id(id) \


^ permalink raw reply related	[flat|nested] 158+ messages in thread
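
For reference, a hedged sketch of the intended calling convention for
these helpers; the function names below are placeholders for the
MAP_DIRECT setup and teardown paths added later in the series:

static int example_begin_map_direct(struct inode *inode)
{
	/* pin the block map: allocating write-faults that reach
	 * ->iomap_begin() now fail with -ETXTBSY */
	i_mapdcount_inc(inode);
	return 0;
}

static void example_end_map_direct(struct inode *inode)
{
	/* drop the pin; block map updates may proceed again */
	i_mapdcount_dec(inode);
}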

* [PATCH v7 04/12] fs: MAP_DIRECT core
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

Introduce a set of helper APIs for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate, it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pnfs case, MAP_DIRECT does its own timeout of the lease, since
we need a process context for running map_direct_invalidate().
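
For example, a filesystem vma holding one of these leases would keep
the map_direct_state live across fork/split and teardown via its
vm_operations, roughly as in the sketch below. The vm_ops names and
the use of vm_private_data are placeholders; the real consumer is
wired up in the xfs patches of this series:

static void example_mdirect_vm_open(struct vm_area_struct *vma)
{
	struct map_direct_state *mds = vma->vm_private_data;

	get_map_direct_vma(mds);	/* new vma reference on fork/split */
}

static void example_mdirect_vm_close(struct vm_area_struct *vma)
{
	struct map_direct_state *mds = vma->vm_private_data;

	put_map_direct_vma(mds);	/* final put unlocks the lease */
}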

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Makefile               |    2 
 fs/mapdirect.c            |  232 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   45 +++++++++
 3 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index 000000000000..9ac7c1d946a2
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,232 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/mapdirect.h>
+#include <linux/workqueue.h>
+#include <linux/signal.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+	atomic_t mds_ref;
+	atomic_t mds_vmaref;
+	unsigned long mds_state;
+	struct inode *mds_inode;
+	struct delayed_work mds_work;
+	struct fasync_struct *mds_fa;
+	struct vm_area_struct *mds_vma;
+};
+
+bool is_map_direct_valid(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+	if (!atomic_dec_and_test(&mds->mds_ref))
+		return;
+	kfree(mds);
+}
+
+int put_map_direct_vma(struct map_direct_state *mds)
+{
+	struct vm_area_struct *vma = mds->mds_vma;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	void *owner = mds;
+
+	if (!atomic_dec_and_test(&mds->mds_vmaref))
+		return 0;
+
+	/*
+	 * Flush in-flight+forced lm_break events that may be
+	 * referencing this dying vma.
+	 */
+	mds->mds_vma = NULL;
+	set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+	flush_delayed_work(&mds->mds_work);
+	iput(inode);
+
+	put_map_direct(mds);
+	return 1;
+}
+EXPORT_SYMBOL_GPL(put_map_direct_vma);
+
+void get_map_direct_vma(struct map_direct_state *mds)
+{
+	atomic_inc(&mds->mds_vmaref);
+}
+EXPORT_SYMBOL_GPL(get_map_direct_vma);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+	struct map_direct_state *mds;
+	struct vm_area_struct *vma;
+	struct inode *inode;
+	void *owner;
+
+	mds = container_of(work, typeof(*mds), mds_work.work);
+
+	clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+	vma = ACCESS_ONCE(mds->mds_vma);
+	inode = mds->mds_inode;
+	if (vma) {
+		unsigned long len = vma->vm_end - vma->vm_start;
+		loff_t start = (loff_t) vma->vm_pgoff * PAGE_SIZE;
+
+		unmap_mapping_range(inode->i_mapping, start, len, 1);
+		owner = mds;
+		vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+	}
+	/* if @vma is gone, put_map_direct_vma() has already dropped the lease */
+
+	put_map_direct(mds);
+}
+
+static bool map_direct_lm_break(struct file_lock *fl)
+{
+	struct map_direct_state *mds = fl->fl_owner;
+
+	/*
+	 * Given that we need to take sleeping locks to invalidate the
+	 * mapping we schedule that work with the original timeout set
+	 * by the file-locks core. Then we tell the core to hold off on
+	 * continuing with the lease break until the delayed work
+	 * completes the invalidation and the lease unlock.
+	 *
+	 * Note that this assumes that i_mapdcount is protecting against
+	 * block-map modifying write-faults since we are unable to use
+	 * leases in that path due to locking constraints.
+	 */
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &mds->mds_state)) {
+		schedule_delayed_work(&mds->mds_work, lease_break_time * HZ);
+		kill_fasync(&fl->fl_fasync, SIGIO, POLL_MSG);
+	}
+
+	/* Tell the core lease code to wait for delayed work completion */
+	fl->fl_break_time = 0;
+
+	return false;
+}
+
+static int map_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	struct map_direct_state *mds = fl->fl_owner;
+
+	WARN_ON(!(arg & F_UNLCK));
+
+	i_mapdcount_dec(mds->mds_inode);
+	return lease_modify(fl, arg, dispose);
+}
+
+static void map_direct_lm_setup(struct file_lock *fl, void **priv)
+{
+	struct file *file = fl->fl_file;
+	struct map_direct_state *mds = *priv;
+	struct fasync_struct *fa = mds->mds_fa;
+
+	/*
+	 * Comment copied from lease_setup():
+	 * fasync_insert_entry() returns the old entry if any. If there was no
+	 * old entry, then it used "priv" and inserted it into the fasync list.
+	 * Clear the pointer to indicate that it shouldn't be freed.
+	 */
+	if (!fasync_insert_entry(fa->fa_fd, file, &fl->fl_fasync, fa))
+		*priv = NULL;
+
+	__f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+}
+
+static const struct lock_manager_operations map_direct_lm_ops = {
+	.lm_break = map_direct_lm_break,
+	.lm_change = map_direct_lm_change,
+	.lm_setup = map_direct_lm_setup,
+};
+
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
+{
+	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct fasync_struct *fa;
+	struct file_lock *fl;
+	void *owner = mds;
+	int rc = -ENOMEM;
+
+	if (!mds)
+		return ERR_PTR(-ENOMEM);
+
+	mds->mds_vma = vma;
+	atomic_set(&mds->mds_ref, 1);
+	atomic_set(&mds->mds_vmaref, 1);
+	set_bit(MAPDIRECT_VALID, &mds->mds_state);
+	mds->mds_inode = inode;
+	ihold(inode);
+	INIT_DELAYED_WORK(&mds->mds_work, map_direct_invalidate);
+
+	fa = fasync_alloc();
+	if (!fa)
+		goto err_fasync_alloc;
+	mds->mds_fa = fa;
+	fa->fa_fd = fd;
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &map_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = mds;
+	atomic_inc(&mds->mds_ref);
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = mds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	i_mapdcount_inc(inode);
+	return mds;
+
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	/* if owner is NULL then the lease machinery is responsible for @fa */
+	if (owner)
+		fasync_free(fa);
+err_fasync_alloc:
+	iput(inode);
+	kfree(mds);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_register);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
new file mode 100644
index 000000000000..724e27d8615e
--- /dev/null
+++ b/include/linux/mapdirect.h
@@ -0,0 +1,45 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __MAPDIRECT_H__
+#define __MAPDIRECT_H__
+#include <linux/err.h>
+
+struct inode;
+struct work_struct;
+struct vm_area_struct;
+struct map_direct_state;
+
+#if IS_ENABLED(CONFIG_FS_DAX)
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
+int put_map_direct_vma(struct map_direct_state *mds);
+void get_map_direct_vma(struct map_direct_state *mds);
+bool is_map_direct_valid(struct map_direct_state *mds);
+#else
+static inline struct map_direct_state *map_direct_register(int fd,
+		struct vm_area_struct *vma)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline int put_map_direct_vma(struct map_direct_state *mds)
+{
+	return 0;
+}
+static inline void get_map_direct_vma(struct map_direct_state *mds)
+{
+}
+static inline bool is_map_direct_valid(struct map_direct_state *mds)
+{
+	return false;
+}
+#endif
+#endif /* __MAPDIRECT_H__ */
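
A sketch of how a filesystem might pair these helpers with its vma
operations (illustrative; the actual wiring lands in "xfs: wire up
MAP_DIRECT" later in this series, and stashing the state in
->vm_private_data is an assumption of this sketch):

	static void example_mdirect_open(struct vm_area_struct *vma)
	{
		/* vma duplicated by fork()/split: take a reference */
		get_map_direct_vma(vma->vm_private_data);
	}

	static void example_mdirect_close(struct vm_area_struct *vma)
	{
		/* final unmap drops the lease and flushes invalidation */
		put_map_direct_vma(vma->vm_private_data);
	}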


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
@ 2017-10-06 22:35   ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-rdma, linux-api, Dave Chinner,
	linux-xfs, linux-mm, linux-fsdevel, Christoph Hellwig

Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.
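
As the ASSERT in the helper documents, callers must already hold the
inode's IOLOCK; an illustrative caller pattern (not taken from this
patch):

	uint	iolock = XFS_IOLOCK_EXCL;
	int	error;

	xfs_ilock(ip, iolock);
	error = xfs_break_layouts(VFS_I(ip), &iolock);
	if (error) {
		xfs_iunlock(ip, iolock);
		return error;
	}
	/*
	 * note: xfs_break_layouts() may cycle the lock and leave
	 * *iolock upgraded to XFS_IOLOCK_EXCL
	 */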

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/Kconfig      |    4 ++++
 fs/xfs/Makefile     |    1 +
 fs/xfs/xfs_layout.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_layout.h |   13 +++++++++++++
 fs/xfs/xfs_pnfs.c   |   30 ------------------------------
 fs/xfs/xfs_pnfs.h   |   10 ++--------
 6 files changed, 62 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
 	  result in warnings.
 
 	  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+	def_bool y
+	depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)	+= xfs_layout.o
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index 000000000000..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include <linux/fs.h>
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where we aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+	while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
+		xfs_iunlock(ip, *iolock);
+		error = break_layout(inode, true);
+		*iolock = XFS_IOLOCK_EXCL;
+		xfs_ilock(ip, *iolock);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index 000000000000..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+	return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 2f2dc3c09ad0..8ec72220e73b 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -20,36 +20,6 @@
 #include "xfs_pnfs.h"
 
 /*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it during the write operation, where aren't concerned
- * about exposing unallocated blocks but just want to provide basic
- * synchronization between a local writer and pNFS clients.  mmap writes would
- * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
- */
-int
-xfs_break_layouts(
-	struct inode		*inode,
-	uint			*iolock)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
-
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
-	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
-		xfs_iunlock(ip, *iolock);
-		error = break_layout(inode, true);
-		*iolock = XFS_IOLOCK_EXCL;
-		xfs_ilock(ip, *iolock);
-	}
-
-	return error;
-}
-
-/*
  * Get a unique ID including its location so that the client can identify
  * the exported device.
  */
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index b587cb99b2b7..4135b2482697 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -1,19 +1,13 @@
 #ifndef _XFS_PNFS_H
 #define _XFS_PNFS_H 1
 
+#include "xfs_layout.h"
+
 #ifdef CONFIG_EXPORTFS_BLOCK_OPS
 int xfs_fs_get_uuid(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
 int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 		struct iomap *iomap, bool write, u32 *device_generation);
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
-
-int xfs_break_layouts(struct inode *inode, uint *iolock);
-#else
-static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
-{
-	return 0;
-}
 #endif /* CONFIG_EXPORTFS_BLOCK_OPS */
 #endif /* _XFS_PNFS_H */


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 06/12] xfs: wire up MAP_DIRECT
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Arnd Bergmann, Darrick J. Wong,
	linux-rdma, linux-api, Dave Chinner, linux-xfs, linux-mm,
	Alexander Viro, linux-fsdevel, Jeff Layton, Christoph Hellwig

MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

      * The file is not DAX capable
      * The file has reflinked (copy-on-write) blocks
      * The fault would trigger the filesystem to allocate blocks
      * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES In addition to the typical mmap(2) conditions that trigger
  EACCES, MAP_DIRECT also requires permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
         might require block-map updates, or the lease timed out and the
         kernel invalidated the mapping.
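
  To make the semantics concrete, a userspace sketch follows (not part
  of the patch). MAP_SHARED_VALIDATE (0x3) comes from patch 1 of this
  series and MAP_DIRECT (0x80000) from this patch; the local defines are
  a stopgap until libc headers carry them, and the write(2) loop
  reflects the fully-allocated requirement above, since fallocate(2)
  alone would leave unwritten extents whose conversion at fault time
  triggers SIGBUS:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MAP_SHARED_VALIDATE
	#define MAP_SHARED_VALIDATE 0x3
	#endif
	#ifndef MAP_DIRECT
	#define MAP_DIRECT 0x80000
	#endif

	int main(int argc, char **argv)
	{
		char buf[4096];
		size_t len = sizeof(buf) * 512, i;
		int fd;
		void *addr;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_CREAT | O_RDWR, 0600);
		if (fd < 0)
			return 1;
		/* fully allocate written extents before mapping */
		memset(buf, 0, sizeof(buf));
		for (i = 0; i < len; i += sizeof(buf))
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				return 1;
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap"); /* e.g. EOPNOTSUPP */
			return 1;
		}
		/* ... flush-from-userspace I/O; SIGBUS if the lease breaks ... */
		munmap(addr, len);
		close(fd);
		return 0;
	}
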

Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/Kconfig                  |    2 -
 fs/xfs/xfs_file.c               |  102 +++++++++++++++++++++++++++++++++++++++
 include/linux/mman.h            |    3 +
 include/uapi/asm-generic/mman.h |    1 
 4 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
 	def_bool y
-	depends on EXPORTFS_BLOCK_OPS
+	depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd2b261..e35518600e28 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,12 +40,22 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include <linux/mman.h>
 #include <linux/dcache.h>
 #include <linux/falloc.h>
 #include <linux/pagevec.h>
+#include <linux/mapdirect.h>
 #include <linux/backing-dev.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static inline bool
+is_xfs_map_direct(
+		struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1008,6 +1018,26 @@ xfs_file_llseek(
 	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+static int
+xfs_vma_checks(
+	struct vm_area_struct	*vma,
+	struct inode		*inode)
+{
+	if (!is_xfs_map_direct(vma))
+		return 0;
+
+	if (!is_map_direct_valid(vma->vm_private_data))
+		return VM_FAULT_SIGBUS;
+
+	if (xfs_is_reflink_inode(XFS_I(inode)))
+		return VM_FAULT_SIGBUS;
+
+	if (!IS_DAX(inode))
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1024,6 +1054,7 @@ __xfs_filemap_fault(
 	enum page_entry_size	pe_size,
 	bool			write_fault)
 {
+	struct vm_area_struct	*vma = vmf->vma;
 	struct inode		*inode = file_inode(vmf->vma->vm_file);
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
@@ -1032,10 +1063,14 @@ __xfs_filemap_fault(
 
 	if (write_fault) {
 		sb_start_pagefault(inode->i_sb);
-		file_update_time(vmf->vma->vm_file);
+		file_update_time(vma->vm_file);
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	ret = xfs_vma_checks(vma, inode);
+	if (ret)
+		goto out_unlock;
+
 	if (IS_DAX(inode)) {
 		ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
 	} else {
@@ -1044,6 +1079,8 @@ __xfs_filemap_fault(
 		else
 			ret = filemap_fault(vmf);
 	}
+
+out_unlock:
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (write_fault)
@@ -1115,6 +1152,31 @@ xfs_filemap_pfn_mkwrite(
 
 }
 
+static void
+xfs_filemap_direct_open(
+	struct vm_area_struct	*vma)
+{
+	get_map_direct_vma(vma->vm_private_data);
+}
+
+static void
+xfs_filemap_direct_close(
+	struct vm_area_struct	*vma)
+{
+	put_map_direct_vma(vma->vm_private_data);
+}
+
+static const struct vm_operations_struct xfs_file_vm_direct_ops = {
+	.fault		= xfs_filemap_fault,
+	.huge_fault	= xfs_filemap_huge_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
+
+	.open		= xfs_filemap_direct_open,
+	.close		= xfs_filemap_direct_close,
+};
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
@@ -1135,6 +1197,43 @@ xfs_file_mmap(
 	return 0;
 }
 
+#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT)
+
+STATIC int
+xfs_file_mmap_validate(
+	struct file		*filp,
+	struct vm_area_struct	*vma,
+	unsigned long		map_flags,
+	int			fd)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct map_direct_state	*mds;
+
+	if (map_flags & ~(XFS_MAP_SUPPORTED))
+		return -EOPNOTSUPP;
+
+	if ((map_flags & MAP_DIRECT) == 0)
+		return xfs_file_mmap(filp, vma);
+
+	file_accessed(filp);
+	vma->vm_ops = &xfs_file_vm_direct_ops;
+	if (IS_DAX(inode))
+		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+
+	mds = map_direct_register(fd, vma);
+	if (IS_ERR(mds))
+		return PTR_ERR(mds);
+
+	/* flush in-flight faults */
+	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+
+	vma->vm_private_data = mds;
+
+	return 0;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
@@ -1146,6 +1245,7 @@ const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.mmap_validate	= xfs_file_mmap_validate,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
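
For context, a hypothetical sketch of the core-mm dispatch added by the
->mmap_validate() patch earlier in this series (the exact plumbing lives
in that patch; this only illustrates why xfs_file_mmap_validate() above
receives the original mmap flags and the fd):

	static int example_call_mmap(struct file *file,
			struct vm_area_struct *vma,
			unsigned long map_flags, int fd)
	{
		/* prefer the richer hook when the filesystem provides one */
		if (file->f_op->mmap_validate)
			return file->f_op->mmap_validate(file, vma,
					map_flags, fd);
		if (file->f_op->mmap)
			return file->f_op->mmap(file, vma);
		return -ENODEV;
	}
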
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 94b63b4d71ff..fab393a9dda9 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -20,6 +20,9 @@
 #ifndef MAP_HUGE_1GB
 #define MAP_HUGE_1GB 0
 #endif
+#ifndef MAP_DIRECT
+#define MAP_DIRECT 0
+#endif
 #ifndef MAP_UNINITIALIZED
 #define MAP_UNINITIALIZED 0
 #endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 7162cd4cca73..c916f22008e0 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_DIRECT	0x80000		/* leased block map (layout) for DAX */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
 


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-06 22:35   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:35 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, Robin Murphy,
	linux-xfs, linux-mm, linux-api, linux-fsdevel, David Woodhouse,
	Christoph Hellwig, Marek Szyprowski

Add a helper to determine if the dma mappings set up for a given device
are backed by an iommu. In particular, this lets code paths know that a
dma_unmap operation will revoke access to memory if the device can not
otherwise be quiesced. This knowledge is needed to make RDMA transfers
to DAX mappings safe: if the DAX file's block map changes, we need to
be able to reliably stop accesses to blocks that have been freed or
re-assigned to a new file.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.
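
As a hypothetical illustration (the real consumer is the IB/core memory
registration patch later in this series, and example_register_dax() is
an invented name), a registration path would gate MAP_DIRECT support on
this helper:

	/* refuse registration when device access is not revocable */
	static int example_register_dax(struct device *dma_device)
	{
		if (!dma_has_iommu(dma_device))
			return -EOPNOTSUPP;
		/*
		 * Safe to proceed: on lease break, dma_unmap will fence
		 * the device even if it cannot otherwise be quiesced.
		 */
		return 0;
	}
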

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/base/dma-mapping.c  |   10 ++++++++++
 drivers/iommu/amd_iommu.c   |    6 ++++++
 drivers/iommu/intel-iommu.c |    6 ++++++
 include/linux/dma-mapping.h |    3 +++
 4 files changed, 25 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..e1b5f103d90e 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+bool dma_has_iommu(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (ops && ops->has_iommu)
+		return ops->has_iommu(dev);
+	return false;
+}
+EXPORT_SYMBOL(dma_has_iommu);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..873f899fcf57 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,11 @@ static struct protection_domain *get_domain(struct device *dev)
 	return domain;
 }
 
+static bool amd_dma_has_iommu(struct device *dev)
+{
+	return !IS_ERR(get_domain(dev));
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
 	struct iommu_dev_data *dev_data;
@@ -2689,6 +2694,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
 	.unmap_sg	= unmap_sg,
 	.dma_supported	= amd_iommu_dma_supported,
 	.mapping_error	= amd_iommu_mapping_error,
+	.has_iommu	= amd_dma_has_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..243ef42fdad4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,11 @@ static int iommu_no_mapping(struct device *dev)
 	return 0;
 }
 
+static bool intel_dma_has_iommu(struct device *dev)
+{
+	return !iommu_no_mapping(dev);
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 				     size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3877,7 @@ const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
+	.has_iommu = intel_dma_has_iommu,
 #ifdef CONFIG_X86
 	.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..659f122c18f5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
 				   enum dma_data_direction dir);
 	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
 	int (*dma_supported)(struct device *dev, u64 mask);
+	bool (*has_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
 	u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern bool dma_has_iommu(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 					      size_t size,
 					      enum dma_data_direction dir,


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct()
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block map of a DAX file changes. That is, since DAX gives
direct access to filesystem blocks, we can not allow those blocks to
move or change state while they are under active RDMA. So, if the
filesystem determines it needs to move blocks, it can revoke device
access before proceeding.
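
To make the contract concrete, here is a hypothetical sketch of a
->lease_direct() subscriber; everything except the vm operation itself
and map_direct_lease_destroy() is an invented stand-in for the IB/core
usage later in this series:

	/* invented registration context for a device mapping */
	struct example_reg {
		struct lease_direct *ld;
		/* ... device / DMA state ... */
	};

	/* called when the filesystem must change the block map */
	static void example_break_fn(void *owner)
	{
		struct example_reg *reg = owner;

		/* quiesce and unmap device access for @reg here */
		(void)reg;
	}

	static int example_subscribe(struct vm_area_struct *vma,
			struct example_reg *reg)
	{
		if (!vma->vm_ops || !vma->vm_ops->lease_direct)
			return -EOPNOTSUPP;
		reg->ld = vma->vm_ops->lease_direct(vma, example_break_fn,
				reg);
		if (IS_ERR(reg->ld))
			return PTR_ERR(reg->ld);
		return 0;
	}

	/* on teardown, drop the lease and wait for any break work */
	static void example_unsubscribe(struct example_reg *reg)
	{
		map_direct_lease_destroy(reg->ld);
	}
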

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/mapdirect.c            |  117 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   23 +++++++++
 include/linux/mm.h        |    6 ++
 3 files changed, 146 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9ac7c1d946a2..338cbe055fc7 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 
@@ -32,12 +33,26 @@ struct map_direct_state {
 	struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+	void *lds_owner;
+	struct file *lds_file;
+	unsigned long lds_state;
+	void (*lds_break_fn)(void *lds_owner);
+	struct work_struct lds_work;
+};
+
 bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(is_map_direct_valid);
 
+bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_broken);
+
 static void put_map_direct(struct map_direct_state *mds)
 {
 	if (!atomic_dec_and_test(&mds->mds_ref))
@@ -162,6 +177,108 @@ static const struct lock_manager_operations map_direct_lm_ops = {
 	.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+	struct lease_direct_state *lds;
+	void *owner;
+
+	lds = container_of(work, typeof(*lds), lds_work);
+	owner = lds;
+	lds->lds_break_fn(lds->lds_owner);
+	vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+	struct lease_direct_state *lds = fl->fl_owner;
+
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+		schedule_work(&lds->lds_work);
+	return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	WARN_ON(!(arg & F_UNLCK));
+	return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+	.lm_break = lease_direct_lm_break,
+	.lm_change = lease_direct_lm_change,
+};
+
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*lds_break_fn)(void *), void *lds_owner)
+{
+	struct file *file = vma->vm_file;
+	struct lease_direct_state *lds;
+	struct lease_direct *ld;
+	struct file_lock *fl;
+	int rc = -ENOMEM;
+	void *owner;
+
+	ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+	if (!ld)
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&ld->list);
+	lds = (struct lease_direct_state *)(ld + 1);
+	owner = lds;
+	ld->lds = lds;
+	lds->lds_break_fn = lds_break_fn;
+	lds->lds_owner = lds_owner;
+	INIT_WORK(&lds->lds_work, lease_direct_invalidate);
+	lds->lds_file = get_file(file);
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &lease_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = lds;
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = lds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	return ld;
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	kfree(ld);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease);
+
+void map_direct_lease_destroy(struct lease_direct *ld)
+{
+	struct lease_direct_state *lds = ld->lds;
+	struct file *file = lds->lds_file;
+	void *owner = lds;
+
+	vfs_setlease(file, F_UNLCK, NULL, &owner);
+	flush_work(&lds->lds_work);
+	fput(file);
+	WARN_ON(!list_empty(&ld->list));
+	kfree(ld);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease_destroy);
+
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
 {
 	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index 724e27d8615e..dc4d4ba677d0 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -13,17 +13,28 @@
 #ifndef __MAPDIRECT_H__
 #define __MAPDIRECT_H__
 #include <linux/err.h>
+#include <linux/list.h>
 
 struct inode;
 struct work_struct;
 struct vm_area_struct;
 struct map_direct_state;
+struct lease_direct_state;
+
+struct lease_direct {
+	struct list_head list;
+	struct lease_direct_state *lds;
+};
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 int put_map_direct_vma(struct map_direct_state *mds);
 void get_map_direct_vma(struct map_direct_state *mds);
 bool is_map_direct_valid(struct map_direct_state *mds);
+bool is_map_direct_broken(struct map_direct_state *mds);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner);
+void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
 		struct vm_area_struct *vma)
@@ -41,5 +52,17 @@ bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return false;
 }
+static inline bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return false;
+}
+static inline struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline void map_direct_lease_destroy(struct lease_direct *ld)
+{
+}
 #endif
 #endif /* __MAPDIRECT_H__ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afa19feb755..d03953f91ce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -420,6 +420,12 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * Called by rdma memory registration to subscribe for "break"
+	 * events that require any ongoing rdma accesses to quiesce.
+	 */
+	struct lease_direct *(*lease_direct)(struct vm_area_struct *vma,
+			void (*break_fn)(void *), void *owner);
 };
 
 struct mmu_gather;


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct()
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw
  Cc: linux-xfs-u79uwXL29TY76Z2rM5mHXA, Jan Kara, Darrick J. Wong,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	Christoph Hellwig, J. Bruce Fields,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jeff Layton, Ross Zwisler

Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block-map of a DAX-file changes. I.e. since DAX gives
direct-access to filesystem blocks we can not allow those blocks to move
or change state while they are under active RDMA. So, if the filesystem
determines it needs to move blocks it can revoke device access before
proceeding.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org>
Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/mapdirect.c            |  117 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   23 +++++++++
 include/linux/mm.h        |    6 ++
 3 files changed, 146 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9ac7c1d946a2..338cbe055fc7 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 
@@ -32,12 +33,26 @@ struct map_direct_state {
 	struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+	void *lds_owner;
+	struct file *lds_file;
+	unsigned long lds_state;
+	void (*lds_break_fn)(void *lds_owner);
+	struct work_struct lds_work;
+};
+
 bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(is_map_direct_valid);
 
+bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_broken);
+
 static void put_map_direct(struct map_direct_state *mds)
 {
 	if (!atomic_dec_and_test(&mds->mds_ref))
@@ -162,6 +177,108 @@ static const struct lock_manager_operations map_direct_lm_ops = {
 	.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+	struct lease_direct_state *lds;
+	void *owner;
+
+	lds = container_of(work, typeof(*lds), lds_work);
+	owner = lds;
+	lds->lds_break_fn(lds->lds_owner);
+	vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+	struct lease_direct_state *lds = fl->fl_owner;
+
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+		schedule_work(&lds->lds_work);
+	return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	WARN_ON(!(arg & F_UNLCK));
+	return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+	.lm_break = lease_direct_lm_break,
+	.lm_change = lease_direct_lm_change,
+};
+
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*lds_break_fn)(void *), void *lds_owner)
+{
+	struct file *file = vma->vm_file;
+	struct lease_direct_state *lds;
+	struct lease_direct *ld;
+	struct file_lock *fl;
+	int rc = -ENOMEM;
+	void *owner;
+
+	ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+	if (!ld)
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&ld->list);
+	lds = (struct lease_direct_state *)(ld + 1);
+	owner = lds;
+	ld->lds = lds;
+	lds->lds_break_fn = lds_break_fn;
+	lds->lds_owner = lds_owner;
+	INIT_WORK(&lds->lds_work, lease_direct_invalidate);
+	lds->lds_file = get_file(file);
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &lease_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = lds;
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = lds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	return ld;
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	kfree(lds);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease);
+
+void map_direct_lease_destroy(struct lease_direct *ld)
+{
+	struct lease_direct_state *lds = ld->lds;
+	struct file *file = lds->lds_file;
+	void *owner = lds;
+
+	vfs_setlease(file, F_UNLCK, NULL, &owner);
+	flush_work(&lds->lds_work);
+	fput(file);
+	WARN_ON(!list_empty(&ld->list));
+	kfree(ld);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease_destroy);
+
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
 {
 	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index 724e27d8615e..dc4d4ba677d0 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -13,17 +13,28 @@
 #ifndef __MAPDIRECT_H__
 #define __MAPDIRECT_H__
 #include <linux/err.h>
+#include <linux/list.h>
 
 struct inode;
 struct work_struct;
 struct vm_area_struct;
 struct map_direct_state;
+struct list_direct_state;
+
+struct lease_direct {
+	struct list_head list;
+	struct lease_direct_state *lds;
+};
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 int put_map_direct_vma(struct map_direct_state *mds);
 void get_map_direct_vma(struct map_direct_state *mds);
 bool is_map_direct_valid(struct map_direct_state *mds);
+bool is_map_direct_broken(struct map_direct_state *mds);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner);
+void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
 		struct vm_area_struct *vma)
@@ -41,5 +52,17 @@ bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return false;
 }
+bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return false;
+}
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+void map_direct_lease_destroy(struct lease_direct *ld)
+{
+}
 #endif
 #endif /* __MAPDIRECT_H__ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afa19feb755..d03953f91ce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -420,6 +420,12 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * Called by rdma memory registration to subscribe for "break"
+	 * events that require any ongoing rdma accesses to quiesce.
+	 */
+	struct lease_direct *(*lease_direct)(struct vm_area_struct *vma,
+			void (*break_fn)(void *), void *owner);
 };
 
 struct mmu_gather;

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct()
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma, linux-api,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	Jeff Moyer, linux-fsdevel, Jeff Layton, Ross Zwisler

Provide a vma operation that registers a lease that is broken by
break_layout(). This is motivated by a need to stop in-progress RDMA
when the block-map of a DAX-file changes. I.e. since DAX gives
direct-access to filesystem blocks we can not allow those blocks to move
or change state while they are under active RDMA. So, if the filesystem
determines it needs to move blocks it can revoke device access before
proceeding.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/mapdirect.c            |  117 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   23 +++++++++
 include/linux/mm.h        |    6 ++
 3 files changed, 146 insertions(+)

diff --git a/fs/mapdirect.c b/fs/mapdirect.c
index 9ac7c1d946a2..338cbe055fc7 100644
--- a/fs/mapdirect.c
+++ b/fs/mapdirect.c
@@ -16,6 +16,7 @@
 #include <linux/mutex.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 
@@ -32,12 +33,26 @@ struct map_direct_state {
 	struct vm_area_struct *mds_vma;
 };
 
+struct lease_direct_state {
+	void *lds_owner;
+	struct file *lds_file;
+	unsigned long lds_state;
+	void (*lds_break_fn)(void *lds_owner);
+	struct work_struct lds_work;
+};
+
 bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
 }
 EXPORT_SYMBOL_GPL(is_map_direct_valid);
 
+bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_BREAK, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(is_map_direct_broken);
+
 static void put_map_direct(struct map_direct_state *mds)
 {
 	if (!atomic_dec_and_test(&mds->mds_ref))
@@ -162,6 +177,108 @@ static const struct lock_manager_operations map_direct_lm_ops = {
 	.lm_setup = map_direct_lm_setup,
 };
 
+static void lease_direct_invalidate(struct work_struct *work)
+{
+	struct lease_direct_state *lds;
+	void *owner;
+
+	lds = container_of(work, typeof(*lds), lds_work);
+	owner = lds;
+	lds->lds_break_fn(lds->lds_owner);
+	vfs_setlease(lds->lds_file, F_UNLCK, NULL, &owner);
+}
+
+static bool lease_direct_lm_break(struct file_lock *fl)
+{
+	struct lease_direct_state *lds = fl->fl_owner;
+
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &lds->lds_state))
+		schedule_work(&lds->lds_work);
+	return false;
+}
+
+static int lease_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	WARN_ON(!(arg & F_UNLCK));
+	return lease_modify(fl, arg, dispose);
+}
+
+static const struct lock_manager_operations lease_direct_lm_ops = {
+	.lm_break = lease_direct_lm_break,
+	.lm_change = lease_direct_lm_change,
+};
+
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*lds_break_fn)(void *), void *lds_owner)
+{
+	struct file *file = vma->vm_file;
+	struct lease_direct_state *lds;
+	struct lease_direct *ld;
+	struct file_lock *fl;
+	int rc = -ENOMEM;
+	void *owner;
+
+	ld = kzalloc(sizeof(*ld) + sizeof(*lds), GFP_KERNEL);
+	if (!ld)
+		return ERR_PTR(-ENOMEM);
+	INIT_LIST_HEAD(&ld->list);
+	lds = (struct lease_direct_state *)(ld + 1);
+	owner = lds;
+	ld->lds = lds;
+	lds->lds_break_fn = lds_break_fn;
+	lds->lds_owner = lds_owner;
+	INIT_WORK(&lds->lds_work, lease_direct_invalidate);
+	lds->lds_file = get_file(file);
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &lease_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = lds;
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = lds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	return ld;
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	kfree(ld);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease);
+
+void map_direct_lease_destroy(struct lease_direct *ld)
+{
+	struct lease_direct_state *lds = ld->lds;
+	struct file *file = lds->lds_file;
+	void *owner = lds;
+
+	vfs_setlease(file, F_UNLCK, NULL, &owner);
+	flush_work(&lds->lds_work);
+	fput(file);
+	WARN_ON(!list_empty(&ld->list));
+	kfree(ld);
+}
+EXPORT_SYMBOL_GPL(map_direct_lease_destroy);
+
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
 {
 	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index 724e27d8615e..dc4d4ba677d0 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -13,17 +13,28 @@
 #ifndef __MAPDIRECT_H__
 #define __MAPDIRECT_H__
 #include <linux/err.h>
+#include <linux/list.h>
 
 struct inode;
 struct work_struct;
 struct vm_area_struct;
 struct map_direct_state;
+struct lease_direct_state;
+
+struct lease_direct {
+	struct list_head list;
+	struct lease_direct_state *lds;
+};
 
 #if IS_ENABLED(CONFIG_FS_DAX)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 int put_map_direct_vma(struct map_direct_state *mds);
 void get_map_direct_vma(struct map_direct_state *mds);
 bool is_map_direct_valid(struct map_direct_state *mds);
+bool is_map_direct_broken(struct map_direct_state *mds);
+struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner);
+void map_direct_lease_destroy(struct lease_direct *ld);
 #else
 static inline struct map_direct_state *map_direct_register(int fd,
 		struct vm_area_struct *vma)
@@ -41,5 +52,17 @@ bool is_map_direct_valid(struct map_direct_state *mds)
 {
 	return false;
 }
+static inline bool is_map_direct_broken(struct map_direct_state *mds)
+{
+	return false;
+}
+static inline struct lease_direct *map_direct_lease(struct vm_area_struct *vma,
+		void (*ld_break_fn)(void *), void *ld_owner)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline void map_direct_lease_destroy(struct lease_direct *ld)
+{
+}
 #endif
 #endif /* __MAPDIRECT_H__ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afa19feb755..d03953f91ce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -420,6 +420,12 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * Called by RDMA memory registration to subscribe to "break"
+	 * events that require any ongoing RDMA accesses to quiesce.
+	 */
+	struct lease_direct *(*lease_direct)(struct vm_area_struct *vma,
+			void (*break_fn)(void *), void *owner);
 };
 
 struct mmu_gather;


* [PATCH v7 09/12] xfs: wire up ->lease_direct()
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
mapping established. For xfs we establish a new lease and then check
whether the MAP_DIRECT mapping has been broken. We want to be sure that
the process will receive notification that the MAP_DIRECT mapping is
being torn down so it knows why other code paths are returning failures.

For example, in the RDMA/ibverbs case we want ibv_reg_mr() to fail if
the MAP_DIRECT mapping is invalid or in the process of being
invalidated.
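
For reference, an editor's sketch of the userspace flow being guarded
(not part of the patch; MAP_DIRECT and MAP_SHARED_VALIDATE are the new
flags from this series, and fd, len, and pd are assumed to be set up
elsewhere):

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED)
		err(1, "mmap");

	struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		err(1, "ibv_reg_mr"); /* fails once the lease is breaking */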

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e35518600e28..823b65f17429 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1166,6 +1166,33 @@ xfs_filemap_direct_close(
 	put_map_direct_vma(vma->vm_private_data);
 }
 
+static struct lease_direct *
+xfs_filemap_direct_lease(
+	struct vm_area_struct	*vma,
+	void			(*break_fn)(void *),
+	void			*owner)
+{
+	struct lease_direct	*ld;
+
+	ld = map_direct_lease(vma, break_fn, owner);
+
+	if (IS_ERR(ld))
+		return ld;
+
+	/*
+	 * If we established this lease while the base MAP_DIRECT lease
+	 * was not yet broken, then we know the "lease holder" will
+	 * receive a SIGIO notification when the lease is broken and can
+	 * take any necessary cleanup actions.
+	 */
+	if (!is_map_direct_broken(vma->vm_private_data))
+		return ld;
+
+	map_direct_lease_destroy(ld);
+
+	return ERR_PTR(-ENXIO);
+}
+
 static const struct vm_operations_struct xfs_file_vm_direct_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
@@ -1175,6 +1202,7 @@ static const struct vm_operations_struct xfs_file_vm_direct_ops = {
 
 	.open		= xfs_filemap_direct_open,
 	.close		= xfs_filemap_direct_close,
+	.lease_direct	= xfs_filemap_direct_lease,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {


* [PATCH v7 10/12] device-dax: wire up ->lease_direct()
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, linux-rdma, linux-api, linux-xfs, linux-mm,
	linux-fsdevel, Christoph Hellwig

The only event that will break a lease_direct lease in the device-dax
case is the device-shutdown path, where the physical pages may be
reassigned to another device.
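
As a reading aid, an editor's annotation of the teardown ordering in
the change below (the comments are mine, not in the patch): the layout
break must complete before the mappings are zapped so that lease
holders can quiesce DMA first.

	kill_dax(dax_dev);		/* fail new device-dax operations */
	break_layout(inode, true);	/* notify lease holders and wait */
	unmap_mapping_range(inode->i_mapping, 0, 0, 1); /* then zap ptes */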

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c      |    4 ++++
 fs/Kconfig                |    4 ++++
 fs/Makefile               |    3 ++-
 include/linux/mapdirect.h |    2 +-
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e9f3b3e4bbf4..fa75004185c4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -10,6 +10,7 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/mapdirect.h>
 #include <linux/pagemap.h>
 #include <linux/module.h>
 #include <linux/device.h>
@@ -430,6 +431,7 @@ static int dev_dax_fault(struct vm_fault *vmf)
 static const struct vm_operations_struct dax_vm_ops = {
 	.fault = dev_dax_fault,
 	.huge_fault = dev_dax_huge_fault,
+	.lease_direct = map_direct_lease,
 };
 
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
@@ -540,8 +542,10 @@ static void kill_dev_dax(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct inode *inode = dax_inode(dax_dev);
+	const bool wait = true;
 
 	kill_dax(dax_dev);
+	break_layout(inode, wait);
 	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 }
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..2e3784ae1bc4 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,10 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+config DAX_MAP_DIRECT
+	bool
+	default FS_DAX || DEV_DAX
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index c0e791d235d8..21b8fb104656 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,8 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
+obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_DAX_MAP_DIRECT)	+= mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
index dc4d4ba677d0..bafa78a6085f 100644
--- a/include/linux/mapdirect.h
+++ b/include/linux/mapdirect.h
@@ -26,7 +26,7 @@ struct lease_direct {
 	struct lease_direct_state *lds;
 };
 
-#if IS_ENABLED(CONFIG_FS_DAX)
+#if IS_ENABLED(CONFIG_DAX_MAP_DIRECT)
 struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
 int put_map_direct_vma(struct map_direct_state *mds);
 void get_map_direct_vma(struct map_direct_state *mds);


* [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, Doug Ledford, Hal Rosenstock, linux-xfs,
	linux-mm, Jeff Layton, linux-fsdevel, Sean Hefty,
	Christoph Hellwig

Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, this means that RDMA to
a given file offset can dynamically switch to another file offset,
another file, or free space with no notification to the RDMA device to
cease operations. Historically, this lack of communication between the
ibverbs core and the filesystem was not a problem because RDMA always
targeted dynamically allocated page cache, so at least the RDMA device
would have valid memory to target even if the file was being modified.
With DAX we need to add coordination since RDMA bypasses the page cache
and goes directly to the on-media pages of the file. RDMA to DAX can
cause damage if filesystem blocks move or change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel cannot be put in a position
where it must wait indefinitely for userspace to stop a device, we need
a mechanism where the kernel can force-revoke access. Towards that end,
use the new dma_has_iommu() helper to determine if ib_dma_unmap_sg() is
sufficient for revoking access. Once we have that assurance and a
->lease_direct() lease we can safely allow RDMA to DAX.
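
One detail worth calling out: the lease-break callback can race
__ib_umem_release(), and both paths want to perform the single
ib_dma_unmap_sg(). The atomic IB_UMEM_MAPPED bit ensures exactly one of
them wins, as in this editor's sketch of the shared logic (the helper
name is hypothetical):

	static void umem_unmap_once(struct ib_device *dev, struct ib_umem *umem)
	{
		/* test_and_clear_bit() is atomic: only the first caller unmaps */
		if (umem->nmap > 0 &&
				test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
			ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages,
					DMA_BIDIRECTIONAL);
	}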

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |   90 ++++++++++++++++++++++++++++++++++------
 include/rdma/ib_umem.h         |    8 ++++
 2 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..dc3ae1bee669 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/mapdirect.h>
 #include <linux/export.h>
 #include <linux/hugetlb.h>
 #include <linux/slab.h>
@@ -46,11 +47,12 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+	struct lease_direct *ld, *_ld;
 	struct scatterlist *sg;
 	struct page *page;
 	int i;
 
-	if (umem->nmap > 0)
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
 				umem->npages,
 				DMA_BIDIRECTIONAL);
@@ -64,8 +66,22 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	}
 
 	sg_free_table(&umem->sg_head);
-	return;
 
+	list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+		list_del_init(&ld->list);
+		map_direct_lease_destroy(ld);
+	}
+}
+
+static void ib_umem_lease_break(void *__umem)
+{
+	struct ib_umem *umem = __umem;
+	struct ib_device *dev = umem->context->device;
+
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
+		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
+				umem->npages,
+				DMA_BIDIRECTIONAL);
 }
 
 /**
@@ -96,7 +112,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	struct scatterlist *sg, *sg_list_start;
 	int need_release = 0;
 	unsigned int gup_flags = FOLL_WRITE;
+	struct vm_area_struct *vma_prev = NULL;
+	struct device *dma_dev;
 
+	dma_dev = context->device->dma_device;
 	if (dmasync)
 		dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +139,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->address    = addr;
 	umem->page_shift = PAGE_SHIFT;
 	umem->pid	 = get_task_pid(current, PIDTYPE_PID);
+	INIT_LIST_HEAD(&umem->leases);
+	set_bit(IB_UMEM_MAPPED, &umem->state);
 	/*
 	 * We ask for writable memory if any of the following
 	 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to set up a ->lease_direct()
+	 * lease to protect against file modifications, otherwise we can
+	 * tolerate a failure to allocate the vma_list and just assume
+	 * that all vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +222,50 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			const struct vm_operations_struct *vm_ops;
+			struct vm_area_struct *vma;
+			struct lease_direct *ld;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (vma == vma_prev)
+				continue;
+			vma_prev = vma;
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			vm_ops = vma->vm_ops;
+			if (!vm_ops->lease_direct) {
+				dev_info(dma_dev, "DAX-RDMA needs lease_direct\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+
+			if (!dma_has_iommu(dma_dev)) {
+				dev_info(dma_dev, "DAX-RDMA needs iommu\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			ld = vm_ops->lease_direct(vma, ib_umem_lease_break,
+					umem);
+			if (IS_ERR(ld)) {
+				ret = PTR_ERR(ld);
+				goto out;
+			}
+			list_add(&ld->list, &umem->leases);
 		}
 
 		/* preparing for next loop */
@@ -242,6 +300,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 23159dd5be18..796ffe5b8dc3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,7 @@
 struct ib_ucontext;
 struct ib_umem_odp;
 
+#define IB_UMEM_MAPPED 0
 struct ib_umem {
 	struct ib_ucontext     *context;
 	size_t			length;
@@ -55,6 +56,13 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;
+	/*
+	 * Note: no lock protects this list since we assume memory
+	 * registration never races unregistration for a given ib_umem
+	 * instance.
+	 */
+	struct list_head	leases;
+	unsigned long		state;
 };
 
 /* Returns the offset of the umem start relative to the first page. */

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Sean Hefty, linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, Jeff Moyer, Christoph Hellwig,
	J. Bruce Fields, linux-mm, Doug Ledford, Ross Zwisler,
	linux-fsdevel, Jeff Layton, Hal Rosenstock

Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
the new dma_has_iommu() helper to determine if ib_dma_unmap_sg() is
sufficient for revoking access. Once we have that assurance and a
->lease_direct() lease we can safely allow RDMA to DAX.

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |   90 ++++++++++++++++++++++++++++++++++------
 include/rdma/ib_umem.h         |    8 ++++
 2 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..dc3ae1bee669 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/mapdirect.h>
 #include <linux/export.h>
 #include <linux/hugetlb.h>
 #include <linux/slab.h>
@@ -46,11 +47,12 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+	struct lease_direct *ld, *_ld;
 	struct scatterlist *sg;
 	struct page *page;
 	int i;
 
-	if (umem->nmap > 0)
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
 				umem->npages,
 				DMA_BIDIRECTIONAL);
@@ -64,8 +66,22 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	}
 
 	sg_free_table(&umem->sg_head);
-	return;
 
+	list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+		list_del_init(&ld->list);
+		map_direct_lease_destroy(ld);
+	}
+}
+
+static void ib_umem_lease_break(void *__umem)
+{
+	struct ib_umem *umem = umem;
+	struct ib_device *dev = umem->context->device;
+
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
+		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
+				umem->npages,
+				DMA_BIDIRECTIONAL);
 }
 
 /**
@@ -96,7 +112,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	struct scatterlist *sg, *sg_list_start;
 	int need_release = 0;
 	unsigned int gup_flags = FOLL_WRITE;
+	struct vm_area_struct *vma_prev = NULL;
+	struct device *dma_dev;
 
+	dma_dev = context->device->dma_device;
 	if (dmasync)
 		dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +139,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->address    = addr;
 	umem->page_shift = PAGE_SHIFT;
 	umem->pid	 = get_task_pid(current, PIDTYPE_PID);
+	INIT_LIST_HEAD(&umem->leases);
+	set_bit(IB_UMEM_MAPPED, &umem->state);
 	/*
 	 * We ask for writable memory if any of the following
 	 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to setup a ->lease_direct()
+	 * lease to protect against file modifications, otherwise we can
+	 * tolerate a failure to allocate the vma_list and just assume
+	 * that all vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +222,50 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			const struct vm_operations_struct *vm_ops;
+			struct vm_area_struct *vma;
+			struct lease_direct *ld;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (vma == vma_prev)
+				continue;
+			vma_prev = vma;
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			vm_ops = vma->vm_ops;
+			if (!vm_ops->lease_direct) {
+				dev_info(dma_dev, "DAX-RDMA needs lease_direct\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+
+			if (!dma_has_iommu(dma_dev)) {
+				dev_info(dma_dev, "DAX-RDMA needs iommu\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			ld = vm_ops->lease_direct(vma, ib_umem_lease_break,
+					umem);
+			if (IS_ERR(ld)) {
+				ret = PTR_ERR(ld);
+				goto out;
+			}
+			list_add(&ld->list, &umem->leases);
 		}
 
 		/* preparing for next loop */
@@ -242,6 +300,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 23159dd5be18..796ffe5b8dc3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,7 @@
 struct ib_ucontext;
 struct ib_umem_odp;
 
+#define IB_UMEM_MAPPED 0
 struct ib_umem {
 	struct ib_ucontext     *context;
 	size_t			length;
@@ -55,6 +56,13 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;
+	/*
+	 * Note: no lock protects this list since we assume memory
+	 * registration never races unregistration for a given ib_umem
+	 * instance.
+	 */
+	struct list_head	leases;
+	unsigned long		state;
 };
 
 /* Returns the offset of the umem start relative to the first page. */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
@ 2017-10-06 22:36   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Sean Hefty, linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, Dave Chinner, Jeff Moyer, Christoph Hellwig,
	J. Bruce Fields, linux-mm, Doug Ledford, Ross Zwisler,
	linux-fsdevel, Jeff Layton, Hal Rosenstock

Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
the new dma_has_iommu() helper to determine if ib_dma_unmap_sg() is
sufficient for revoking access. Once we have that assurance and a
->lease_direct() lease we can safely allow RDMA to DAX.

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/infiniband/core/umem.c |   90 ++++++++++++++++++++++++++++++++++------
 include/rdma/ib_umem.h         |    8 ++++
 2 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..dc3ae1bee669 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/mapdirect.h>
 #include <linux/export.h>
 #include <linux/hugetlb.h>
 #include <linux/slab.h>
@@ -46,11 +47,12 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+	struct lease_direct *ld, *_ld;
 	struct scatterlist *sg;
 	struct page *page;
 	int i;
 
-	if (umem->nmap > 0)
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
 				umem->npages,
 				DMA_BIDIRECTIONAL);
@@ -64,8 +66,22 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	}
 
 	sg_free_table(&umem->sg_head);
-	return;
 
+	list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+		list_del_init(&ld->list);
+		map_direct_lease_destroy(ld);
+	}
+}
+
+static void ib_umem_lease_break(void *__umem)
+{
+	struct ib_umem *umem = umem;
+	struct ib_device *dev = umem->context->device;
+
+	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
+		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
+				umem->npages,
+				DMA_BIDIRECTIONAL);
 }
 
 /**
@@ -96,7 +112,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	struct scatterlist *sg, *sg_list_start;
 	int need_release = 0;
 	unsigned int gup_flags = FOLL_WRITE;
+	struct vm_area_struct *vma_prev = NULL;
+	struct device *dma_dev;
 
+	dma_dev = context->device->dma_device;
 	if (dmasync)
 		dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +139,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->address    = addr;
 	umem->page_shift = PAGE_SHIFT;
 	umem->pid	 = get_task_pid(current, PIDTYPE_PID);
+	INIT_LIST_HEAD(&umem->leases);
+	set_bit(IB_UMEM_MAPPED, &umem->state);
 	/*
 	 * We ask for writable memory if any of the following
 	 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to setup a ->lease_direct()
+	 * lease to protect against file modifications, otherwise we can
+	 * tolerate a failure to allocate the vma_list and just assume
+	 * that all vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +222,50 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			const struct vm_operations_struct *vm_ops;
+			struct vm_area_struct *vma;
+			struct lease_direct *ld;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (vma == vma_prev)
+				continue;
+			vma_prev = vma;
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			vm_ops = vma->vm_ops;
+			if (!vm_ops->lease_direct) {
+				dev_info(dma_dev, "DAX-RDMA needs lease_direct\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+
+			if (!dma_has_iommu(dma_dev)) {
+				dev_info(dma_dev, "DAX-RDMA needs iommu\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			ld = vm_ops->lease_direct(vma, ib_umem_lease_break,
+					umem);
+			if (IS_ERR(ld)) {
+				ret = PTR_ERR(ld);
+				goto out;
+			}
+			list_add(&ld->list, &umem->leases);
 		}
 
 		/* preparing for next loop */
@@ -242,6 +300,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 23159dd5be18..796ffe5b8dc3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -40,6 +40,7 @@
 struct ib_ucontext;
 struct ib_umem_odp;
 
+#define IB_UMEM_MAPPED 0
 struct ib_umem {
 	struct ib_ucontext     *context;
 	size_t			length;
@@ -55,6 +56,13 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;
+	/*
+	 * Note: no lock protects this list since we assume memory
+	 * registration never races unregistration for a given ib_umem
+	 * instance.
+	 */
+	struct list_head	leases;
+	unsigned long		state;
 };
 
 /* Returns the offset of the umem start relative to the first page. */
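
For reference, the IB_UMEM_MAPPED bit introduced above is what lets the
lease-break callback and the final __ib_umem_release() race without
double-unmapping; both paths reduce to the same idiom:

	/* whichever path clears the bit first performs the unmap */
	if (umem->nmap > 0 && test_and_clear_bit(IB_UMEM_MAPPED, &umem->state))
		ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages,
				DMA_BIDIRECTIONAL);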


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests
  2017-10-06 22:35 ` Dan Williams
@ 2017-10-06 22:36   ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:36 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-xfs, linux-mm, linux-rdma, linux-fsdevel, linux-api

Provide a mock dma_has_iommu() for the ibverbs core so that
ib_umem_get() can satisfy its DAX safety checks in a controlled test.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 tools/testing/nvdimm/Kbuild         |   31 +++++++++++++++++++++++++++++++
 tools/testing/nvdimm/config_check.c |    2 ++
 tools/testing/nvdimm/test/iomap.c   |    6 ++++++
 3 files changed, 39 insertions(+)

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index d870520da68b..e4ee7f482ac0 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,11 +15,13 @@ ldflags-y += --wrap=insert_resource
 ldflags-y += --wrap=remove_resource
 ldflags-y += --wrap=acpi_evaluate_object
 ldflags-y += --wrap=acpi_evaluate_dsm
+ldflags-y += --wrap=dma_has_iommu
 
 DRIVERS := ../../../drivers
 NVDIMM_SRC := $(DRIVERS)/nvdimm
 ACPI_SRC := $(DRIVERS)/acpi/nfit
 DAX_SRC := $(DRIVERS)/dax
+IBCORE := $(DRIVERS)/infiniband/core
 ccflags-y := -I$(src)/$(NVDIMM_SRC)/
 
 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
@@ -33,6 +35,7 @@ obj-$(CONFIG_DAX) += dax.o
 endif
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_INFINIBAND) += ib_core.o
 
 nfit-y := $(ACPI_SRC)/core.o
 nfit-$(CONFIG_X86_MCE) += $(ACPI_SRC)/mce.o
@@ -75,4 +78,32 @@ libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-$(CONFIG_NVDIMM_DAX) += $(NVDIMM_SRC)/dax_devs.o
 libnvdimm-y += config_check.o
 
+ib_core-y := $(IBCORE)/packer.o
+ib_core-y += $(IBCORE)/ud_header.o
+ib_core-y += $(IBCORE)/verbs.o
+ib_core-y += $(IBCORE)/cq.o
+ib_core-y += $(IBCORE)/rw.o
+ib_core-y += $(IBCORE)/sysfs.o
+ib_core-y += $(IBCORE)/device.o
+ib_core-y += $(IBCORE)/fmr_pool.o
+ib_core-y += $(IBCORE)/cache.o
+ib_core-y += $(IBCORE)/netlink.o
+ib_core-y += $(IBCORE)/roce_gid_mgmt.o
+ib_core-y += $(IBCORE)/mr_pool.o
+ib_core-y += $(IBCORE)/addr.o
+ib_core-y += $(IBCORE)/sa_query.o
+ib_core-y += $(IBCORE)/multicast.o
+ib_core-y += $(IBCORE)/mad.o
+ib_core-y += $(IBCORE)/smi.o
+ib_core-y += $(IBCORE)/agent.o
+ib_core-y += $(IBCORE)/mad_rmpp.o
+ib_core-y += $(IBCORE)/security.o
+ib_core-y += $(IBCORE)/nldev.o
+
+ib_core-$(CONFIG_INFINIBAND_USER_MEM) += $(IBCORE)/umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += $(IBCORE)/umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += $(IBCORE)/cgroup.o
+ib_core-y += config_check.o
+
 obj-m += test/
diff --git a/tools/testing/nvdimm/config_check.c b/tools/testing/nvdimm/config_check.c
index 7dc5a0af9b54..33e7c805bfd6 100644
--- a/tools/testing/nvdimm/config_check.c
+++ b/tools/testing/nvdimm/config_check.c
@@ -14,4 +14,6 @@ void check(void)
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_ACPI_NFIT));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX));
 	BUILD_BUG_ON(!IS_MODULE(CONFIG_DEV_DAX_PMEM));
+	BUILD_BUG_ON(!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM));
+	BUILD_BUG_ON(!IS_MODULE(CONFIG_INFINIBAND));
 }
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index e1f75a1914a1..1c240328ee5b 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -388,4 +388,10 @@ union acpi_object * __wrap_acpi_evaluate_dsm(acpi_handle handle, const guid_t *g
 }
 EXPORT_SYMBOL(__wrap_acpi_evaluate_dsm);
 
+bool __wrap_dma_has_iommu(struct device *dev)
+{
+	return true;
+}
+EXPORT_SYMBOL(__wrap_dma_has_iommu);
+
 MODULE_LICENSE("GPL v2");
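
For background on the mock mechanism: "ldflags-y += --wrap=dma_has_iommu"
uses the GNU ld --wrap option, which redirects every undefined reference
to the named symbol to its __wrap_-prefixed counterpart, so the stub above
unconditionally answers "iommu present" under the test harness. A generic
sketch with a hypothetical symbol foo:

	/* with --wrap=foo, calls to foo() bind to this wrapper */
	bool __wrap_foo(void)
	{
		return true;		/* the mocked answer */
	}

	/* the original implementation stays reachable as __real_foo() */
	extern bool __real_foo(void);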

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-06 22:35   ` Dan Williams
@ 2017-10-06 22:45     ` David Woodhouse
  -1 siblings, 0 replies; 158+ messages in thread
From: David Woodhouse @ 2017-10-06 22:45 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Jan Kara, Ashok Raj,
	Darrick J. Wong, linux-rdma, Greg Kroah-Hartman, Joerg Roedel,
	Dave Chinner, linux-xfs, linux-mm, Jeff Moyer, linux-api,
	linux-fsdevel, Ross Zwisler, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
> Add a helper to determine if the dma mappings set up for a given device
> are backed by an iommu. In particular, this lets code paths know that a
> dma_unmap operation will revoke access to memory if the device can not
> otherwise be quiesced. The need for this knowledge is driven by a need
> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> changes we need to be able to reliably stop accesses to blocks that have been
> freed or re-assigned to a new file.

"a dma_unmap operation revoke access to memory"... but it's OK that the
next *map* will give the same DMA address to someone else, right?


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-06 22:52       ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 22:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Linux API, linux-fsdevel, Robin Murphy,
	Christoph Hellwig, Marek Szyprowski

On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> Add a helper to determine if the dma mappings set up for a given device
>> are backed by an iommu. In particular, this lets code paths know that a
>> dma_unmap operation will revoke access to memory if the device can not
>> otherwise be quiesced. The need for this knowledge is driven by a need
>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> changes we need to be able to reliably stop accesses to blocks that have been
>> freed or re-assigned to a new file.
>
> "a dma_unmap operation revoke access to memory"... but it's OK that the
> next *map* will give the same DMA address to someone else, right?

I'm assuming the next map will be to other physical addresses and a
different requester device since the memory is still registered
exclusively.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-06 23:10         ` David Woodhouse
  0 siblings, 0 replies; 158+ messages in thread
From: David Woodhouse @ 2017-10-06 23:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	Robin Murphy, Christoph Hellwig, Marek Szyprowski

On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote:
> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
> > > 
> > > Add a helper to determine if the dma mappings set up for a given device
> > > are backed by an iommu. In particular, this lets code paths know that a
> > > dma_unmap operation will revoke access to memory if the device can not
> > > otherwise be quiesced. The need for this knowledge is driven by a need
> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> > > changes we need to be able to reliably stop accesses to blocks that have been
> > > freed or re-assigned to a new file.
> > "a dma_unmap operation revoke access to memory"... but it's OK that the
> > next *map* will give the same DMA address to someone else, right?
>
> I'm assuming the next map will be to other physical addresses and a
> different requester device since the memory is still registered
> exclusively.

I meant the next map for this device/group.

It may well use the same virtual DMA address as the one you just
unmapped, yet actually map to a different physical address. So if the
DMA still occurs to the "old" address, that isn't revoked at all — it's
just going to the wrong physical location.

And if you are sure that the DMA will never happen, why do you need to
revoke the mapping in the first place?


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-06 22:52       ` Dan Williams
@ 2017-10-06 23:12         ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 23:12 UTC (permalink / raw)
  To: David Woodhouse
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	Robin Murphy, Christoph Hellwig, Marek Szyprowski

On Fri, Oct 6, 2017 at 3:52 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>>> Add a helper to determine if the dma mappings set up for a given device
>>> are backed by an iommu. In particular, this lets code paths know that a
>>> dma_unmap operation will revoke access to memory if the device can not
>>> otherwise be quiesced. The need for this knowledge is driven by a need
>>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>>> changes we need to be able to reliably stop accesses to blocks that have been
>>> freed or re-assigned to a new file.
>>
>> "a dma_unmap operation revoke access to memory"... but it's OK that the
>> next *map* will give the same DMA address to someone else, right?
>
> I'm assuming the next map will be to other physical addresses and a
> different requester device since the memory is still registered
> exclusively.

[ chatted with Ashok ]

Yes, it seems we need a way to pin that IOVA as in use, but invalidate
it. Then we can wait for the unmap to occur when the memory is
unregistered to avoid this IOVA reuse problem.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-06 23:15           ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-06 23:15 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Linux API, linux-fsdevel, Robin Murphy,
	Christoph Hellwig, Marek Szyprowski

On Fri, Oct 6, 2017 at 4:10 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 15:52 -0700, Dan Williams wrote:
>> On Fri, Oct 6, 2017 at 3:45 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> >
>> > On Fri, 2017-10-06 at 15:35 -0700, Dan Williams wrote:
>> > >
>> > > Add a helper to determine if the dma mappings set up for a given device
>> > > are backed by an iommu. In particular, this lets code paths know that a
>> > > dma_unmap operation will revoke access to memory if the device can not
>> > > otherwise be quiesced. The need for this knowledge is driven by a need
>> > > to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> > > changes we need to be able to reliably stop accesses to blocks that have been
>> > > freed or re-assigned to a new file.
>> > "a dma_unmap operation revoke access to memory"... but it's OK that the
>> > next *map* will give the same DMA address to someone else, right?
>>
>> I'm assuming the next map will be to other physical addresses and a
>> different requester device since the memory is still registered
>> exclusively.
>
> I meant the next map for this device/group.
>
> It may well use the same virtual DMA address as the one you just
> unmapped, yet actually map to a different physical address. So if the
> DMA still occurs to the "old" address, that isn't revoked at all — it's
> just going to the wrong physical location.
>
> And if you are sure that the DMA will never happen, why do you need to
> revoke the mapping in the first place?

Right, crossed mails. The semantic I want is that the IOVA is
invalidated / starts throwing errors to the device because the address
it thought it was talking to has been remapped in the file. Once
userspace wakes up and responds to this invalidation event it can do
the actual unmap to make the IOVA reusable again.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-06 23:15           ` Dan Williams
@ 2017-10-07 11:08             ` David Woodhouse
  -1 siblings, 0 replies; 158+ messages in thread
From: David Woodhouse @ 2017-10-07 11:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	Robin Murphy, Christoph Hellwig, Marek Szyprowski

On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote:
> 
> Right, crossed mails. The semantic I want is that the IOVA is
> invalidated / starts throwing errors to the device because the address
> it thought it was talking to has been remapped in the file. Once
> userspace wakes up and responds to this invalidation event it can do
> the actual unmap to make the IOVA reusable again.

So basically you want to unmap it by removing it from the page tables
and flushing the IOTLB, but you want the IOVA to still be reserved.

The normal device-facing DMA API doesn't give you that today. You could
do it with the IOMMU API though — that one does let you manage the IOVA
space yourself. You don't want that IOVA used again? Well don't use it
as the IOVA in a subsequent iommu_map() call then :)
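
For illustration, the split being described (a sketch assuming the
iommu_map()/iommu_unmap() signatures of this era):

	/* revoke: tear down the translation so the device now faults */
	iommu_unmap(domain, iova, size);

	/*
	 * "Reserve": with the IOMMU API the caller owns IOVA allocation,
	 * so the address stays dead simply by never being handed to a
	 * subsequent iommu_map(domain, iova, new_paddr, size, prot).
	 */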


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-07 23:33               ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-07 23:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Linux API, linux-fsdevel, Robin Murphy,
	Christoph Hellwig, Marek Szyprowski

On Sat, Oct 7, 2017 at 4:08 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2017-10-06 at 16:15 -0700, Dan Williams wrote:
>>
>> Right, crossed mails. The semantic I want is that the IOVA is
>> invalidated / starts throwing errors to the device because the address
>> it thought it was talking to has been remapped in the file. Once
>> userspace wakes up and responds to this invalidation event it can do
>> the actual unmap to make the IOVA reusable again.
>
> So basically you want to unmap it by removing it from the page tables
> and flushing the IOTLB, but you want the IOVA to still be reserved.
>
> The normal device-facing DMA API doesn't give you that today. You could
> do it with the IOMMU API though — that one does let you manage the IOVA
> space yourself. You don't want that IOVA used again? Well don't use it
> as the IOVA in a subsequent iommu_map() call then :)

Ah, nice. So I think I'll just add a dma_get_iommu_domain() so the
dma_ops implementation can exclude identity-mapped devices, and then
iommu_unmap() does the rest. Thanks for the pointer.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* [PATCH v8] dma-mapping: introduce dma_get_iommu_domain()
  2017-10-06 22:35   ` Dan Williams
@ 2017-10-08  3:45     ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-08  3:45 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, Robin Murphy,
	linux-xfs, linux-mm, linux-api, linux-fsdevel, David Woodhouse,
	Christoph Hellwig, Marek Szyprowski

Add a dma-mapping API helper to retrieve the generic iommu_domain for a
device. The motivation for this interface is making RDMA transfers to DAX
mappings safe. If the DAX file's block map changes we need to be able to
reliably stop accesses to blocks that have been freed or re-assigned to a
new file. With the iommu_domain and a callback from the DAX filesystem the
kernel can safely revoke access to a DMA device. The process that performed
the RDMA memory registration is also notified of this revocation event, but
the kernel cannot otherwise be put in the position of waiting for userspace
to quiesce the device.

Since PMEM+DAX is currently only enabled for x86, we only update the x86
iommu drivers.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v7:
* retrieve the iommu_domain so that we can later pass the results of
  dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*.

 drivers/base/dma-mapping.c  |   10 ++++++++++
 drivers/iommu/amd_iommu.c   |   10 ++++++++++
 drivers/iommu/intel-iommu.c |   15 +++++++++++++++
 include/linux/dma-mapping.h |    3 +++
 4 files changed, 38 insertions(+)

diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index e584eddef0a7..fdb9764f95a4 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
 	of_dma_deconfigure(dev);
 	acpi_dma_deconfigure(dev);
 }
+
+struct iommu_domain *dma_get_iommu_domain(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	if (ops && ops->get_iommu)
+		return ops->get_iommu(dev);
+	return NULL;
+}
+EXPORT_SYMBOL(dma_get_iommu_domain);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 51f8215877f5..c8e1a45af182 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
 	return domain;
 }
 
+static struct iommu_domain *amd_dma_get_iommu(struct device *dev)
+{
+	struct protection_domain *domain = get_domain(dev);
+
+	if (IS_ERR(domain))
+		return NULL;
+	return &domain->domain;
+}
+
 static void update_device_table(struct protection_domain *domain)
 {
 	struct iommu_dev_data *dev_data;
@@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
 	.unmap_sg	= unmap_sg,
 	.dma_supported	= amd_iommu_dma_supported,
 	.mapping_error	= amd_iommu_mapping_error,
+	.get_iommu	= amd_dma_get_iommu,
 };
 
 static int init_reserved_iova_ranges(void)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..f3f4939cebad 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
 	return 0;
 }
 
+static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
+{
+	struct dmar_domain *domain;
+
+	if (iommu_no_mapping(dev))
+		return NULL;
+
+	domain = get_valid_domain_for_dev(dev);
+	if (!domain)
+		return NULL;
+
+	return &domain->domain;
+}
+
 static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
 				     size_t size, int dir, u64 dma_mask)
 {
@@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
+	.get_iommu = intel_dma_get_iommu,
 #ifdef CONFIG_X86
 	.dma_supported = x86_dma_supported,
 #endif
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 29ce9815da87..aa62df1d0d72 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -128,6 +128,7 @@ struct dma_map_ops {
 				   enum dma_data_direction dir);
 	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
 	int (*dma_supported)(struct device *dev, u64 mask);
+	struct iommu_domain *(*get_iommu)(struct device *dev);
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
 	u64 (*get_required_mask)(struct device *dev);
 #endif
@@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
 }
 #endif
 
+extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
+
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
 					      size_t size,
 					      enum dma_data_direction dir,

^ permalink raw reply related	[flat|nested] 158+ messages in thread
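
[ For callers of dma_get_iommu_domain(), the NULL return doubles as a
  capability check: a device with no iommu, or one that is identity
  mapped, yields NULL and its access cannot be force-revoked. A short
  sketch under that assumption; the surrounding code is illustrative:

	struct iommu_domain *domain = dma_get_iommu_domain(dev);

	if (!domain)
		return -EOPNOTSUPP; /* access is not revocable */

	/* safe: mappings can later be invalidated with iommu_unmap() */
  ]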

* [PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu
  2017-10-06 22:36   ` Dan Williams
  (?)
@ 2017-10-08  4:02     ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-08  4:02 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-rdma, linux-api, Joerg Roedel, linux-xfs, linux-mm, iommu,
	linux-fsdevel

iommu_sg_num_pages() is a helper that walks a scatterlist and counts
pages, taking segment boundaries and iommu_num_pages() into account.
Up-level it so callers can determine the IOVA range that a dma_map_ops
implementation established at dma_map_sg() time. The intent is to
iommu_unmap() that IOVA range in advance of freeing the IOVA allocation.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
New patch in v8.

 drivers/iommu/amd_iommu.c |   30 ++----------------------------
 drivers/iommu/iommu.c     |   27 +++++++++++++++++++++++++++
 include/linux/iommu.h     |    2 ++
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index c8e1a45af182..4795b0823469 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2459,32 +2459,6 @@ static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
 	__unmap_single(dma_dom, dma_addr, size, dir);
 }
 
-static int sg_num_pages(struct device *dev,
-			struct scatterlist *sglist,
-			int nelems)
-{
-	unsigned long mask, boundary_size;
-	struct scatterlist *s;
-	int i, npages = 0;
-
-	mask          = dma_get_seg_boundary(dev);
-	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT :
-				   1UL << (BITS_PER_LONG - PAGE_SHIFT);
-
-	for_each_sg(sglist, s, nelems, i) {
-		int p, n;
-
-		s->dma_address = npages << PAGE_SHIFT;
-		p = npages % boundary_size;
-		n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
-		if (p + n > boundary_size)
-			npages += boundary_size - p;
-		npages += n;
-	}
-
-	return npages;
-}
-
 /*
  * The exported map_sg function for dma_ops (handles scatter-gather
  * lists).
@@ -2507,7 +2481,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
 	dma_dom  = to_dma_ops_domain(domain);
 	dma_mask = *dev->dma_mask;
 
-	npages = sg_num_pages(dev, sglist, nelems);
+	npages = iommu_sg_num_pages(dev, sglist, nelems);
 
 	address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
 	if (address == AMD_IOMMU_MAPPING_ERROR)
@@ -2585,7 +2559,7 @@ static void unmap_sg(struct device *dev, struct scatterlist *sglist,
 
 	startaddr = sg_dma_address(sglist) & PAGE_MASK;
 	dma_dom   = to_dma_ops_domain(domain);
-	npages    = sg_num_pages(dev, sglist, nelems);
+	npages    = iommu_sg_num_pages(dev, sglist, nelems);
 
 	__unmap_single(dma_dom, startaddr, npages << PAGE_SHIFT, dir);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3de5c0bcb5cc..cfe6eeea3578 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -33,6 +33,7 @@
 #include <linux/bitops.h>
 #include <linux/property.h>
 #include <trace/events/iommu.h>
+#include <linux/iommu-helper.h>
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -1631,6 +1632,32 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
+int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+		int nelems)
+{
+	unsigned long mask, boundary_size;
+	struct scatterlist *s;
+	int i, npages = 0;
+
+	mask = dma_get_seg_boundary(dev);
+	boundary_size = mask + 1 ? ALIGN(mask + 1, PAGE_SIZE) >> PAGE_SHIFT
+		: 1UL << (BITS_PER_LONG - PAGE_SHIFT);
+
+	for_each_sg(sglist, s, nelems, i) {
+		int p, n;
+
+		s->dma_address = npages << PAGE_SHIFT;
+		p = npages % boundary_size;
+		n = iommu_num_pages(sg_phys(s), s->length, PAGE_SIZE);
+		if (p + n > boundary_size)
+			npages += boundary_size - p;
+		npages += n;
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(iommu_sg_num_pages);
+
 size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			 struct scatterlist *sg, unsigned int nents, int prot)
 {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a7f2ac689d29..5b2d20e1475a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -303,6 +303,8 @@ extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
 			       unsigned long iova, size_t size);
+extern int iommu_sg_num_pages(struct device *dev, struct scatterlist *sglist,
+		int nelems);
 extern size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 				struct scatterlist *sg,unsigned int nents,
 				int prot);

^ permalink raw reply related	[flat|nested] 158+ messages in thread
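
[ The intended pairing, per the changelog above: recompute the IOVA
  footprint that dma_map_sg() established for a scatterlist, then revoke
  exactly that range. Note that iommu_unmap() takes a size in bytes, so
  the page count must be shifted. A sketch; dev/domain/sgl/nelems are
  illustrative:

	int npages = iommu_sg_num_pages(dev, sgl, nelems);

	iommu_unmap(domain, sg_dma_address(sgl) & PAGE_MASK,
			(size_t)npages << PAGE_SHIFT);
  ]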

* [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
  2017-10-06 22:36   ` Dan Williams
  (?)
@ 2017-10-08  4:04     ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-08  4:04 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Doug Ledford, Jan Kara, Ashok Raj,
	Darrick J. Wong, linux-rdma, linux-api, Joerg Roedel,
	Dave Chinner, iommu, Hal Rosenstock, linux-xfs, linux-mm,
	Jeff Layton, linux-fsdevel, Sean Hefty, David Woodhouse,
	Christoph Hellwig

Currently the ibverbs core in the kernel is completely unaware of the
dangers of filesystem-DAX mappings. Specifically, the filesystem is free
to move file blocks at will. In the case of DAX, it means that RDMA to a
given file offset can dynamically switch to another file offset, another
file, or free space with no notification to the RDMA device to cease
operations. Historically, this lack of communication between the ibverbs
core and filesystem was not a problem because RDMA always targeted
dynamically allocated page cache, so at least the RDMA device would have
valid memory to target even if the file was being modified. With DAX we
need to add coordination since RDMA is bypassing page-cache and going
direct to on-media pages of the file. RDMA to DAX can cause damage if
filesystem blocks move / change state.

Use the new ->lease_direct() operation to get a notification when the
filesystem is invalidating the block map of the file and needs RDMA
operations to stop. Given that the kernel can not be in a position where
it needs to wait indefinitely for userspace to stop a device we need a
mechanism where the kernel can force-revoke access. Towards that end, use
dma_get_iommu_domain() to both check whether the device has domain
mappings that can be invalidated and to retrieve the iommu_domain for use
with iommu_unmap().

Once we have the assurance that we can block in-flight I/O when the
file's block map changes, we can safely allow RDMA to DAX.

Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v7:
* Switch from dma_has_iommu() to dma_get_iommu_domain().
* Switch from dma_unmap_sg() at lease break time, to iommu_unmap() so
  that the IOVA remains allocated while the device might still be
  sending transactions.

 drivers/infiniband/core/umem.c |   90 +++++++++++++++++++++++++++++++++++-----
 include/rdma/ib_umem.h         |    8 ++++
 2 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 21e60b1e2ff4..5e4598982359 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -36,6 +36,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/mapdirect.h>
 #include <linux/export.h>
 #include <linux/hugetlb.h>
 #include <linux/slab.h>
@@ -46,10 +47,16 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+	struct lease_direct *ld, *_ld;
 	struct scatterlist *sg;
 	struct page *page;
 	int i;
 
+	list_for_each_entry_safe(ld, _ld, &umem->leases, list) {
+		list_del_init(&ld->list);
+		map_direct_lease_destroy(ld);
+	}
+
 	if (umem->nmap > 0)
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl,
 				umem->npages,
@@ -64,10 +71,20 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	}
 
 	sg_free_table(&umem->sg_head);
-	return;
 
 }
 
+static void ib_umem_lease_break(void *__umem)
+{
+	struct ib_umem *umem = __umem;
+	struct ib_device *idev = umem->context->device;
+	struct device *dev = idev->dma_device;
+	struct scatterlist *sgl = umem->sg_head.sgl;
+
+	iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
+		    iommu_sg_num_pages(dev, sgl, umem->npages) << PAGE_SHIFT);
+}
+
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
  *
@@ -96,7 +113,10 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	struct scatterlist *sg, *sg_list_start;
 	int need_release = 0;
 	unsigned int gup_flags = FOLL_WRITE;
+	struct vm_area_struct *vma_prev = NULL;
+	struct device *dma_dev;
 
+	dma_dev = context->device->dma_device;
 	if (dmasync)
 		dma_attrs |= DMA_ATTR_WRITE_BARRIER;
 
@@ -120,6 +140,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->address    = addr;
 	umem->page_shift = PAGE_SHIFT;
 	umem->pid	 = get_task_pid(current, PIDTYPE_PID);
+	INIT_LIST_HEAD(&umem->leases);
 	/*
 	 * We ask for writable memory if any of the following
 	 * access flags are set.  "Local write" and "remote write"
@@ -147,19 +168,21 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	umem->hugetlb   = 1;
 
 	page_list = (struct page **) __get_free_page(GFP_KERNEL);
-	if (!page_list) {
-		put_pid(umem->pid);
-		kfree(umem);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!page_list)
+		goto err_pagelist;
 
 	/*
-	 * if we can't alloc the vma_list, it's not so bad;
-	 * just assume the memory is not hugetlb memory
+	 * If DAX is enabled we need the vma to set up a ->lease_direct()
+	 * lease to protect against file modifications, otherwise we can
+	 * tolerate a failure to allocate the vma_list and just assume
+	 * that all vmas are not hugetlb-vmas.
 	 */
 	vma_list = (struct vm_area_struct **) __get_free_page(GFP_KERNEL);
-	if (!vma_list)
+	if (!vma_list) {
+		if (IS_ENABLED(CONFIG_DAX_MAP_DIRECT))
+			goto err_vmalist;
 		umem->hugetlb = 0;
+	}
 
 	npages = ib_umem_num_pages(umem);
 
@@ -199,15 +222,52 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		if (ret < 0)
 			goto out;
 
-		umem->npages += ret;
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
 
 		for_each_sg(sg_list_start, sg, ret, i) {
-			if (vma_list && !is_vm_hugetlb_page(vma_list[i]))
-				umem->hugetlb = 0;
+			const struct vm_operations_struct *vm_ops;
+			struct vm_area_struct *vma;
+			struct lease_direct *ld;
 
 			sg_set_page(sg, page_list[i], PAGE_SIZE, 0);
+			umem->npages++;
+
+			if (!vma_list)
+				continue;
+			vma = vma_list[i];
+
+			if (vma == vma_prev)
+				continue;
+			vma_prev = vma;
+
+			if (!is_vm_hugetlb_page(vma))
+				umem->hugetlb = 0;
+
+			if (!vma_is_dax(vma))
+				continue;
+
+			vm_ops = vma->vm_ops;
+			if (!vm_ops->lease_direct) {
+				dev_info(dma_dev, "DAX-RDMA requires a MAP_DIRECT mapping\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+
+			if (!umem->iommu)
+				umem->iommu = dma_get_iommu_domain(dma_dev);
+			if (!umem->iommu) {
+				dev_info(dma_dev, "DAX-RDMA requires an iommu protected device\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			ld = vm_ops->lease_direct(vma, ib_umem_lease_break,
+					umem);
+			if (IS_ERR(ld)) {
+				ret = PTR_ERR(ld);
+				goto out;
+			}
+			list_add(&ld->list, &umem->leases);
 		}
 
 		/* preparing for next loop */
@@ -242,6 +302,12 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	free_page((unsigned long) page_list);
 
 	return ret < 0 ? ERR_PTR(ret) : umem;
+err_vmalist:
+	free_page((unsigned long) page_list);
+err_pagelist:
+	put_pid(umem->pid);
+	kfree(umem);
+	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(ib_umem_get);
 
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 23159dd5be18..5048be012f96 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -34,6 +34,7 @@
 #define IB_UMEM_H
 
 #include <linux/list.h>
+#include <linux/iommu.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
 
@@ -55,6 +56,13 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	int             npages;
+	/*
+	 * Note: no lock protects this list since we assume memory
+	 * registration never races unregistration for a given ib_umem
+	 * instance.
+	 */
+	struct list_head	leases;
+	struct iommu_domain	*iommu;
 };
 
 /* Returns the offset of the umem start relative to the first page. */

^ permalink raw reply related	[flat|nested] 158+ messages in thread
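
[ The userspace half of this contract, roughly: registration only
  succeeds against a vma mapped with MAP_DIRECT (via MAP_SHARED_VALIDATE),
  since that is what gives the kernel a breakable lease. A hedged sketch;
  MAP_DIRECT is introduced by this series and is not an existing uapi
  flag, and pd/fd/len come from the usual verbs and open() setup:

	#include <sys/mman.h>
	#include <err.h>
	#include <infiniband/verbs.h>

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (buf == MAP_FAILED)
		err(1, "mmap");

	/* ib_umem_get() will take a ->lease_direct() lease on this vma */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		err(1, "ibv_reg_mr");
  ]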

* Re: [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
@ 2017-10-08  6:45       ` kbuild test robot
  0 siblings, 0 replies; 158+ messages in thread
From: kbuild test robot @ 2017-10-08  6:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, J. Bruce Fields, linux-mm, Jeff Layton,
	Sean Hefty, Ashok Raj, linux-nvdimm, linux-rdma, Joerg Roedel,
	Christoph Hellwig, Doug Ledford, Hal Rosenstock, linux-api,
	Darrick J. Wong, linux-xfs, iommu, kbuild-all, linux-fsdevel,
	David Woodhouse

Hi Dan,

[auto build test ERROR on rdma/master]
[also build test ERROR on v4.14-rc3 next-20170929]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dan-Williams/iommu-up-level-sg_num_pages-from-amd-iommu/20171008-133505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git master
config: i386-randconfig-n0-201741 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> drivers/infiniband/core/umem.c:39:29: fatal error: linux/mapdirect.h: No such file or directory
    #include <linux/mapdirect.h>
                                ^
   compilation terminated.

vim +39 drivers/infiniband/core/umem.c

  > 39	#include <linux/mapdirect.h>
    40	#include <linux/export.h>
    41	#include <linux/hugetlb.h>
    42	#include <linux/slab.h>
    43	#include <rdma/ib_umem_odp.h>
    44	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings
@ 2017-10-08 15:49         ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-08 15:49 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-nvdimm, Sean Hefty, linux-xfs, Jan Kara,
	Ashok Raj, Darrick J. Wong, linux-rdma, Linux API, Joerg Roedel,
	Dave Chinner, Jeff Moyer, iommu, Christoph Hellwig,
	J. Bruce Fields, Linux MM, Doug Ledford, Ross Zwisler,
	linux-fsdevel, Jeff Layton, David Woodhouse, Hal Rosenstock

On Sat, Oct 7, 2017 at 11:45 PM, kbuild test robot <lkp@intel.com> wrote:
> Hi Dan,
>
> [auto build test ERROR on rdma/master]
> [also build test ERROR on v4.14-rc3 next-20170929]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

This was a fixed-up resend of patch [v7 11/12]. It's not clear how to
teach the kbuild robot to be aware of patch-replies to individual
patches in a series, i.e. reworked patches sent without resending the
complete series.

> url:    https://github.com/0day-ci/linux/commits/Dan-Williams/iommu-up-level-sg_num_pages-from-amd-iommu/20171008-133505
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git master
> config: i386-randconfig-n0-201741 (attached as .config)
> compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>>> drivers/infiniband/core/umem.c:39:29: fatal error: linux/mapdirect.h: No such file or directory
>     #include <linux/mapdirect.h>

mapdirect.h indeed does not exist when the earlier patches in the
series are missing. It would be slick if the 0day-robot read the
"in-reply-to" header and auto-replaced a patch in a series, but that
would be a feature approaching magic.

>    compilation terminated.
>
> vim +39 drivers/infiniband/core/umem.c
>
>   > 39  #include <linux/mapdirect.h>
>     40  #include <linux/export.h>
>     41  #include <linux/hugetlb.h>
>     42  #include <linux/slab.h>
>     43  #include <rdma/ib_umem_odp.h>
>     44
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 03/12] fs: introduce i_mapdcount
@ 2017-10-09  3:08     ` Dave Chinner
  0 siblings, 0 replies; 158+ messages in thread
From: Dave Chinner @ 2017-10-09  3:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, linux-nvdimm, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

On Fri, Oct 06, 2017 at 03:35:32PM -0700, Dan Williams wrote:
> When ->iomap_begin() sees this count being non-zero and determines that
> the block map of the file needs to be modified to satisfy the I/O
> request it will instead return an error. This is needed for MAP_DIRECT
> where, due to locking constraints, we can't rely on xfs_break_layouts()
> to protect against allocating write-faults either from the process that
> set up the MAP_DIRECT mapping or from other processes that have the file
> mapped.  xfs_break_layouts() requires XFS_IOLOCK which is problematic to
> mix with the XFS_MMAPLOCK in the fault path.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_iomap.c |    9 +++++++++
>  include/linux/fs.h |   31 +++++++++++++++++++++++++++++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index a1909bc064e9..6816f8ebbdcf 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1053,6 +1053,15 @@ xfs_file_iomap_begin(
>  			goto out_unlock;
>  		}
>  		/*
> +		 * If a file has MAP_DIRECT mappings, disable block map
> +		 * updates. This should only affect mmap write faults as
> +		 * other paths are protected by an FL_LAYOUT lease.
> +		 */
> +		if (i_mapdcount_read(inode)) {
> +			error = -ETXTBSY;
> +			goto out_unlock;
> +		}

That looks really fragile. For one, it's going to miss modifications
to reflinked files altogether. Ignoring that, however, I don't want to
have to care one bit about the internals of the MAP_DIRECT
implementation in the filesystem code. Hide it behind something with
an obvious name that returns the appropriate error and the
filesystem code becomes self documenting:

	if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
		.....
		error = iomap_can_allocate(inode);
		if (error)
			goto out_unlock;

Then you can put all the MAP_DIRECT stuff and the comments
explaining what it does inside the generic function that determines
if we are allowed to allocate on that inode or not.
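
Something like this, say (rough sketch only; iomap_can_allocate() is
the name suggested above, and i_mapdcount_read() is the accessor this
patch adds):

	static inline int iomap_can_allocate(struct inode *inode)
	{
		/*
		 * MAP_DIRECT mappings pin the block map: refuse
		 * anything that would allocate blocks or convert
		 * extents while one is active.
		 */
		if (i_mapdcount_read(inode))
			return -ETXTBSY;
		return 0;
	}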

> +		/*
>  		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
>  		 * pages to keep the chunks of work done where somewhat symmetric
>  		 * with the work writeback does. This is a completely arbitrary
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c2b9bf3dc4e9..f83871b188ff 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -642,6 +642,9 @@ struct inode {
>  	atomic_t		i_count;
>  	atomic_t		i_dio_count;
>  	atomic_t		i_writecount;
> +#ifdef CONFIG_FS_DAX
> +	atomic_t		i_mapdcount;	/* count of MAP_DIRECT vmas */
> +#endif

Is there any way to avoid growing the struct inode for this?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT
@ 2017-10-09  3:40     ` Dave Chinner
  -1 siblings, 0 replies; 158+ messages in thread
From: Dave Chinner @ 2017-10-09  3:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Arnd Bergmann, Darrick J. Wong,
	linux-rdma, linux-api, linux-nvdimm, linux-xfs, linux-mm,
	Alexander Viro, linux-fsdevel, Jeff Layton, Christoph Hellwig

On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote:
> MAP_DIRECT is an mmap(2) flag with the following semantics:
> 
>   MAP_DIRECT
>   When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
>   same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
>   is broken when a "lease breaker" attempts to write(2), change the block
>   map (fallocate), or change the size of the file. Otherwise the mechanism
>   of a lease break is identical to the typical lease break case where the
>   lease needs to be removed (munmap) within the number of seconds
>   specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
>   remove the lease in time the kernel will invalidate the mapping and
>   force all future accesses to the mapping to trigger SIGBUS.
> 
>   In addition to lease break timeouts causing faults in the mapping to
>   result in SIGBUS, other states of the file will trigger SIGBUS at fault
>   time:
> 
>       * The file is not DAX capable
>       * The file has reflinked (copy-on-write) blocks
>       * The fault would trigger the filesystem to allocate blocks
>       * The fault would trigger the filesystem to perform extent conversion
> 
>   In other words, MAP_DIRECT expects and enforces a fully allocated file
>   where faults can be satisfied without modifying block map metadata.
> 
>   An unprivileged process may establish a MAP_DIRECT mapping on a file
>   whose UID (owner) matches the filesystem UID of the process. A process
>   with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
>   arbitrary files.
> 
>   ERRORS
>   EACCES Beyond the typical mmap(2) conditions that trigger EACCES
>   MAP_DIRECT also requires the permission to set a file lease.
> 
>   EOPNOTSUPP The filesystem explicitly does not support the flag
> 
>   SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
>          might require block-map updates, or the lease timed out and the
>          kernel invalidated the mapping.
> 
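(For reference, the userspace sequence these semantics imply is
something like the following - a sketch only, with a hypothetical
path, fd and length, error handling elided, and MAP_DIRECT coming
from the new uapi header:

	int fd = open("/mnt/dax/file", O_RDWR);
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED)
		/* e.g. EOPNOTSUPP: fall back to MAP_SHARED + msync */;

and the mapping must be removed with munmap() within
lease-break-time once the lease break notification arrives.)
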
> Cc: Jan Kara <jack@suse.cz>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/Kconfig                  |    2 -
>  fs/xfs/xfs_file.c               |  102 +++++++++++++++++++++++++++++++++++++++
>  include/linux/mman.h            |    3 +
>  include/uapi/asm-generic/mman.h |    1 
>  4 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> index f62fc6629abb..f8765653a438 100644
> --- a/fs/xfs/Kconfig
> +++ b/fs/xfs/Kconfig
> @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
>  
>  config XFS_LAYOUT
>  	def_bool y
> -	depends on EXPORTFS_BLOCK_OPS
> +	depends on EXPORTFS_BLOCK_OPS || FS_DAX
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ebdd0bd2b261..e35518600e28 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -40,12 +40,22 @@
>  #include "xfs_iomap.h"
>  #include "xfs_reflink.h"
>  
> +#include <linux/mman.h>
>  #include <linux/dcache.h>
>  #include <linux/falloc.h>
>  #include <linux/pagevec.h>
> +#include <linux/mapdirect.h>
>  #include <linux/backing-dev.h>
>  
>  static const struct vm_operations_struct xfs_file_vm_ops;
> +static const struct vm_operations_struct xfs_file_vm_direct_ops;
> +
> +static inline bool
> +is_xfs_map_direct(
> +		struct vm_area_struct *vma)
> +{
> +	return vma->vm_ops == &xfs_file_vm_direct_ops;
> +}

Namespacing (xfs_vma_is_direct) and whitespace damage.
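
i.e. something like (sketch, same body as the patch, just renamed and
with XFS-style parameter indentation):

	static inline bool
	xfs_vma_is_direct(
		struct vm_area_struct	*vma)
	{
		return vma->vm_ops == &xfs_file_vm_direct_ops;
	}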

>  
>  /*
>   * Clear the specified ranges to zero through either the pagecache or DAX.
> @@ -1008,6 +1018,26 @@ xfs_file_llseek(
>  	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
>  }
>  
> +static int
> +xfs_vma_checks(
> +	struct vm_area_struct	*vma,
> +	struct inode		*inode)

Exactly what are we checking for? The function name doesn't tell me,
and there are no comments, either?

> +{
> +	if (!is_xfs_map_direct(vma))
> +		return 0;
> +
> +	if (!is_map_direct_valid(vma->vm_private_data))
> +		return VM_FAULT_SIGBUS;
> +
> +	if (xfs_is_reflink_inode(XFS_I(inode)))
> +		return VM_FAULT_SIGBUS;
> +
> +	if (!IS_DAX(inode))
> +		return VM_FAULT_SIGBUS;

And how do we get is_xfs_map_direct() set to true if we don't have a
DAX inode, or if the inode has shared extents?

> +
> +	return 0;
> +}
> +
>  /*
>   * Locking for serialisation of IO during page faults. This results in a lock
>   * ordering of:
> @@ -1024,6 +1054,7 @@ __xfs_filemap_fault(
>  	enum page_entry_size	pe_size,
>  	bool			write_fault)
>  {
> +	struct vm_area_struct	*vma = vmf->vma;
>  	struct inode		*inode = file_inode(vmf->vma->vm_file);

You missed converting this vmf->vma....

.....
>  
> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT)
> +
> +STATIC int
> +xfs_file_mmap_validate(
> +	struct file		*filp,
> +	struct vm_area_struct	*vma,
> +	unsigned long		map_flags,
> +	int			fd)
> +{
> +	struct inode		*inode = file_inode(filp);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct map_direct_state	*mds;
> +
> +	if (map_flags & ~(XFS_MAP_SUPPORTED))
> +		return -EOPNOTSUPP;
> +
> +	if ((map_flags & MAP_DIRECT) == 0)
> +		return xfs_file_mmap(filp, vma);
> +
> +	file_accessed(filp);
> +	vma->vm_ops = &xfs_file_vm_direct_ops;
> +	if (IS_DAX(inode))
> +		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;

And if it isn't a DAX inode? What is MAP_DIRECT supposed to do then?

> +	mds = map_direct_register(fd, vma);
> +	if (IS_ERR(mds))
> +		return PTR_ERR(mds);
> +
> +	/* flush in-flight faults */
> +	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);

Urk. That's nasty. And why is it even necessary? Please explain in
the comment, because it's not at all obvious to me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 09/12] xfs: wire up ->lease_direct()
@ 2017-10-09  3:45     ` Dave Chinner
  0 siblings, 0 replies; 158+ messages in thread
From: Dave Chinner @ 2017-10-09  3:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-rdma,
	linux-api, linux-nvdimm, linux-xfs, linux-mm, linux-fsdevel,
	Jeff Layton, Christoph Hellwig

On Fri, Oct 06, 2017 at 03:36:06PM -0700, Dan Williams wrote:
> A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
> mapping established. For xfs we establish a new lease and then check if
> the MAP_DIRECT mapping has been broken. We want to be sure that the
> process will receive notification that the MAP_DIRECT mapping is being
> torn down so it knows why other code paths are throwing failures.
> 
> For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the
> MAP_DIRECT mapping is invalid or in the process of being invalidated.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Jeff Layton <jlayton@poochiereds.net>
> Cc: "J. Bruce Fields" <bfields@fieldses.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c |   28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e35518600e28..823b65f17429 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1166,6 +1166,33 @@ xfs_filemap_direct_close(
>  	put_map_direct_vma(vma->vm_private_data);
>  }
>  
> +static struct lease_direct *
> +xfs_filemap_direct_lease(
> +	struct vm_area_struct	*vma,
> +	void			(*break_fn)(void *),
> +	void			*owner)
> +{
> +	struct lease_direct	*ld;
> +
> +	ld = map_direct_lease(vma, break_fn, owner);
> +
> +	if (IS_ERR(ld))
> +		return ld;
> +
> +	/*
> +	 * We now have an established lease while the base MAP_DIRECT
> +	 * lease was not broken. So, we know that the "lease holder" will
> +	 * receive a SIGIO notification when the lease is broken and
> +	 * take any necessary cleanup actions.
> +	 */
> +	if (!is_map_direct_broken(vma->vm_private_data))
> +		return ld;
> +
> +	map_direct_lease_destroy(ld);
> +
> +	return ERR_PTR(-ENXIO);

What's any of this got to do with XFS? Shouldn't it be in generic
code, and called generic_filemap_direct_lease()?
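
i.e. lift it more or less verbatim (sketch; body unchanged from the
patch, comment trimmed):

	struct lease_direct *
	generic_filemap_direct_lease(struct vm_area_struct *vma,
			void (*break_fn)(void *), void *owner)
	{
		struct lease_direct *ld;

		ld = map_direct_lease(vma, break_fn, owner);
		if (IS_ERR(ld))
			return ld;
		if (!is_map_direct_broken(vma->vm_private_data))
			return ld;
		map_direct_lease_destroy(ld);
		return ERR_PTR(-ENXIO);
	}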

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain()
@ 2017-10-09 10:37       ` Robin Murphy
  0 siblings, 0 replies; 158+ messages in thread
From: Robin Murphy @ 2017-10-09 10:37 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	linux-mm, linux-api, linux-fsdevel, David Woodhouse,
	Christoph Hellwig, Marek Szyprowski

Hi Dan,

On 08/10/17 04:45, Dan Williams wrote:
> Add a dma-mapping api helper to retrieve the generic iommu_domain for a device.
> The motivation for this interface is making RDMA transfers to DAX mappings
> safe. If the DAX file's block map changes we need to be able to reliably
> stop accesses to blocks that have been freed or re-assigned to a new file.

...which is also going to require some way to force the IOMMU drivers
(on x86 at least) to do a fully-synchronous unmap, instead of just
throwing the IOVA onto a flush queue to invalidate the TLBs at some
point in the future. Assuming of course that there's an IOMMU both
present and performing DMA translation in the first place.

> With the
> iommu_domain and a callback from the DAX filesystem the kernel can safely
> revoke access to a DMA device. The process that performed the RDMA memory
> registration is also notified of this revocation event, but the kernel cannot
> otherwise be put in the position of waiting for userspace to quiesce the device.

OK, but why reinvent iommu_get_domain_for_dev()?
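
For reference, the existing generic accessor referred to here is
declared in include/linux/iommu.h as:

	struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);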

> Since PMEM+DAX is currently only enabled for x86, we only update the x86
> iommu drivers.

Note in particular that those two drivers happen to be the *only* place
this approach could work - everyone else is going to have to fall back
to the generic IOMMU API function anyway.

Robin.

> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since v7:
> * retrieve the iommu_domain so that we can later pass the results of
>   dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*.
> 
>  drivers/base/dma-mapping.c  |   10 ++++++++++
>  drivers/iommu/amd_iommu.c   |   10 ++++++++++
>  drivers/iommu/intel-iommu.c |   15 +++++++++++++++
>  include/linux/dma-mapping.h |    3 +++
>  4 files changed, 38 insertions(+)
> 
> diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
> index e584eddef0a7..fdb9764f95a4 100644
> --- a/drivers/base/dma-mapping.c
> +++ b/drivers/base/dma-mapping.c
> @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
>  	of_dma_deconfigure(dev);
>  	acpi_dma_deconfigure(dev);
>  }
> +
> +struct iommu_domain *dma_get_iommu_domain(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	if (ops && ops->get_iommu)
> +		return ops->get_iommu(dev);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(dma_get_iommu_domain);
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 51f8215877f5..c8e1a45af182 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
>  	return domain;
>  }
>  
> +static struct iommu_domain *amd_dma_get_iommu(struct device *dev)
> +{
> +	struct protection_domain *domain = get_domain(dev);
> +
> +	if (IS_ERR(domain))
> +		return NULL;
> +	return &domain->domain;
> +}
> +
>  static void update_device_table(struct protection_domain *domain)
>  {
>  	struct iommu_dev_data *dev_data;
> @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
>  	.unmap_sg	= unmap_sg,
>  	.dma_supported	= amd_iommu_dma_supported,
>  	.mapping_error	= amd_iommu_mapping_error,
> +	.get_iommu	= amd_dma_get_iommu,
>  };
>  
>  static int init_reserved_iova_ranges(void)
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05dd6b2..f3f4939cebad 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
>  	return 0;
>  }
>  
> +static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
> +{
> +	struct dmar_domain *domain;
> +
> +	if (iommu_no_mapping(dev))
> +		return NULL;
> +
> +	domain = get_valid_domain_for_dev(dev);
> +	if (!domain)
> +		return NULL;
> +
> +	return &domain->domain;
> +}
> +
>  static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
>  				     size_t size, int dir, u64 dma_mask)
>  {
> @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
>  	.map_page = intel_map_page,
>  	.unmap_page = intel_unmap_page,
>  	.mapping_error = intel_mapping_error,
> +	.get_iommu = intel_dma_get_iommu,
>  #ifdef CONFIG_X86
>  	.dma_supported = x86_dma_supported,
>  #endif
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 29ce9815da87..aa62df1d0d72 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -128,6 +128,7 @@ struct dma_map_ops {
>  				   enum dma_data_direction dir);
>  	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
>  	int (*dma_supported)(struct device *dev, u64 mask);
> +	struct iommu_domain *(*get_iommu)(struct device *dev);
>  #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
>  	u64 (*get_required_mask)(struct device *dev);
>  #endif
> @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
>  }
>  #endif
>  
> +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
> +
>  static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  					      size_t size,
>  					      enum dma_data_direction dir,
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT
  2017-10-09  3:40     ` Dave Chinner
@ 2017-10-09 17:08       ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-09 17:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-xfs, Jan Kara, Arnd Bergmann,
	Darrick J. Wong, linux-rdma, Linux API, Christoph Hellwig,
	J. Bruce Fields, Linux MM, Jeff Moyer, Alexander Viro,
	linux-fsdevel, Jeff Layton, Ross Zwisler

On Sun, Oct 8, 2017 at 8:40 PM, Dave Chinner <david@fromorbit.com> wrote:

Thanks for the review, Dave.

> On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote:
>> MAP_DIRECT is an mmap(2) flag with the following semantics:
>>
>>   MAP_DIRECT
>>   When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
>>   same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
>>   is broken when a "lease breaker" attempts to write(2), change the block
>>   map (fallocate), or change the size of the file. Otherwise the mechanism
>>   of a lease break is identical to the typical lease break case where the
>>   lease needs to be removed (munmap) within the number of seconds
>>   specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
>>   remove the lease in time the kernel will invalidate the mapping and
>>   force all future accesses to the mapping to trigger SIGBUS.
>>
>>   In addition to lease break timeouts causing faults in the mapping to
>>   result in SIGBUS, other states of the file will trigger SIGBUS at fault
>>   time:
>>
>>       * The file is not DAX capable
>>       * The file has reflinked (copy-on-write) blocks
>>       * The fault would trigger the filesystem to allocate blocks
>>       * The fault would trigger the filesystem to perform extent conversion
>>
>>   In other words, MAP_DIRECT expects and enforces a fully allocated file
>>   where faults can be satisfied without modifying block map metadata.
>>
>>   An unprivileged process may establish a MAP_DIRECT mapping on a file
>>   whose UID (owner) matches the filesystem UID of the process. A process
>>   with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
>>   arbitrary files.
>>
>>   ERRORS
>>   EACCES Beyond the typical mmap(2) conditions that trigger EACCES
>>   MAP_DIRECT also requires the permission to set a file lease.
>>
>>   EOPNOTSUPP The filesystem explicitly does not support the flag
>>
>>   SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
>>          might require block-map updates, or the lease timed out and the
>>          kernel invalidated the mapping.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/Kconfig                  |    2 -
>>  fs/xfs/xfs_file.c               |  102 +++++++++++++++++++++++++++++++++++++++
>>  include/linux/mman.h            |    3 +
>>  include/uapi/asm-generic/mman.h |    1
>>  4 files changed, 106 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
>> index f62fc6629abb..f8765653a438 100644
>> --- a/fs/xfs/Kconfig
>> +++ b/fs/xfs/Kconfig
>> @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
>>
>>  config XFS_LAYOUT
>>       def_bool y
>> -     depends on EXPORTFS_BLOCK_OPS
>> +     depends on EXPORTFS_BLOCK_OPS || FS_DAX
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index ebdd0bd2b261..e35518600e28 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -40,12 +40,22 @@
>>  #include "xfs_iomap.h"
>>  #include "xfs_reflink.h"
>>
>> +#include <linux/mman.h>
>>  #include <linux/dcache.h>
>>  #include <linux/falloc.h>
>>  #include <linux/pagevec.h>
>> +#include <linux/mapdirect.h>
>>  #include <linux/backing-dev.h>
>>
>>  static const struct vm_operations_struct xfs_file_vm_ops;
>> +static const struct vm_operations_struct xfs_file_vm_direct_ops;
>> +
>> +static inline bool
>> +is_xfs_map_direct(
>> +             struct vm_area_struct *vma)
>> +{
>> +     return vma->vm_ops == &xfs_file_vm_direct_ops;
>> +}
>
> Namespacing (xfs_vma_is_direct) and whitespace damage.

Will fix.

>
>>
>>  /*
>>   * Clear the specified ranges to zero through either the pagecache or DAX.
>> @@ -1008,6 +1018,26 @@ xfs_file_llseek(
>>       return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
>>  }
>>
>> +static int
>> +xfs_vma_checks(
>> +     struct vm_area_struct   *vma,
>> +     struct inode            *inode)
>
> Exactly what are we checking for - function name doesn't tell me,
> and there's no comments, either?

Ok, I'll improve this.

>
>> +{
>> +     if (!is_xfs_map_direct(vma))
>> +             return 0;
>> +
>> +     if (!is_map_direct_valid(vma->vm_private_data))
>> +             return VM_FAULT_SIGBUS;
>> +
>> +     if (xfs_is_reflink_inode(XFS_I(inode)))
>> +             return VM_FAULT_SIGBUS;
>> +
>> +     if (!IS_DAX(inode))
>> +             return VM_FAULT_SIGBUS;
>
> And how do we get is_xfs_map_direct() set to true if we don't have a
> DAX inode or the inode has shared extents?

So, this was my way of trying to satisfy the request you made here:

    https://lkml.org/lkml/2017/8/11/876

i.e. allow MAP_DIRECT on non-DAX files to enable a use case of
freezing the block-map to examine which file extents are linked. If
you don't want to use MAP_DIRECT for this, we can move these checks to
mmap time.

>
>> +
>> +     return 0;
>> +}
>> +
>>  /*
>>   * Locking for serialisation of IO during page faults. This results in a lock
>>   * ordering of:
>> @@ -1024,6 +1054,7 @@ __xfs_filemap_fault(
>>       enum page_entry_size    pe_size,
>>       bool                    write_fault)
>>  {
>> +     struct vm_area_struct   *vma = vmf->vma;
>>       struct inode            *inode = file_inode(vmf->vma->vm_file);
>
> You missed this vmf->vma....
>
> .....
>>
>> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT)
>> +
>> +STATIC int
>> +xfs_file_mmap_validate(
>> +     struct file             *filp,
>> +     struct vm_area_struct   *vma,
>> +     unsigned long           map_flags,
>> +     int                     fd)
>> +{
>> +     struct inode            *inode = file_inode(filp);
>> +     struct xfs_inode        *ip = XFS_I(inode);
>> +     struct map_direct_state *mds;
>> +
>> +     if (map_flags & ~(XFS_MAP_SUPPORTED))
>> +             return -EOPNOTSUPP;
>> +
>> +     if ((map_flags & MAP_DIRECT) == 0)
>> +             return xfs_file_mmap(filp, vma);
>> +
>> +     file_accessed(filp);
>> +     vma->vm_ops = &xfs_file_vm_direct_ops;
>> +     if (IS_DAX(inode))
>> +             vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>
> And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then?

In the non-DAX case it just takes the FL_LAYOUT file lease... although
we could also just have an fcntl for that purpose. The use case of
just freezing the block map does not need a mapping.
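
For illustration, a minimal userspace consumer under these semantics
would look something like the sketch below. MAP_DIRECT is the flag this
series proposes, so the value here is only a placeholder, and error
handling is abbreviated:

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x08	/* placeholder; the real value comes from the uapi headers */
#endif

int main(void)
{
	size_t len = 2UL << 20;
	void *addr;
	int fd;

	fd = open("/mnt/pmem/data", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * MAP_SHARED_VALIDATE makes the kernel reject unknown mmap
	 * flags instead of silently ignoring them, so a filesystem
	 * without MAP_DIRECT support fails here with EOPNOTSUPP.
	 */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Stores to addr can now be flushed from userspace. A lease
	 * breaker triggers SIGIO, and once lease-break-time expires
	 * further faults raise SIGBUS.
	 */

	munmap(addr, len);
	close(fd);
	return 0;
}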

>> +     mds = map_direct_register(fd, vma);
>> +     if (IS_ERR(mds))
>> +             return PTR_ERR(mds);
>> +
>> +     /* flush in-flight faults */
>> +     xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>> +     xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>
> Urk. That's nasty. And why is it even necessary? Please explain why
> this is necessary in the comment, because it's not at all obvious to
> me...

This is related to your other observation about i_mapdcount and adding
an iomap_can_allocate() helper. I think I can clean both of these up
by using a call to break_layout(inode, false) and bailing in
->iomap_begin() if it returns EWOULDBLOCK. This would also fix the
current problem that allocating write-faults don't start the lease
break process.
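
Roughly, the ->iomap_begin() side of that plan would be a check along
these lines (a sketch of the intent only, using the existing
break_layout() helper from include/linux/fs.h):

/*
 * Refuse allocating writes while a MAP_DIRECT lease is present.
 * break_layout(inode, false) starts the lease break process and
 * returns -EWOULDBLOCK without waiting for it to complete.
 */
if (flags & IOMAP_WRITE) {
	error = break_layout(inode, false);
	if (error == -EWOULDBLOCK)
		return error;
}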

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 09/12] xfs: wire up ->lease_direct()
  2017-10-09  3:45     ` Dave Chinner
@ 2017-10-09 17:10       ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-09 17:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, linux-xfs, Jan Kara, Darrick J. Wong, linux-rdma,
	Linux API, Christoph Hellwig, J. Bruce Fields, Linux MM,
	Jeff Moyer, linux-fsdevel, Jeff Layton, Ross Zwisler

On Sun, Oct 8, 2017 at 8:45 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Oct 06, 2017 at 03:36:06PM -0700, Dan Williams wrote:
>> A 'lease_direct' lease requires that the vma have a valid MAP_DIRECT
>> mapping established. For xfs we establish a new lease and then check if
>> the MAP_DIRECT mapping has been broken. We want to be sure that the
>> process will receive notification that the MAP_DIRECT mapping is being
>> torn down so it knows why other code paths are throwing failures.
>>
>> For example in the RDMA/ibverbs case we want ibv_reg_mr() to fail if the
>> MAP_DIRECT mapping is invalid or in the process of being invalidated.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Jeff Layton <jlayton@poochiereds.net>
>> Cc: "J. Bruce Fields" <bfields@fieldses.org>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/xfs_file.c |   28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index e35518600e28..823b65f17429 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -1166,6 +1166,33 @@ xfs_filemap_direct_close(
>>       put_map_direct_vma(vma->vm_private_data);
>>  }
>>
>> +static struct lease_direct *
>> +xfs_filemap_direct_lease(
>> +     struct vm_area_struct   *vma,
>> +     void                    (*break_fn)(void *),
>> +     void                    *owner)
>> +{
>> +     struct lease_direct     *ld;
>> +
>> +     ld = map_direct_lease(vma, break_fn, owner);
>> +
>> +     if (IS_ERR(ld))
>> +             return ld;
>> +
>> +     /*
>> +      * We now have an established lease while the base MAP_DIRECT
>> +      * lease was not broken. So, we know that the "lease holder" will
>> +      * receive a SIGIO notification when the lease is broken and
>> +      * take any necessary cleanup actions.
>> +      */
>> +     if (!is_map_direct_broken(vma->vm_private_data))
>> +             return ld;
>> +
>> +     map_direct_lease_destroy(ld);
>> +
>> +     return ERR_PTR(-ENXIO);
>
> What's any of this got to do with XFS? Shouldn't it be in generic
> code, and called generic_filemap_direct_lease()?

True, I can move this to generic code. The filesystem is in charge of
where it wants to store the 'struct map_direct_state' context, but for
generic_filemap_direct_lease() it can just assume that it is stored in
->vm_private_data. I'll add comments to this effect on the new
routine.
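
Concretely, the generic helper would be little more than the function
quoted above with the xfs_ prefix dropped; a sketch, assuming the
map_direct_state is kept in ->vm_private_data as described:

struct lease_direct *
generic_filemap_direct_lease(struct vm_area_struct *vma,
		void (*break_fn)(void *), void *owner)
{
	struct lease_direct *ld;

	ld = map_direct_lease(vma, break_fn, owner);
	if (IS_ERR(ld))
		return ld;

	/*
	 * The lease was established while the base MAP_DIRECT lease
	 * was unbroken, so the lease holder will receive a SIGIO
	 * notification when it is broken and can take any necessary
	 * cleanup actions.
	 */
	if (!is_map_direct_broken(vma->vm_private_data))
		return ld;

	map_direct_lease_destroy(ld);
	return ERR_PTR(-ENXIO);
}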

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain()
  2017-10-09 10:37       ` Robin Murphy
@ 2017-10-09 17:32         ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-09 17:32 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Mon, Oct 9, 2017 at 3:37 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Dan,
>
> On 08/10/17 04:45, Dan Williams wrote:
>> Add a dma-mapping api helper to retrieve the generic iommu_domain for a device.
>> The motivation for this interface is making RDMA transfers to DAX mappings
>> safe. If the DAX file's block map changes we need to be able to reliably stop
>> accesses to blocks that have been freed or re-assigned to a new file.
>
> ...which is also going to require some way to force the IOMMU drivers
> (on x86 at least) to do a fully-synchronous unmap, instead of just
> throwing the IOVA onto a flush queue to invalidate the TLBs at some
> point in the future.

Isn't that the difference between iommu_unmap() and
iommu_unmap_fast()? As far as I can tell amd-iommu and intel-iommu
both flush iotlbs on iommu_unmap() and don't support fast unmaps.

> Assuming of course that there's an IOMMU both
> present and performing DMA translation in the first place.

That's why I want to call through the dma api to see if the iommu is
being used to satisfy dma mappings.

>> With the
>> iommu_domain and a callback from the DAX filesystem the kernel can safely
>> revoke access to a DMA device. The process that performed the RDMA memory
>> registration is also notified of this revocation event, but the kernel can not
>> otherwise be in the position of waiting for userspace to quiesce the device.
>
> OK, but why reinvent iommu_get_domain_for_dev()?

How do I know if the iommu returned from that routine is the one being
used for dma mapping operations for the device? Specifically, how
would I discover that the result of dma_map_sg() can be passed as an
IOVA range to iommu_unmap()?
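
For concreteness, the pairing in question would look roughly like this
(a sketch built around the dma_get_iommu_domain() helper proposed in
this thread; dev/sgl/count are stand-ins, and whether the dma_map_sg()
addresses are in fact valid IOVAs for iommu_unmap() is exactly the
open question):

	struct iommu_domain *domain = dma_get_iommu_domain(dev);
	struct scatterlist *sg;
	int nents, i;

	nents = dma_map_sg(dev, sgl, count, DMA_BIDIRECTIONAL);

	/* ...later, the block map changes and access must be revoked
	 * ahead of the eventual memory un-registration... */
	for_each_sg(sgl, sg, nents, i)
		iommu_unmap(domain, sg_dma_address(sg), sg_dma_len(sg));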

>> Since PMEM+DAX is currently only enabled for x86, we only update the x86
>> iommu drivers.
>
> Note in particular that those two drivers happen to be the *only* place
> this approach could work - everyone else is going to have to fall back
> to the generic IOMMU API function anyway.

I want to make this functionality generic, but I'm not familiar with
the iommu sub-system. How are dma mapping operations routed to the
iommu driver in those other implementations?

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-09 18:58     ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-09 18:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, linux-mm, linux-api, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> otherwise be quiesced. The need for this knowledge is driven by a need
> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> changes we need to be able to reliably stop accesses to blocks that have been
> freed or re-assigned to a new file.

If RDMA is driving this need, why not invalidate backing RDMA MRs
instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
do not suffer from the re-use problem David W. brought up with IOVAs..

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-09 18:58     ` Jason Gunthorpe
@ 2017-10-09 19:05       ` Dan Williams
  -1 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-09 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	David Woodhouse, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
>> otherwise be quiesced. The need for this knowledge is driven by a need
>> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> changes we need to be able to reliably stop accesses to blocks that have been
>> freed or re-assigned to a new file.
>
> If RDMA is driving this need, why not invalidate backing RDMA MRs
> instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
> do not suffer from the re-use problem David W. brought up with IOVAs..

Sounds promising. All I want in the end is to be sure that the kernel
is able to stop any in-flight RDMA at will without asking
userspace. Does this require per-RDMA driver opt-in or is there a
common call that can be made?

Outside of that, the re-use problem is already solved by just unmapping
(iommu_unmap()) the IOVA, but keeping it allocated until the eventual
dma_unmap_sg() at memory un-registration time frees it.
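
In sketch form, with domain/iova/size standing in for values saved at
registration time (again assuming the dma_map_sg() result is a usable
IOVA):

	/* lease break: tear down the translation now, so DMA to the
	 * freed blocks faults in the IOMMU instead of hitting re-used
	 * pages */
	iommu_unmap(domain, iova, size);

	/* the IOVA range stays allocated and cannot be handed out
	 * again; it is only released when dma_unmap_sg() runs at
	 * un-registration time */
	dma_unmap_sg(dev, sgl, nents, DMA_BIDIRECTIONAL);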

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-09 19:18         ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-09 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote:
> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> >> otherwise be quiesced. The need for this knowledge is driven by a need
> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> >> changes we need to be able to reliably stop accesses to blocks that have been
> >> freed or re-assigned to a new file.
> >
> > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
> > do not suffer from the re-use problem David W. brought up with IOVAs..
> 
> Sounds promising. All I want in the end is to be sure that the kernel
> is able to stop any in-flight RDMA at will without asking
> userspace. Does this require per-RDMA driver opt-in or is there a
> common call that can be made?

I don't think this has ever come up in the context of an all-device MR
invalidate requirement. Drivers already have code to invalidate
specific MRs, but to find all MRs that touch certain pages and then
invalidate them would be new code.

We also have ODP-aware drivers that can retarget an MR to new
physical pages. If the block map changes, DAX should synchronously
retarget the ODP MR, not halt DMA.

Most likely ODP & DAX would need to be used together to get robust
user applications, as having the user QPs go to an error state at
random times (due to DMA failures) during operation is never going to
be acceptable...

Perhaps you might want to initially only support ODP MR mappings with
DAX and then the DMA fencing issue goes away?
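
For reference, from userspace an ODP registration is just an extra
access flag to libibverbs (a sketch; IBV_ACCESS_ON_DEMAND needs a
device and kernel with ODP support, and the helper name is made up):

#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *addr,
		size_t length)
{
	/* pages are faulted in on demand and can be re-targeted by
	 * the kernel, so there is no pinned block map to fence */
	struct ibv_mr *mr = ibv_reg_mr(pd, addr, length,
			IBV_ACCESS_LOCAL_WRITE |
			IBV_ACCESS_REMOTE_READ |
			IBV_ACCESS_REMOTE_WRITE |
			IBV_ACCESS_ON_DEMAND);

	if (!mr)
		fprintf(stderr, "ODP registration not supported\n");
	return mr;
}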

Cheers,
Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-09 19:28           ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-09 19:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Mon, Oct 9, 2017 at 12:18 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote:
>> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
>> >> otherwise be quiesced. The need for this knowledge is driven by a need
>> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
>> >> changes we need to be able to reliably stop accesses to blocks that have been
>> >> freed or re-assigned to a new file.
>> >
>> > If RDMA is driving this need, why not invalidate backing RDMA MRs
>> > instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
>> > do not suffer from the re-use problem David W. brought up with IOVAs..
>>
>> Sounds promising. All I want in the end is to be sure that the kernel
>> is able to stop any in-flight RDMA at will without asking
>> userspace. Does this require per-RDMA driver opt-in or is there a
>> common call that can be made?
>
> I don't think this has ever come up in the context of an all-device MR
> invalidate requirement. Drivers already have code to invalidate
> specific MRs, but to find all MRs that touch certain pages and then
> invalidate them would be new code.
>
> We also have ODP-aware drivers that can retarget an MR to new
> physical pages. If the block map changes, DAX should synchronously
> retarget the ODP MR, not halt DMA.

Have a look at the patch [1]; I don't touch the ODP path.

> Most likely ODP & DAX would need to be used together to get robust
> user applications, as having the user QPs go to an error state at
> random times (due to DMA failures) during operation is never going to
> be acceptable...

It's not random. The process that set up the mapping and registered
the memory gets SIGIO when someone else tries to modify the file map.
That process then gets /proc/sys/fs/lease-break-time seconds to fix
the problem before the kernel force-revokes the DMA access.

It's otherwise not acceptable to allow DMA into random locations when
the file map changes.
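
From the application's side that flow is roughly the following (a
sketch: watch_lease() and the cleanup policy are application-specific;
only the SIGIO delivery and the lease-break-time budget come from the
lease code):

#include <signal.h>

static volatile sig_atomic_t lease_broken;

static void sigio_handler(int sig)
{
	/* the kernel has started breaking the MAP_DIRECT lease */
	lease_broken = 1;
}

int watch_lease(void)
{
	struct sigaction sa = { .sa_handler = sigio_handler };

	/* once lease_broken fires, the app must quiesce RDMA,
	 * deregister the MR, and munmap() within
	 * /proc/sys/fs/lease-break-time seconds, or the kernel
	 * revokes DMA access on its own */
	return sigaction(SIGIO, &sa, NULL);
}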

> Perhaps you might want to initially only support ODP MR mappings with
> DAX and then the DMA fencing issue goes away?

I'd rather try to fix the non-ODP DAX case instead of just turning it off.

[1]: https://patchwork.kernel.org/patch/9991681/

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 06/12] xfs: wire up MAP_DIRECT
  2017-10-09 17:08       ` Dan Williams
@ 2017-10-09 22:50         ` Dave Chinner
  -1 siblings, 0 replies; 158+ messages in thread
From: Dave Chinner @ 2017-10-09 22:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-xfs, Jan Kara, Arnd Bergmann,
	Darrick J. Wong, linux-rdma, Linux API, Christoph Hellwig,
	J. Bruce Fields, Linux MM, Jeff Moyer, Alexander Viro,
	linux-fsdevel, Jeff Layton, Ross Zwisler

On Mon, Oct 09, 2017 at 10:08:40AM -0700, Dan Williams wrote:
> On Sun, Oct 8, 2017 at 8:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >>  /*
> >>   * Clear the specified ranges to zero through either the pagecache or DAX.
> >> @@ -1008,6 +1018,26 @@ xfs_file_llseek(
> >>       return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
> >>  }
> >>
> >> +static int
> >> +xfs_vma_checks(
> >> +     struct vm_area_struct   *vma,
> >> +     struct inode            *inode)
> >
> > Exactly what are we checking for - function name doesn't tell me,
> > and there's no comments, either?
> 
> Ok, I'll improve this.
> 
> >
> >> +{
> >> +     if (!is_xfs_map_direct(vma))
> >> +             return 0;
> >> +
> >> +     if (!is_map_direct_valid(vma->vm_private_data))
> >> +             return VM_FAULT_SIGBUS;
> >> +
> >> +     if (xfs_is_reflink_inode(XFS_I(inode)))
> >> +             return VM_FAULT_SIGBUS;
> >> +
> >> +     if (!IS_DAX(inode))
> >> +             return VM_FAULT_SIGBUS;
> >
> > And how do we get is_xfs_map_direct() set to true if we don't have a
> > DAX inode or the inode has shared extents?
> 
> So, this was my way of trying to satisfy the request you made here:
> 
>     https://lkml.org/lkml/2017/8/11/876
> 
> i.e. allow MAP_DIRECT on non-dax files to enable a use case of
> freezing the block-map to examine which file extents are linked. If
> you don't want to use MAP_DIRECT for this, we can move these checks to
> mmap time.

Ok, but I don't want to use mmap to deal with this, nor do I care
whether DAX is in use or not. So I don't think this is really
necessary for MAP_DIRECT.


> >> +xfs_file_mmap_validate(
> >> +     struct file             *filp,
> >> +     struct vm_area_struct   *vma,
> >> +     unsigned long           map_flags,
> >> +     int                     fd)
> >> +{
> >> +     struct inode            *inode = file_inode(filp);
> >> +     struct xfs_inode        *ip = XFS_I(inode);
> >> +     struct map_direct_state *mds;
> >> +
> >> +     if (map_flags & ~(XFS_MAP_SUPPORTED))
> >> +             return -EOPNOTSUPP;
> >> +
> >> +     if ((map_flags & MAP_DIRECT) == 0)
> >> +             return xfs_file_mmap(filp, vma);
> >> +
> >> +     file_accessed(filp);
> >> +     vma->vm_ops = &xfs_file_vm_direct_ops;
> >> +     if (IS_DAX(inode))
> >> +             vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> >
> > And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then?
> 
> In the non-DAX case it just takes the FL_LAYOUT file lease... although
> we could also just have an fcntl for that purpose. The use case of
> just freezing the block map does not need a mapping.

Right, so I think we should just add an fcntl for the non-DAX case I
have in mind, and not complicate the MAP_DIRECT implementation right
now.  We can always extend the scope of MAP_DIRECT in future if we
actually need to do so.

> >> +     mds = map_direct_register(fd, vma);
> >> +     if (IS_ERR(mds))
> >> +             return PTR_ERR(mds);
> >> +
> >> +     /* flush in-flight faults */
> >> +     xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> >> +     xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> >
> > Urk. That's nasty. And why is it even necessary? Please explain why
> > this is necessary in the comment, because it's not at all obvious to
> > me...
> 
> This is related to your other observation about i_mapdcount and adding
> an iomap_can_allocate() helper. I think I can clean both of these up
> by using a call to break_layout(inode, false) and bailing in
> ->iomap_begin() if it returns EWOULDBLOCK. This would also fix the
> current problem that allocating write-faults don't start the lease
> break process.
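
A sketch of that direction (xfs_iomap_check_lease() is a hypothetical
helper name; break_layout() is the existing FL_LAYOUT helper and, with
a false 'wait' argument, returns -EWOULDBLOCK while a lease is still
outstanding):

/* called from ->iomap_begin() before any allocating write */
static int
xfs_iomap_check_lease(struct inode *inode, unsigned flags)
{
	if (!(flags & IOMAP_WRITE))
		return 0;

	/*
	 * Non-blocking lease break: notifies the MAP_DIRECT lease
	 * holder and fails until the lease is gone, so allocating
	 * write-faults both start the lease break and cannot change
	 * the block map underneath the mapping.
	 */
	return break_layout(inode, false);
}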

OK.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain()
  2017-10-08  3:45     ` Dan Williams
@ 2017-10-10 14:40       ` Raj, Ashok
  -1 siblings, 0 replies; 158+ messages in thread
From: Raj, Ashok @ 2017-10-10 14:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler,
	David Woodhouse, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

Hi Dan

On Sat, Oct 07, 2017 at 08:45:00PM -0700, Dan Williams wrote:
> Add a dma-mapping api helper to retrieve the generic iommu_domain for a device.
> The motivation for this interface is making RDMA transfers to DAX mappings
> safe. If the DAX file's block map changes we need to be able to reliably stop
> accesses to blocks that have been freed or re-assigned to a new file. With the
> iommu_domain and a callback from the DAX filesystem the kernel can safely
> revoke access to a DMA device. The process that performed the RDMA memory
> registration is also notified of this revocation event, but the kernel can not
> otherwise be in the position of waiting for userspace to quiesce the device.
> 
> Since PMEM+DAX is currently only enabled for x86, we only update the x86
> iommu drivers.
> 
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since v7:
> * retrieve the iommu_domain so that we can later pass the results of
>   dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*.
> 
>  drivers/base/dma-mapping.c  |   10 ++++++++++
>  drivers/iommu/amd_iommu.c   |   10 ++++++++++
>  drivers/iommu/intel-iommu.c |   15 +++++++++++++++
>  include/linux/dma-mapping.h |    3 +++
>  4 files changed, 38 insertions(+)
> 
> diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
> index e584eddef0a7..fdb9764f95a4 100644
> --- a/drivers/base/dma-mapping.c
> +++ b/drivers/base/dma-mapping.c
> @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
>  	of_dma_deconfigure(dev);
>  	acpi_dma_deconfigure(dev);
>  }
> +
> +struct iommu_domain *dma_get_iommu_domain(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	if (ops && ops->get_iommu)
> +		return ops->get_iommu(dev);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(dma_get_iommu_domain);
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 51f8215877f5..c8e1a45af182 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
>  	return domain;
>  }
>  
> +static struct iommu_domain *amd_dma_get_iommu(struct device *dev)

Minor: Do you want to keep the naming consistent, e.g. amd_dma_get_domain() vs
get_iommu?

> +{
> +	struct protection_domain *domain = get_domain(dev);
> +
> +	if (IS_ERR(domain))
> +		return NULL;
> +	return &domain->domain;
> +}
> +
>  static void update_device_table(struct protection_domain *domain)
>  {
>  	struct iommu_dev_data *dev_data;
> @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
>  	.unmap_sg	= unmap_sg,
>  	.dma_supported	= amd_iommu_dma_supported,
>  	.mapping_error	= amd_iommu_mapping_error,
> +	.get_iommu	= amd_dma_get_iommu,

Ditto here and at the other places below:

>  };
>  
>  static int init_reserved_iova_ranges(void)
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05dd6b2..f3f4939cebad 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
>  	return 0;
>  }
>  
> +static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
> +{
> +	struct dmar_domain *domain;
> +
> +	if (iommu_no_mapping(dev))
> +		return NULL;
> +
> +	domain = get_valid_domain_for_dev(dev);
> +	if (!domain)
> +		return NULL;
> +
> +	return &domain->domain;
> +}
> +
>  static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
>  				     size_t size, int dir, u64 dma_mask)
>  {
> @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
>  	.map_page = intel_map_page,
>  	.unmap_page = intel_unmap_page,
>  	.mapping_error = intel_mapping_error,
> +	.get_iommu = intel_dma_get_iommu,
>  #ifdef CONFIG_X86
>  	.dma_supported = x86_dma_supported,
>  #endif
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 29ce9815da87..aa62df1d0d72 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -128,6 +128,7 @@ struct dma_map_ops {
>  				   enum dma_data_direction dir);
>  	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
>  	int (*dma_supported)(struct device *dev, u64 mask);
> +	struct iommu_domain *(*get_iommu)(struct device *dev);
>  #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
>  	u64 (*get_required_mask)(struct device *dev);
>  #endif
> @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
>  }
>  #endif
>  
> +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
> +
>  static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  					      size_t size,
>  					      enum dma_data_direction dir,
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v8] dma-mapping: introduce dma_get_iommu_domain()
@ 2017-10-10 14:40       ` Raj, Ashok
  0 siblings, 0 replies; 158+ messages in thread
From: Raj, Ashok @ 2017-10-10 14:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Jan Kara, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	linux-mm, Jeff Moyer, linux-api, linux-fsdevel, Ross Zwisler,
	David Woodhouse, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

Hi Dan

On Sat, Oct 07, 2017 at 08:45:00PM -0700, Dan Williams wrote:
> Add a dma-mapping api helper to retrieve the generic iommu_domain for a device.
> The motivation for this interface is making RDMA transfers to DAX mappings
> safe. If the DAX file's block map changes we need to be to reliably stop
> accesses to blocks that have been freed or re-assigned to a new file. With the
> iommu_domain and a callback from the DAX filesystem the kernel can safely
> revoke access to a DMA device. The process that performed the RDMA memory
> registration is also notified of this revocation event, but the kernel can not
> otherwise be in the position of waiting for userspace to quiesce the device.
> 
> Since PMEM+DAX is currently only enabled for x86, we only update the x86
> iommu drivers.
> 
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Changes since v7:
> * retrieve the iommu_domain so that we can later pass the results of
>   dma_map_* to iommu_unmap() in advance of the actual dma_unmap_*.
> 
>  drivers/base/dma-mapping.c  |   10 ++++++++++
>  drivers/iommu/amd_iommu.c   |   10 ++++++++++
>  drivers/iommu/intel-iommu.c |   15 +++++++++++++++
>  include/linux/dma-mapping.h |    3 +++
>  4 files changed, 38 insertions(+)
> 
> diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
> index e584eddef0a7..fdb9764f95a4 100644
> --- a/drivers/base/dma-mapping.c
> +++ b/drivers/base/dma-mapping.c
> @@ -369,3 +369,13 @@ void dma_deconfigure(struct device *dev)
>  	of_dma_deconfigure(dev);
>  	acpi_dma_deconfigure(dev);
>  }
> +
> +struct iommu_domain *dma_get_iommu_domain(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	if (ops && ops->get_iommu)
> +		return ops->get_iommu(dev);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(dma_get_iommu_domain);
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 51f8215877f5..c8e1a45af182 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2271,6 +2271,15 @@ static struct protection_domain *get_domain(struct device *dev)
>  	return domain;
>  }
>  
> +static struct iommu_domain *amd_dma_get_iommu(struct device *dev)

Minor: Do you want to keep the naming consistent.. amd_dma_get_domain() vs
get_iommu?

> +{
> +	struct protection_domain *domain = get_domain(dev);
> +
> +	if (IS_ERR(domain))
> +		return NULL;
> +	return &domain->domain;
> +}
> +
>  static void update_device_table(struct protection_domain *domain)
>  {
>  	struct iommu_dev_data *dev_data;
> @@ -2689,6 +2698,7 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
>  	.unmap_sg	= unmap_sg,
>  	.dma_supported	= amd_iommu_dma_supported,
>  	.mapping_error	= amd_iommu_mapping_error,
> +	.get_iommu	= amd_dma_get_iommu,

ditto for here and other places below:

>  };
>  
>  static int init_reserved_iova_ranges(void)
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05dd6b2..f3f4939cebad 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3578,6 +3578,20 @@ static int iommu_no_mapping(struct device *dev)
>  	return 0;
>  }
>  
> +static struct iommu_domain *intel_dma_get_iommu(struct device *dev)
> +{
> +	struct dmar_domain *domain;
> +
> +	if (iommu_no_mapping(dev))
> +		return NULL;
> +
> +	domain = get_valid_domain_for_dev(dev);
> +	if (!domain)
> +		return NULL;
> +
> +	return &domain->domain;
> +}
> +
>  static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
>  				     size_t size, int dir, u64 dma_mask)
>  {
> @@ -3872,6 +3886,7 @@ const struct dma_map_ops intel_dma_ops = {
>  	.map_page = intel_map_page,
>  	.unmap_page = intel_unmap_page,
>  	.mapping_error = intel_mapping_error,
> +	.get_iommu = intel_dma_get_iommu,
>  #ifdef CONFIG_X86
>  	.dma_supported = x86_dma_supported,
>  #endif
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 29ce9815da87..aa62df1d0d72 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -128,6 +128,7 @@ struct dma_map_ops {
>  				   enum dma_data_direction dir);
>  	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
>  	int (*dma_supported)(struct device *dev, u64 mask);
> +	struct iommu_domain *(*get_iommu)(struct device *dev);
>  #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
>  	u64 (*get_required_mask)(struct device *dev);
>  #endif
> @@ -221,6 +222,8 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
>  }
>  #endif
>  
> +extern struct iommu_domain *dma_get_iommu_domain(struct device *dev);
> +
>  static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  					      size_t size,
>  					      enum dma_data_direction dir,
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread
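
For illustration, a minimal sketch of how a consumer such as the RDMA core
might use the new helper to fence DMA at lease-break time, ahead of the
eventual dma_unmap_*; fence_dax_dma() and its error conventions are
assumptions, only dma_get_iommu_domain() and iommu_unmap() are taken as
given:

	#include <linux/dma-mapping.h>
	#include <linux/iommu.h>

	/*
	 * Revoke device access to a range previously set up with dma_map_*().
	 * After this returns the device takes IOMMU faults instead of reading
	 * or writing stale file blocks; the regular dma_unmap_*() call still
	 * runs later to release the IOVA.
	 */
	static int fence_dax_dma(struct device *dev, dma_addr_t dma_addr,
				 size_t size)
	{
		struct iommu_domain *domain = dma_get_iommu_domain(dev);

		if (!domain)
			return -EOPNOTSUPP; /* no IOMMU, nothing to revoke with */

		if (iommu_unmap(domain, (unsigned long)dma_addr, size) != size)
			return -EIO;
		return 0;
	}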

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-10 17:25             ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-10 17:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:

> > I don't think this has ever come up in the context of an all-device MR
> > invalidate requirement. Drivers already have code to invalidate
> > specific MRs, but to find all MRs that touch certain pages and then
> > invalidate them would be new code.
> >
> > We also have ODP aware drivers that can retarget a MR to new
> > physical pages. If the block map changes DAX should synchronously
> > retarget the ODP MR, not halt DMA.
> 
> Have a look at the patch [1], I don't touch the ODP path.

But, does ODP work OK already? I'm not clear on that..

> > Most likely ODP & DAX would need to be used together to get robust
> > user applications, as having the user QP's go to an error state at
> > random times (due to DMA failures) during operation is never going to
> > be acceptable...
> 
> It's not random. The process that set up the mapping and registered
> the memory gets SIGIO when someone else tries to modify the file map.
> That process then gets /proc/sys/fs/lease-break-time seconds to fix
> the problem before the kernel force revokes the DMA access.

Well, the process can't fix the problem in bounded time, so it is
random whether it will fail or not.

MR lifetime is under the control of the remote side, and the time to
complete the network exchanges required to release the MRs is hard to
bound. So even if I implement SIGIO properly my app will still likely
have random QP failures under various cases and workloads. :(

This is why ODP should be the focus because this cannot work fully
reliably otherwise..

> > Perhaps you might want to initially only support ODP MR mappings with
> > DAX and then the DMA fencing issue goes away?
> 
> I'd rather try to fix the non-ODP DAX case instead of just turning it off.

Well, what about using SIGKILL if the lease-break-time hits? The
kernel will clean up the MRs when the process exits and this will
fence DMA to that memory.

But, still, if you really want to be fine-grained, then I think
invalidating the impacted MRs is a better solution for RDMA than
trying to do it with the IOMMU...

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread
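
To make the SIGIO flow under discussion concrete, a minimal userspace
sketch; it assumes the MAP_DIRECT lease delivers a plain SIGIO and that the
application's job is to quiesce and deregister its MRs before
/proc/sys/fs/lease-break-time expires. ibv_dereg_mr() is the standard verbs
call, everything else is illustrative:

	#include <signal.h>
	#include <infiniband/verbs.h>

	static volatile sig_atomic_t lease_broken;

	static void on_sigio(int sig)
	{
		lease_broken = 1;	/* async-signal-safe: just set a flag */
	}

	/*
	 * Called from the application's event loop: on a lease break, stop
	 * posting new work, drain outstanding local and remote users of the
	 * MR, then deregister it -- all before the lease-break timeout, or
	 * the kernel revokes DMA access underneath the HCA.
	 */
	static int check_lease(struct ibv_mr *mr)
	{
		if (!lease_broken)
			return 0;
		/* ... quiesce local and remote users of mr here ... */
		return ibv_dereg_mr(mr);
	}

Registration would be a plain signal(SIGIO, on_sigio) at startup.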

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-10 17:39               ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-10 17:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:
>
>> > I don't think this has ever come up in the context of an all-device MR
>> > invalidate requirement. Drivers already have code to invalidate
>> > specific MRs, but to find all MRs that touch certain pages and then
>> > invalidate them would be new code.
>> >
>> > We also have ODP aware drivers that can retarget a MR to new
>> > physical pages. If the block map changes DAX should synchronously
>> > retarget the ODP MR, not halt DMA.
>>
>> Have a look at the patch [1], I don't touch the ODP path.
>
> But, does ODP work OK already? I'm not clear on that..

It had better. If the mapping is invalidated I would hope that
generates an I/O fault that gets handled by the driver to set up the new
mapping. I don't see how it can work otherwise.

>> > Most likely ODP & DAX would need to be used together to get robust
>> > user applications, as having the user QP's go to an error state at
>> > random times (due to DMA failures) during operation is never going to
>> > be acceptable...
>>
>> It's not random. The process that set up the mapping and registered
>> the memory gets SIGIO when someone else tries to modify the file map.
>> That process then gets /proc/sys/fs/lease-break-time seconds to fix
>> the problem before the kernel force revokes the DMA access.
>
> Well, the process can't fix the problem in bounded time, so it is
> random whether it will fail or not.
>
> MR lifetime is under the control of the remote side, and the time to
> complete the network exchanges required to release the MRs is hard to
> bound. So even if I implement SIGIO properly my app will still likely
> have random QP failures under various cases and workloads. :(
>
> This is why ODP should be the focus because this cannot work fully
> reliably otherwise..

The lease break time is configurable. If the application can't
respond to a stop request within a timeout of its own choosing, then it
should not be using DAX mappings.

>
>> > Perhaps you might want to initially only support ODP MR mappings with
>> > DAX and then the DMA fencing issue goes away?
>>
>> I'd rather try to fix the non-ODP DAX case instead of just turning it off.
>
> Well, what about using SIGKILL if the lease-break-time hits? The
> kernel will clean up the MRs when the process exits and this will
> fence DMA to that memory.

Can you point me to where the MR cleanup code fences DMA and quiesces
the device?

> But, still, if you really want to be fine-grained, then I think
> invalidating the impacted MRs is a better solution for RDMA than
> trying to do it with the IOMMU...

If there's a better routine for handling ib_umem_lease_break() I'd
love to use it. Right now I'm reaching for the only tool I know for
kernel enforced revocation of DMA access.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-10 18:05                 ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-10 18:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe

> >> Have a look at the patch [1], I don't touch the ODP path.
> >
> > But, does ODP work OK already? I'm not clear on that..
> 
> It had better. If the mapping is invalidated I would hope that
> generates an I/O fault that gets handled by the driver to set up the new
> mapping. I don't see how it can work otherwise.

I would assume so too...

> > This is why ODP should be the focus because this cannot work fully
> > reliably otherwise..
> 
> The lease break time is configurable. If the application can't
> respond to a stop request within a timeout of its own choosing, then it
> should not be using DAX mappings.

Well, no RDMA application can really do this, unless you set the
timeout to multiple minutes, on par with network timeouts.

Again, these details are why I think this kind of DAX with non-ODP MRs
is probably not practical for a production system. Great
for testing of course, but in that case SIGKILL would be fine too...

> > Well, what about using SIGKILL if the lease-break-time hits? The
> > kernel will clean up the MRs when the process exits and this will
> > fence DMA to that memory.
> 
> Can you point me to where the MR cleanup code fences DMA and quiesces
> the device?

Yes. The MRs are associated with an fd. When the fd is closed
ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
all the objects, including MRs, and deletes them.

The specification for deleting a MR requires a synchronous fence with
the hardware. After MR deletion the hardware will not DMA to any pages
described by the old MR, and those pages will be unpinned.

> But, still, if you really want to be fine-grained, then I think
> invalidating the impacted MRs is a better solution for RDMA than
> > trying to do it with the IOMMU...
> 
> If there's a better routine for handling ib_umem_lease_break() I'd
> love to use it. Right now I'm reaching for the only tool I know for
> kernel enforced revocation of DMA access.

Well, you'd have to code something in the MR code to keep track of DAX
MRs and issue an out of band invalidate to impacted MRs to create the
fence.

This probably needs some driver work, I'm not sure if all the hardware
can do out of band invalidate to any MR or not..

Generally speaking, in RDMA, when a new feature like this comes along
we have to push a lot of the work down to the driver authors, and the
approach has historically been that new features only work on some
hardware (as much as I dislike this, it is pragmatic)

So, not being able to support DAX on certain RDMA hardware is not
an unreasonable situation in our space.

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread
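
A hypothetical shape for the out-of-band MR invalidate discussed above; no
such op exists in the tree, the fragment only shows where a per-driver hook
would sit and how core code might call it:

	#include <rdma/ib_verbs.h>

	/*
	 * Hypothetical driver op: fence DMA through one MR without tearing
	 * down the whole ucontext. Hardware that cannot invalidate an
	 * arbitrary MR would leave this NULL and be excluded from DAX.
	 */
	struct ib_dax_ops {
		int (*invalidate_mr)(struct ib_mr *mr);
	};

	static int ib_dax_fence_mr(const struct ib_dax_ops *ops,
				   struct ib_mr *mr)
	{
		if (!ops->invalidate_mr)
			return -EOPNOTSUPP;
		return ops->invalidate_mr(mr); /* synchronous HW fence */
	}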

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-10 20:17                   ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-10 20:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Tue, Oct 10, 2017 at 11:05 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
>> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe
>
>> >> Have a look at the patch [1], I don't touch the ODP path.
>> >
>> > But, does ODP work OK already? I'm not clear on that..
>>
>> It had better. If the mapping is invalidated I would hope that
>> generates an I/O fault that gets handled by the driver to set up the new
>> mapping. I don't see how it can work otherwise.
>
> I would assume so too...
>
>> > This is why ODP should be the focus because this cannot work fully
>> > reliably otherwise..
>>
>> The lease break time is configurable. If the application can't
>> respond to a stop request within a timeout of its own choosing, then it
>> should not be using DAX mappings.
>
> Well, no RDMA application can really do this, unless you set the
> timeout to multiple minutes, on par with network timeouts.

The default lease break timeout is 45 seconds on my system, so minutes
does not seem out of the question.

Also keep in mind that what triggers the lease break is another
application trying to write or punch holes in a file that is mapped
for RDMA. So, if the hardware can't handle the iommu mapping getting
invalidated asynchronously and the application can't react within the
lease break timeout period, then the administrator should arrange for
the file not to be written or truncated while it is mapped.

It's already the case that get_user_pages() does not lock down file
associations, so if your application is contending with these types of
file changes it likely already has a problem keeping transactions in
sync with the file state even without DAX.

> Again, these details are why I think this kind of DAX with non-ODP MRs
> is probably not practical for a production system. Great
> for testing of course, but in that case SIGKILL would be fine too...
>
>> > Well, what about using SIGKILL if the lease-break-time hits? The
>> > kernel will clean up the MRs when the process exits and this will
>> > fence DMA to that memory.
>>
>> Can you point me to where the MR cleanup code fences DMA and quiesces
>> the device?
>
> Yes. The MRs are associated with an fd. When the fd is closed
> ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
> all the objects, including MRs, and deletes them.
>
> The specification for deleting a MR requires a synchronous fence with
> the hardware. After MR deletion the hardware will not DMA to any pages
> described by the old MR, and those pages will be unpinned.
>
>> > But, still, if you really want to be fine-grained, then I think
>> > invalidating the impacted MRs is a better solution for RDMA than
>> > trying to do it with the IOMMU...
>>
>> If there's a better routine for handling ib_umem_lease_break() I'd
>> love to use it. Right now I'm reaching for the only tool I know for
>> kernel enforced revocation of DMA access.
>
> Well, you'd have to code something in the MR code to keep track of DAX
> MRs and issue an out of band invalidate to impacted MRs to create the
> fence.
>
> This probably needs some driver work, I'm not sure if all the hardware
> can do out of band invalidate to any MR or not..

Ok.

>
> Generally speaking, in RDMA, when a new feature like this comes along
> we have to push a lot of the work down to the driver authors, and the
> approach has historically been that new features only work on some
> hardware (as much as I dislike this, it is pragmatic)
>
> So, not being able to support DAX on certain RDMA hardware is not
> an unreasonable situation in our space.

That makes sense, but it still seems to me that this proposed solution
allows more than enough ways to avoid the worst-case scenario where
hardware reacts badly to iommu invalidation. Drivers that can do
better than iommu invalidation can arrange for a callback to do their
driver-specific action at lease break time. Hardware that can't should
be blacklisted from supporting DAX altogether. In other words this is
a starting point to incrementally enhance or disable specific drivers,
but with the assurance that the kernel can always do the safe thing
when/if the driver is missing a finer-grained solution.

^ permalink raw reply	[flat|nested] 158+ messages in thread
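
As a reminder of what starts the clock here, the lease break fires when a
second process modifies the block map, for example by punching a hole. A
minimal sketch of that writer side, assuming a DAX file that another process
has mapped with MAP_DIRECT:

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Punching a hole in a MAP_DIRECT-mapped file makes the kernel send
	 * SIGIO to the lease holder; the call completes only after the lease
	 * is surrendered or /proc/sys/fs/lease-break-time expires and the
	 * kernel force-revokes DMA access.
	 */
	static int punch_hole(int fd, off_t off, off_t len)
	{
		return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 off, len);
	}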

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-12 18:27                     ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-12 18:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:

> Also keep in mind that what triggers the lease break is another
> application trying to write or punch holes in a file that is mapped
> for RDMA. So, if the hardware can't handle the iommu mapping getting
> invalidated asynchronously and the application can't react within the
> lease break timeout period, then the administrator should arrange for
> the file not to be written or truncated while it is mapped.

That makes sense, but why not return ENOSYS or something to the app
trying to alter the file if the RDMA hardware can't support this,
instead of having the RDMA app deal with this lease-break weirdness?

> It's already the case that get_user_pages() does not lock down file
> associations, so if your application is contending with these types of
> file changes it likely already has a problem keeping transactions in
> sync with the file state even without DAX.

Yes, things go weird in non-ODP RDMA cases like this..

Also, just to be clear, I would expect an app using the SIGIO interface
to basically halt ongoing RDMA, wait for MRs to become unused locally
and remotely, destroy the MRs, then somehow, establish new MRs that
cover the same logical map (eg what ODP would do transparently) after
the lease breaker has made their changes, then restart their IO.

Does your SIGIO approach have a race-free way to do those last steps?

> > So, not being able to support DAX on certain RDMA hardware is not
> > an unreasonable situation in our space.
> 
> That makes sense, but it still seems to me that this proposed solution
> allows more than enough ways to avoid that worst case scenario where
> hardware reacts badly to iommu invalidation.

Yes, although I am concerned that returning PCI-E errors is such an
unusual and untested path for some of our RDMA drivers that they may
malfunction badly...

Again, going back to the question of who would ever use this, I would
be very reluctant to deploy a production configuration relying on the iommu
invalidate or SIGIO techniques, when ODP HW is available and works
flawlessly.

> be blacklisted from supporting DAX altogether. In other words this is
> a starting point to incrementally enhance or disable specific drivers,
> but with the assurance that the kernel can always do the safe thing
> when / if the driver is missing a finer grained solution.

Seems reasonable.. I think existing HW will have an easier time adding
invalidate, while new hardware really should implement ODP.

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-12 18:27                     ` Jason Gunthorpe
@ 2017-10-12 20:10                       ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-12 20:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, Dave Chinner, linux-xfs,
	Linux MM, Jeff Moyer, Linux API, linux-fsdevel, Ross Zwisler,
	David Woodhouse, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:
>
>> Also keep in mind that what triggers the lease break is another
>> application trying to write or punch holes in a file that is mapped
>> for RDMA. So, if the hardware can't handle the iommu mapping getting
>> invalidated asynchronously and the application can't react in the
>> lease break timeout period then the administrator should arrange for
>> the file to not be written or truncated while it is mapped.
>
> That makes sense, but why not return ENOSYS or something to the app
> trying to alter the file if the RDMA hardware can't support this
> instead of having the RDMA app deal with this lease break weirdness?

That's where I started, an inode flag that said "hands off, this file
is busy", but Christoph pointed out that we should reuse the same
mechanisms that pnfs is using. The pnfs protection scheme uses file
leases, and once the kernel decides that a lease needs to be broken /
layout needs to be recalled there is no stopping it, only delaying.

>> It's already the case that get_user_pages() does not lock down file
>> associations, so if your application is contending with these types of
>> file changes it likely already has a problem keeping transactions in
>> sync with the file state even without DAX.
>
> Yes, things go weird in non-ODP RDMA cases like this..
>
> Also, just to be clear, I would expect an app using the SIGIO interface
> to basically halt ongoing RDMA, wait for MRs to become unused locally
> and remotely, destroy the MRs, then somehow, establish new MRs that
> cover the same logical map (eg what ODP would do transparently) after
> the lease breaker has made their changes, then restart their IO.
>
> Does your SIGIO approach have a race-free way to do those last steps?

After the SIGIO it becomes a userspace / driver problem to quiesce
the I/O...

However, chatting this over with a few more people I have an alternate
solution that effectively behaves the same as how non-ODP hardware
handles this case of hole punch / truncation today. So, today if this
scenario happens on a page-cache backed mapping, the file blocks are
unmapped and the RDMA continues into pinned pages that are no longer
part of the file. We can achieve the same thing with the iommu, just
re-target the I/O into memory that isn't part of the file. That way
hardware does not see I/O errors and the DAX data consistency model is
no worse than the page-cache case.

>> > So, not being able to support DAX on certain RDMA hardware is not
>> > an unreasonable situation in our space.
>>
>> That makes sense, but it still seems to me that this proposed solution
>> allows more than enough ways to avoid that worst case scenario where
>> hardware reacts badly to iommu invalidation.
>
> Yes, although I am concerned that returning PCI-E errors is such an
> unusual and untested path for some of our RDMA drivers that they may
> malfunction badly...
>
> Again, going back to the question of who would ever use this, I would
> be very reluctant to deploy a production configuration relying on the iommu
> invalidate or SIGIO techniques, when ODP HW is available and works
> flawlessly.

I don't think it is reasonable to tell people you need to throw away
your old hardware just because you want to target a DAX mapping.

>> be blacklisted from supporting DAX altogether. In other words this is
>> a starting point to incrementally enhance or disable specific drivers,
>> but with the assurance that the kernel can always do the safe thing
>> when / if the driver is missing a finer grained solution.
>
> Seems reasonable.. I think existing HW will have an easier time adding
> invalidate, while new hardware really should implement ODP.
>

Yeah, so if we go with 'remap' instead of 'invalidate' does that
address your concerns?

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
  2017-10-12 20:10                       ` Dan Williams
@ 2017-10-13  6:50                         ` Christoph Hellwig
  0 siblings, 0 replies; 158+ messages in thread
From: Christoph Hellwig @ 2017-10-13  6:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, Jason Gunthorpe, Linux MM, Linux API,
	linux-fsdevel, David Woodhouse, Christoph Hellwig,
	Marek Szyprowski

On Thu, Oct 12, 2017 at 01:10:33PM -0700, Dan Williams wrote:
> On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:
> >
> >> Also keep in mind that what triggers the lease break is another
> >> application trying to write or punch holes in a file that is mapped
> >> for RDMA. So, if the hardware can't handle the iommu mapping getting
> >> invalidated asynchronously and the application can't react in the
> >> lease break timeout period then the administrator should arrange for
> >> the file to not be written or truncated while it is mapped.
> >
> > That makes sense, but why not return ENOSYS or something to the app
> > trying to alter the file if the RDMA hardware can't support this
> > instead of having the RDMA app deal with this lease break weirdness?
> 
> That's where I started, an inode flag that said "hands off, this file
> is busy", but Christoph pointed out that we should reuse the same
> mechanisms that pnfs is using. The pnfs protection scheme uses file
> leases, and once the kernel decides that a lease needs to be broken /
> layout needs to be recalled there is no stopping it, only delaying.

That was just a suggestion - the important statement is that a hands-off
flag is just a no-go.

> However, chatting this over with a few more people I have an alternate
> solution that effectively behaves the same as how non-ODP hardware
> handles this case of hole punch / truncation today. So, today if this
> scenario happens on a page-cache backed mapping, the file blocks are
> unmapped and the RDMA continues into pinned pages that are no longer
> part of the file. We can achieve the same thing with the iommu, just
> re-target the I/O into memory that isn't part of the file. That way
> hardware does not see I/O errors and the DAX data consistency model is
> no worse than the page-cache case.

Yikes.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-13  7:09           ` Christoph Hellwig
  0 siblings, 0 replies; 158+ messages in thread
From: Christoph Hellwig @ 2017-10-13  7:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, linux-nvdimm, Jan Kara, Ashok Raj, Darrick J. Wong,
	linux-rdma, Greg Kroah-Hartman, Joerg Roedel, Dave Chinner,
	linux-xfs, Linux MM, Jeff Moyer, Linux API, linux-fsdevel,
	Ross Zwisler, David Woodhouse, Robin Murphy, Christoph Hellwig,
	Marek Szyprowski

On Mon, Oct 09, 2017 at 01:18:20PM -0600, Jason Gunthorpe wrote:
> > > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > > instead of requiring an IOMMU to do it? RDMA MRs are finer grained and
> > > do not suffer from the re-use problem David W. brought up with IOVAs..
> > 
> > Sounds promising. All I want in the end is to be sure that the kernel
> > is enabled to stop any in-flight RDMA at will without asking
> > userspace. Does this require per-RDMA driver opt-in or is there a
> > common call that can be made?
> 
> I don't think this has ever come up in the context of an all-device MR
> invalidate requirement. Drivers already have code to invalidate
> specific MRs, but to find all MRs that touch certain pages and then
> invalidate them would be new code.

The whole point is that we should not need that IFF we provide the
right interface.

If we have a new 'register memory with a lease', the driver (or in fact
probably the umem core for the drivers using it) has the lease associated
with the ib_umem structure, which will just need a backpointer from the
ib_umem to the MR to unregister it.

Which might be a good opportunity to break the user MR from the in-kernel
ones and merge it with ib_umem, but that's a different story..

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-13 15:03                           ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2017-10-13 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	David Woodhouse, Marek Szyprowski

On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote:

> > However, chatting this over with a few more people I have an alternate
> > solution that effectively behaves the same as how non-ODP hardware
> > handles this case of hole punch / truncation today. So, today if this
> > scenario happens on a page-cache backed mapping, the file blocks are
> > unmapped and the RDMA continues into pinned pages that are no longer
> > part of the file. We can achieve the same thing with the iommu, just
> > re-target the I/O into memory that isn't part of the file. That way
> > hardware does not see I/O errors and the DAX data consistency model is
> > no worse than the page-cache case.
> 
> Yikes.

Well, as much as you say Yikes, Dan is correct, this does match the
semantics RDMA MR's already have. They become non-coherent if their
underlying object is changed, and there are many ways to get there.
I've never thought about it, but it does sound like ftruncate,
fallocate, etc on a normal file would break the MR coherency too??

There have been efforts in the past driven by the MPI people to
create, essentially, something like a 'lease-break' SIGIO. Except it was
intended to be general, and wanted to solve all the problems related to
MR de-coherence. This was complicated and never became acceptable to
mainline.

Instead ODP was developed, and ODP actually solves all the problems
sanely.

Thinking about it some more, and with your other comments on
get_user_pages in this thread, I tend to agree. It doesn't make sense
to develop a user space lease break API for MR's that is a DAX
specific feature.

Along the same lines, it also doesn't make sense to force-invalidate
MR's linked to DAX regions, while leaving MR's linked to other
regions that have the same problem alone.

If you want to make non-ODP MR's work better, then you need to have a
general overall solution to tell userspace when the MR becomes (or I
guess, is becoming) non-coherent, that covers all the cases that break
MR coherence, not just via DAX.

Otherwise, I think Dan is right, keeping the current semantic of
having MRs just do something wrong, but not corrupt memory, when they
lose coherence, is broadly consistent with how non-ODP MRs work today.

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-15 15:14                             ` Matan Barak
  0 siblings, 0 replies; 158+ messages in thread
From: Matan Barak @ 2017-10-15 15:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ashok Raj, Darrick J. Wong, linux-rdma,
	Greg Kroah-Hartman, Joerg Roedel, linux-nvdimm, Dave Chinner,
	Robin Murphy, linux-xfs, Linux MM, Linux API, linux-fsdevel,
	Liran Liss, David Woodhouse, Christoph Hellwig, Marek Szyprowski

On Fri, Oct 13, 2017 at 6:03 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote:
>
>> > However, chatting this over with a few more people I have an alternate
>> > solution that effectively behaves the same as how non-ODP hardware
>> > handles this case of hole punch / truncation today. So, today if this
>> > scenario happens on a page-cache backed mapping, the file blocks are
>> > unmapped and the RDMA continues into pinned pages that are no longer
>> > part of the file. We can achieve the same thing with the iommu, just
>> > re-target the I/O into memory that isn't part of the file. That way
>> > hardware does not see I/O errors and the DAX data consistency model is
>> > no worse than the page-cache case.
>>
>> Yikes.
>
> Well, as much as you say Yikes, Dan is correct, this does match the
> semantics RDMA MR's already have. They become non-coherent if their
> underlying object is changed, and there are many ways to get there.
> I've never thought about it, but it does sound like ftruncate,
> fallocate, etc on a normal file would break the MR coherency too??
>
> There have been efforts in the past driven by the MPI people to
> create, essentially, something like a 'lease-break' SIGIO. Except it was
> intended to be general, and wanted to solve all the problems related to
> MR de-coherence. This was complicated and never became acceptable to
> mainline.
>
> Instead ODP was developed, and ODP actually solves all the problems
> sanely.
>
> Thinking about it some more, and with your other comments on
> get_user_pages in this thread, I tend to agree. It doesn't make sense
> to develop a user space lease break API for MR's that is a DAX
> specific feature.
>
> Along the same lines, it also doesn't make sense to force-invalidate
> MR's linked to DAX regions, while leaving MR's linked to other
> regions that have the same problem alone.
>
> If you want to make non-ODP MR's work better, then you need to have a
> general overall solution to tell userspace when the MR becomes (or I
> guess, is becoming) non-coherent, that covers all the cases that break
> MR coherence, not just via DAX.
>
> Otherwise, I think Dan is right, keeping the current semantic of
> having MRs just do something wrong, but not corrupt memory, when they
> lose coherence, is broadly consistent with how non-ODP MRs work today.
>

I agree, keeping the current semantics is probably the best thing we
could do. It's a trade-off between breaking existing applications,
having a new lease API for DAX or just failing DAX in particular (as
opposed to other cases). For stable mappings, what we have is probably
sufficient. For mappings which could be changed, it's unclear to me
how you could guarantee non-racy behavior that is bounded by a
pre-defined time and guarantee no user-space errors. On top of that,
ODP should already solve that problem transparently.

IMHO, using iommu for that and causing DMA errors just because the
lease broke isn't the right thing to do.

> Jason

Matan

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
@ 2017-10-15 15:21                               ` Dan Williams
  0 siblings, 0 replies; 158+ messages in thread
From: Dan Williams @ 2017-10-15 15:21 UTC (permalink / raw)
  To: Matan Barak
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-nvdimm, Jan Kara,
	Ashok Raj, Darrick J. Wong, linux-rdma, Greg Kroah-Hartman,
	Joerg Roedel, Dave Chinner, linux-xfs, Linux MM, Jeff Moyer,
	Linux API, linux-fsdevel, Ross Zwisler, David Woodhouse,
	Robin Murphy, Marek Szyprowski, Liran Liss

On Sun, Oct 15, 2017 at 8:14 AM, Matan Barak <matanb@dev.mellanox.co.il> wrote:
[..]
> IMHO, using iommu for that and causing DMA errors just because the
> lease broke isn't the right thing to do.

Yes, see the current proposal over in this thread:

https://lists.01.org/pipermail/linux-nvdimm/2017-October/012885.html

^ permalink raw reply	[flat|nested] 158+ messages in thread

end of thread

Thread overview: 158+ messages
-- links below jump to the message on this page --
2017-10-06 22:35 [PATCH v7 00/12] MAP_DIRECT for DAX RDMA and userspace flush Dan Williams
2017-10-06 22:35 ` [PATCH v7 01/12] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags Dan Williams
2017-10-06 22:35 ` [PATCH v7 02/12] fs, mm: pass fd to ->mmap_validate() Dan Williams
2017-10-06 22:35 ` [PATCH v7 03/12] fs: introduce i_mapdcount Dan Williams
2017-10-09  3:08   ` Dave Chinner
2017-10-06 22:35 ` [PATCH v7 04/12] fs: MAP_DIRECT core Dan Williams
2017-10-06 22:35 ` [PATCH v7 05/12] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT Dan Williams
2017-10-06 22:35 ` [PATCH v7 06/12] xfs: wire up MAP_DIRECT Dan Williams
2017-10-09  3:40   ` Dave Chinner
2017-10-09 17:08     ` Dan Williams
2017-10-09 22:50       ` Dave Chinner
2017-10-06 22:35 ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Dan Williams
2017-10-06 22:45   ` David Woodhouse
2017-10-06 22:52     ` Dan Williams
2017-10-06 23:10       ` David Woodhouse
2017-10-06 23:15         ` Dan Williams
2017-10-07 11:08           ` David Woodhouse
2017-10-07 23:33             ` Dan Williams
2017-10-06 23:12       ` Dan Williams
2017-10-08  3:45   ` [PATCH v8] dma-mapping: introduce dma_get_iommu_domain() Dan Williams
2017-10-09 10:37     ` Robin Murphy
2017-10-09 17:32       ` Dan Williams
2017-10-10 14:40     ` Raj, Ashok
2017-10-09 18:58   ` [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu() Jason Gunthorpe
2017-10-09 19:05     ` Dan Williams
2017-10-09 19:18       ` Jason Gunthorpe
2017-10-09 19:28         ` Dan Williams
2017-10-10 17:25           ` Jason Gunthorpe
2017-10-10 17:39             ` Dan Williams
2017-10-10 18:05               ` Jason Gunthorpe
2017-10-10 20:17                 ` Dan Williams
2017-10-12 18:27                   ` Jason Gunthorpe
2017-10-12 20:10                     ` Dan Williams
2017-10-13  6:50                       ` Christoph Hellwig
2017-10-13 15:03                         ` Jason Gunthorpe
2017-10-15 15:14                           ` Matan Barak
2017-10-15 15:21                             ` Dan Williams
2017-10-13  7:09         ` Christoph Hellwig
2017-10-06 22:36 ` [PATCH v7 08/12] fs, mapdirect: introduce ->lease_direct() Dan Williams
2017-10-06 22:36 ` [PATCH v7 09/12] xfs: wire up ->lease_direct() Dan Williams
2017-10-09  3:45   ` Dave Chinner
2017-10-09 17:10     ` Dan Williams
2017-10-06 22:36 ` [PATCH v7 10/12] device-dax: " Dan Williams
2017-10-06 22:36 ` [PATCH v7 11/12] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams
2017-10-08  4:02   ` [PATCH v8 1/2] iommu: up-level sg_num_pages() from amd-iommu Dan Williams
2017-10-08  4:04   ` [PATCH v8 2/2] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings Dan Williams
2017-10-08  6:45     ` kbuild test robot
2017-10-08 15:49       ` Dan Williams
2017-10-06 22:36 ` [PATCH v7 12/12] tools/testing/nvdimm: enable rdma unit tests Dan Williams
