* [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
From: Dan Williams @ 2017-10-12  0:47 UTC
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, linux-api, Dave Chinner, linux-xfs, linux-mm,
	Al Viro, Andy Lutomirski, Jeff Layton, linux-fsdevel,
	Linus Torvalds, Christoph Hellwig

Changes since v8 [1]:
* Move MAP_SHARED_VALIDATE definition next to MAP_SHARED in all arch
  headers (Jan)

* Include xfs_layout.h directly in all the files that call
  xfs_break_layouts() (Dave)

* Clarify / add more comments to the MAP_DIRECT checks at fault time
  (Dave)

* Rename iomap_can_allocate() to break_layout_nowait() to make it plain
  why we are bailing out of iomap_begin().

* Defer the lease_direct mechanism and RDMA core changes to a later
  patch series.

* EXT4 support is in the works and will be rebased on Jan's MAP_SYNC
  patches.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012772.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the file's block map, or
otherwise dirty its block-map metadata, without first notifying the
application. It supports a "flush from userspace" model in which
persistent memory applications can bypass the overhead of ongoing write
coordination with the filesystem, and it provides safety for RDMA
operations involving DAX mappings.

The kernel always retains the ability to revoke access and convert the
file back to normal operation after performing a "lease break". As with
fcntl leases, there is no way for userspace to cancel the lease-break
process once it has started; it can only be delayed via the
/proc/sys/fs/lease-break-time setting.
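
For reference, MAP_DIRECT revocation follows the same flow as an
existing fcntl(2) write lease. The sketch below uses only the stock
F_SETLEASE API, nothing MAP_DIRECT-specific, and the /mnt/dax/file path
is purely illustrative: the holder is notified via SIGIO and must give
the lease up before lease-break-time expires.

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken;

static void on_lease_break(int sig)
{
        (void)sig;
        lease_broken = 1;       /* the kernel has started reclaiming the lease */
}

int main(void)
{
        int fd = open("/mnt/dax/file", O_RDWR); /* hypothetical path */

        if (fd < 0)
                return 1;
        signal(SIGIO, on_lease_break);          /* default lease-break signal */
        if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0)
                return 1;

        while (!lease_broken)
                pause();        /* do work until the kernel breaks the lease */

        /*
         * The break cannot be cancelled; the holder has at most
         * /proc/sys/fs/lease-break-time seconds to release the lease.
         */
        fcntl(fd, F_SETLEASE, F_UNLCK);
        close(fd);
        return 0;
}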

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory, with no ongoing fsync/msync
coordination with the filesystem.

The MAP_DIRECT mechanism is complementary to MAP_SYNC. Here are some
scenarios where you would choose one over the other (a usage sketch
follows the list):

* 3rd party DMA / RDMA to DAX with hardware that does not support
  on-demand paging (shared virtual memory) => MAP_DIRECT

* Support for reflinked inodes, fallocate-punch-hole, truncate, or any
  other operation that mutates the block map of an actively
  mapped file => MAP_SYNC

* Userspace flush => MAP_SYNC or MAP_DIRECT

* Assurances that the file's block-map metadata is stable, i.e. minimizing
  worst-case fault latency by locking out updates => MAP_DIRECT
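
A rough usage sketch (illustrative only; the MAP_DIRECT value below is a
placeholder and the map_pmem() helper is hypothetical, a real build
would take the definitions from this series' uapi headers): request a
MAP_DIRECT mapping via MAP_SHARED_VALIDATE and fall back to a plain
MAP_SHARED mapping plus fsync/msync if the kernel or filesystem rejects
the flag.

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x100000     /* placeholder value, for illustration only */
#endif

static void *map_pmem(int fd, size_t len, int *direct)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);

        if (p != MAP_FAILED) {
                *direct = 1;    /* block map is pinned; flush stores from userspace */
                return p;
        }
        if (errno != EOPNOTSUPP && errno != EINVAL)
                return MAP_FAILED;

        /* Older kernel / unsupported fs: keep using fsync()/msync() */
        *direct = 0;
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}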

---

Dan Williams (6):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      fs, xfs, iomap: introduce break_layout_nowait()
      xfs: wire up MAP_DIRECT


 arch/alpha/include/uapi/asm/mman.h           |    1 
 arch/mips/include/uapi/asm/mman.h            |    1 
 arch/mips/kernel/vdso.c                      |    2 
 arch/parisc/include/uapi/asm/mman.h          |    1 
 arch/tile/mm/elf.c                           |    3 
 arch/x86/mm/mpx.c                            |    3 
 arch/xtensa/include/uapi/asm/mman.h          |    1 
 fs/Kconfig                                   |    1 
 fs/Makefile                                  |    2 
 fs/aio.c                                     |    2 
 fs/mapdirect.c                               |  237 ++++++++++++++++++++++++++
 fs/xfs/Kconfig                               |    4 
 fs/xfs/Makefile                              |    1 
 fs/xfs/xfs_file.c                            |  108 ++++++++++++
 fs/xfs/xfs_ioctl.c                           |    1 
 fs/xfs/xfs_iomap.c                           |    3 
 fs/xfs/xfs_iops.c                            |    1 
 fs/xfs/xfs_layout.c                          |   45 +++++
 fs/xfs/xfs_layout.h                          |   13 +
 fs/xfs/xfs_pnfs.c                            |   31 ---
 fs/xfs/xfs_pnfs.h                            |    8 -
 include/linux/fs.h                           |   11 +
 include/linux/mapdirect.h                    |   40 ++++
 include/linux/mm.h                           |    9 +
 include/linux/mman.h                         |   42 +++++
 include/uapi/asm-generic/mman-common.h       |    1 
 include/uapi/asm-generic/mman.h              |    1 
 ipc/shm.c                                    |    3 
 mm/internal.h                                |    2 
 mm/mmap.c                                    |   28 ++-
 mm/nommu.c                                   |    5 -
 mm/util.c                                    |    7 -
 tools/include/uapi/asm-generic/mman-common.h |    1 
 33 files changed, 557 insertions(+), 62 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h

* [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
From: Dan Williams @ 2017-10-12  0:47 UTC
  To: linux-nvdimm
  Cc: Jan Kara, Arnd Bergmann, linux-api, linux-xfs, linux-mm,
	Andy Lutomirski, linux-fsdevel, Andrew Morton, Linus Torvalds,
	Christoph Hellwig

The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
that lack support for it. Define a new MAP_SHARED_VALIDATE flag pattern
that is guaranteed to fail on all legacy mmap implementations.
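
A minimal userspace sketch of what that guarantee buys (the helper below
is illustrative, not part of this patch): on a kernel with this change,
MAP_SHARED_VALIDATE with only legacy flags behaves like MAP_SHARED,
while a legacy kernel rejects the unknown map type with EINVAL, so
support can be probed before relying on any new validated flag.

#include <errno.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

/* Returns 1 if MAP_SHARED_VALIDATE is understood, 0 on a legacy kernel. */
static int have_map_shared_validate(int fd)
{
        void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED_VALIDATE, fd, 0);

        if (p == MAP_FAILED)
                return errno == EINVAL ? 0 : -1;        /* EINVAL => legacy kernel */
        munmap(p, 4096);
        return 1;
}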

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs, Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
		    | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
		    | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

Similar to ->fallocate(), we also want the ability to validate support
for new flags on a per 'struct file_operations' instance basis. Towards
that end, arrange for flags to be validated by a new ->mmap_validate()
operation in 'struct file_operations'. By default all existing flags are
implicitly supported, but new flags require MAP_SHARED_VALIDATE and a
per-instance opt-in.
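
For reference, a hedged sketch of the opt-in side for a driver or
filesystem (MAP_FOO and the foo_* identifiers are hypothetical and not
part of this series): unknown flags are refused with -EOPNOTSUPP, and
recognized ones are handled before falling back to the regular ->mmap()
path.

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>

#define MAP_FOO 0x200000        /* hypothetical new mmap flag */

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* normal ->mmap() setup would go here */
        return 0;
}

static int foo_mmap_validate(struct file *file, struct vm_area_struct *vma,
                unsigned long map_flags)
{
        if (map_flags & ~(LEGACY_MAP_MASK | MAP_FOO))
                return -EOPNOTSUPP;     /* unknown flag: refuse the mapping */
        if (map_flags & MAP_FOO) {
                /* arm whatever per-vma semantics MAP_FOO implies */
        }
        return foo_mmap(file, vma);
}

static const struct file_operations foo_fops = {
        .mmap           = foo_mmap,
        .mmap_validate  = foo_mmap_validate,
};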

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/alpha/include/uapi/asm/mman.h           |    1 +
 arch/mips/include/uapi/asm/mman.h            |    1 +
 arch/mips/kernel/vdso.c                      |    2 +
 arch/parisc/include/uapi/asm/mman.h          |    1 +
 arch/tile/mm/elf.c                           |    3 +-
 arch/xtensa/include/uapi/asm/mman.h          |    1 +
 include/linux/fs.h                           |    2 +
 include/linux/mm.h                           |    2 +
 include/linux/mman.h                         |   39 ++++++++++++++++++++++++++
 include/uapi/asm-generic/mman-common.h       |    1 +
 mm/mmap.c                                    |   21 ++++++++++++--
 tools/include/uapi/asm-generic/mman-common.h |    1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..f85f18ffbf8c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED	0x100		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..054314bb062a 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -28,6 +28,7 @@
  */
 #define MAP_SHARED	0x001		/* Share changes */
 #define MAP_PRIVATE	0x002		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
 
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL);
+			   0, NULL, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..a66fdb9c4b6d 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x03		/* Mask for type of mapping */
 #define MAP_FIXED	0x04		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x10		/* don't use a file */
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 889901824400..5ffcbe76aef9 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		unsigned long addr = MEM_USER_INTRPT;
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
-				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL);
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
+				   NULL, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..875b0e6f7499 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -35,6 +35,7 @@
  */
 #define MAP_SHARED	0x001		/* Share changes */
 #define MAP_PRIVATE	0x002		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x00f		/* Mask for type of mapping */
 #define MAP_FIXED	0x010		/* Interpret addr exactly */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..5aee97d64cae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,8 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	int (*mmap_validate) (struct file *, struct vm_area_struct *,
+			unsigned long);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..38f6ed954dde 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, unsigned long map_flags);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..94b63b4d71ff 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,45 @@
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
 
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+		| MAP_PRIVATE \
+		| MAP_FIXED \
+		| MAP_ANONYMOUS \
+		| MAP_DENYWRITE \
+		| MAP_EXECUTABLE \
+		| MAP_UNINITIALIZED \
+		| MAP_GROWSDOWN \
+		| MAP_LOCKED \
+		| MAP_NORESERVE \
+		| MAP_POPULATE \
+		| MAP_NONBLOCK \
+		| MAP_STACK \
+		| MAP_HUGETLB \
+		| MAP_32BIT \
+		| MAP_HUGE_2MB \
+		| MAP_HUGE_1GB)
+
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..debd98c2eb83 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -16,6 +16,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..2649c00581a0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		struct inode *inode = file_inode(file);
 
 		switch (flags & MAP_TYPE) {
+		case MAP_SHARED_VALIDATE:
+			if ((flags & ~LEGACY_MAP_MASK) == 0) {
+				/*
+				 * If all legacy mmap flags, downgrade
+				 * to MAP_SHARED, i.e. invoke ->mmap()
+				 * instead of ->mmap_validate()
+				 */
+				flags &= ~MAP_TYPE;
+				flags |= MAP_SHARED;
+			} else if (!file->f_op->mmap_validate)
+				return -EOPNOTSUPP;
+			/* fall through */
 		case MAP_SHARED:
 			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
 				return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long map_flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * new file must not have been exposed to user-space, yet.
 		 */
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+			error = file->f_op->mmap_validate(file, vma, map_flags);
+		else
+			error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
 
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 203268f9231e..debd98c2eb83 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -16,6 +16,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */

@@ -7,6 +7,45 @@
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
 
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+		| MAP_PRIVATE \
+		| MAP_FIXED \
+		| MAP_ANONYMOUS \
+		| MAP_DENYWRITE \
+		| MAP_EXECUTABLE \
+		| MAP_UNINITIALIZED \
+		| MAP_GROWSDOWN \
+		| MAP_LOCKED \
+		| MAP_NORESERVE \
+		| MAP_POPULATE \
+		| MAP_NONBLOCK \
+		| MAP_STACK \
+		| MAP_HUGETLB \
+		| MAP_32BIT \
+		| MAP_HUGE_2MB \
+		| MAP_HUGE_1GB)
+
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..debd98c2eb83 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -16,6 +16,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..2649c00581a0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		struct inode *inode = file_inode(file);
 
 		switch (flags & MAP_TYPE) {
+		case MAP_SHARED_VALIDATE:
+			if ((flags & ~LEGACY_MAP_MASK) == 0) {
+				/*
+				 * If all legacy mmap flags, downgrade
+				 * to MAP_SHARED, i.e. invoke ->mmap()
+				 * instead of ->mmap_validate()
+				 */
+				flags &= ~MAP_TYPE;
+				flags |= MAP_SHARED;
+			} else if (!file->f_op->mmap_validate)
+				return -EOPNOTSUPP;
+			/* fall through */
 		case MAP_SHARED:
 			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
 				return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long map_flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * new file must not have been exposed to user-space, yet.
 		 */
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+			error = file->f_op->mmap_validate(file, vma, map_flags);
+		else
+			error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
 
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 203268f9231e..debd98c2eb83 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -16,6 +16,7 @@
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3		/* share + validate extension flags */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
 #define MAP_ANONYMOUS	0x20		/* don't use a file */

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
  2017-10-12  0:47 ` Dan Williams
@ 2017-10-12  0:47   ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-api, Dave Chinner,
	Christoph Hellwig, linux-xfs, linux-mm, Jeff Moyer,
	linux-fsdevel, Andrew Morton, Ross Zwisler

The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd in order
to set up an fasync_struct for signalling lease-break events to the
lease holder.
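
For context, a purely illustrative userspace sketch of the other side of
that notification: the lease holder is expected to catch the SIGIO
delivered against the fd it passed to mmap() and stop depending on the
MAP_DIRECT guarantees. The handler body and the absence of any siginfo
decoding below are assumptions, not part of this patch:

    #include <signal.h>
    #include <string.h>

    static volatile sig_atomic_t lease_broken;

    static void sigio_handler(int sig, siginfo_t *info, void *ctx)
    {
    	/* the kernel is breaking the lease; stop trusting MAP_DIRECT */
    	lease_broken = 1;
    }

    static int watch_lease_break(void)
    {
    	struct sigaction sa;

    	memset(&sa, 0, sizeof(sa));
    	sigemptyset(&sa.sa_mask);
    	sa.sa_sigaction = sigio_handler;
    	sa.sa_flags = SA_SIGINFO;

    	return sigaction(SIGIO, &sa, NULL);
    }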

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/mips/kernel/vdso.c |    2 +-
 arch/tile/mm/elf.c      |    2 +-
 arch/x86/mm/mpx.c       |    3 ++-
 fs/aio.c                |    2 +-
 include/linux/fs.h      |    2 +-
 include/linux/mm.h      |    9 +++++----
 ipc/shm.c               |    3 ++-
 mm/internal.h           |    2 +-
 mm/mmap.c               |   13 +++++++------
 mm/nommu.c              |    5 +++--
 mm/util.c               |    7 ++++---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-			   0, NULL, 0);
+			   0, NULL, 0, -1);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		addr = mmap_region(NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
 				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-				   NULL, 0);
+				   NULL, 0, -1);
 		if (addr > (unsigned long) -PAGE_SIZE)
 			retval = (int) addr;
 	}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
 	down_write(&mm->mmap_sem);
 	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+			NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (populate)
 		mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused, NULL);
+				       MAP_SHARED, 0, &unused, NULL, -1);
 	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5aee97d64cae..17e0e899e184 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*mmap_validate) (struct file *, struct vm_area_struct *,
-			unsigned long);
+			unsigned long, int);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38f6ed954dde..ec45087348c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf, unsigned long map_flags);
+	struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf);
+	struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf)
+	struct list_head *uf, int fd)
 {
-	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
+	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate,
+			uf, fd);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index badac463e2c8..c98f85f6756d 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1399,7 +1399,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate,
+			NULL, -1);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/internal.h b/mm/internal.h
index 1df011f62480..70ed7b06dd85 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -466,7 +466,7 @@ extern u32 hwpoison_filter_enable;
 
 extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
-        unsigned long, unsigned long);
+        unsigned long, unsigned long, int);
 
 extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
diff --git a/mm/mmap.c b/mm/mmap.c
index 2649c00581a0..a6794670c9cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1322,7 +1322,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
 			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf, int fd)
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
@@ -1477,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags, fd);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1527,7 +1527,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd);
 out_fput:
 	if (file)
 		fput(file);
@@ -1614,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf, unsigned long map_flags)
+		struct list_head *uf, unsigned long map_flags, int fd)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1700,7 +1700,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 */
 		vma->vm_file = get_file(file);
 		if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
-			error = file->f_op->mmap_validate(file, vma, map_flags);
+			error = file->f_op->mmap_validate(file, vma,
+					map_flags, fd);
 		else
 			error = call_mmap(file, vma);
 		if (error)
@@ -2842,7 +2843,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 
 	file = get_file(vma->vm_file);
 	ret = do_mmap_pgoff(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+			prot, flags, pgoff, &populate, NULL, -1);
 	fput(file);
 out:
 	up_write(&mm->mmap_sem);
diff --git a/mm/nommu.c b/mm/nommu.c
index 17c00d93de2e..952d205d3b66 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1206,7 +1206,8 @@ unsigned long do_mmap(struct file *file,
 			vm_flags_t vm_flags,
 			unsigned long pgoff,
 			unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf,
+			int fd)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
@@ -1439,7 +1440,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, fd);
 
 	if (file)
 		fput(file);
diff --git a/mm/util.c b/mm/util.c
index 34e57fae959d..dcf48d929185 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -319,7 +319,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
-	unsigned long flag, unsigned long pgoff)
+	unsigned long flag, unsigned long pgoff, int fd)
 {
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
@@ -331,7 +331,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
-				    &populate, &uf);
+				    &populate, &uf, fd);
 		up_write(&mm->mmap_sem);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
@@ -349,7 +349,8 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(offset)))
 		return -EINVAL;
 
-	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
+	return vm_mmap_pgoff(file, addr, len, prot, flag,
+			offset >> PAGE_SHIFT, -1);
 }
 EXPORT_SYMBOL(vm_mmap);
 

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v9 3/6] fs: MAP_DIRECT core
  2017-10-12  0:47 ` Dan Williams
@ 2017-10-12  0:47   ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Darrick J. Wong, linux-api,
	Dave Chinner, linux-xfs, linux-mm, linux-fsdevel, Jeff Layton,
	Christoph Hellwig

Introduce a set of helper APIs for filesystems to establish FL_LAYOUT
leases that protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate, it does not protect against
block-allocating write-faults, so this relies on i_mapdcount to disable
block map updates from write faults.

Like the pNFS case, MAP_DIRECT does its own timeout of the lease since
we need a process context for running map_direct_invalidate().
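
As a rough sketch of the intended consumer (illustrative only; the
example_* names are hypothetical, MAP_DIRECT itself is defined later in
this series, and the real XFS wiring follows in a later patch), a
filesystem's ->mmap_validate() would register the lease and hand the
returned state to the generic vm_operations helpers via
vma->vm_private_data:

    static int example_mmap_validate(struct file *file,
    		struct vm_area_struct *vma, unsigned long map_flags, int fd)
    {
    	if (map_flags & MAP_DIRECT) {
    		struct map_direct_state *mds = map_direct_register(fd, vma);

    		if (IS_ERR(mds))
    			return PTR_ERR(mds);

    		/* generic_map_direct_{open,close} look the state up here */
    		vma->vm_private_data = mds;
    		vma->vm_ops = &example_direct_vm_ops;
    	}
    	return example_mmap(file, vma);
    }

Here example_direct_vm_ops is assumed to point its .open/.close at
generic_map_direct_open()/generic_map_direct_close() so the vma
reference counting provided by these helpers is honored.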

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig                |    1 
 fs/Makefile               |    2 
 fs/mapdirect.c            |  237 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   40 ++++++++
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
+	depends on FILE_LOCKING
 	depends on !(ARM || MIPS || SPARC)
 	select FS_IOMAP
 	select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index 000000000000..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/mapdirect.h>
+#include <linux/workqueue.h>
+#include <linux/signal.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+	atomic_t mds_ref;
+	atomic_t mds_vmaref;
+	unsigned long mds_state;
+	struct inode *mds_inode;
+	struct delayed_work mds_work;
+	struct fasync_struct *mds_fa;
+	struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+	if (!atomic_dec_and_test(&mds->mds_ref))
+		return;
+	kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+	struct vm_area_struct *vma = mds->mds_vma;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	void *owner = mds;
+
+	if (!atomic_dec_and_test(&mds->mds_vmaref))
+		return;
+
+	/*
+	 * Flush in-flight+forced lm_break events that may be
+	 * referencing this dying vma.
+	 */
+	mds->mds_vma = NULL;
+	set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+	flush_delayed_work(&mds->mds_work);
+	iput(inode);
+
+	put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+	put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+	atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+	get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+	struct map_direct_state *mds;
+	struct vm_area_struct *vma;
+	struct inode *inode;
+	void *owner;
+
+	mds = container_of(work, typeof(*mds), mds_work.work);
+
+	clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+	vma = ACCESS_ONCE(mds->mds_vma);
+	inode = mds->mds_inode;
+	if (vma) {
+		unsigned long len = vma->vm_end - vma->vm_start;
+		loff_t start = (loff_t) vma->vm_pgoff * PAGE_SIZE;
+
+		unmap_mapping_range(inode->i_mapping, start, len, 1);
+	}
+	owner = mds;
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+
+	put_map_direct(mds);
+}
+
+static bool map_direct_lm_break(struct file_lock *fl)
+{
+	struct map_direct_state *mds = fl->fl_owner;
+
+	/*
+	 * Given that we need to take sleeping locks to invalidate the
+	 * mapping we schedule that work with the original timeout set
+	 * by the file-locks core. Then we tell the core to hold off on
+	 * continuing with the lease break until the delayed work
+	 * completes the invalidation and the lease unlock.
+	 *
+	 * Note that this assumes that i_mapdcount is protecting against
+	 * block-map modifying write-faults since we are unable to use
+	 * leases in that path due to locking constraints.
+	 */
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &mds->mds_state)) {
+		schedule_delayed_work(&mds->mds_work, lease_break_time * HZ);
+		kill_fasync(&fl->fl_fasync, SIGIO, POLL_MSG);
+	}
+
+	/* Tell the core lease code to wait for delayed work completion */
+	fl->fl_break_time = 0;
+
+	return false;
+}
+
+static int map_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	WARN_ON(!(arg & F_UNLCK));
+
+	return lease_modify(fl, arg, dispose);
+}
+
+static void map_direct_lm_setup(struct file_lock *fl, void **priv)
+{
+	struct file *file = fl->fl_file;
+	struct map_direct_state *mds = *priv;
+	struct fasync_struct *fa = mds->mds_fa;
+
+	/*
+	 * Comment copied from lease_setup():
+	 * fasync_insert_entry() returns the old entry if any. If there was no
+	 * old entry, then it used "priv" and inserted it into the fasync list.
+	 * Clear the pointer to indicate that it shouldn't be freed.
+	 */
+	if (!fasync_insert_entry(fa->fa_fd, file, &fl->fl_fasync, fa))
+		*priv = NULL;
+
+	__f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+}
+
+static const struct lock_manager_operations map_direct_lm_ops = {
+	.lm_break = map_direct_lm_break,
+	.lm_change = map_direct_lm_change,
+	.lm_setup = map_direct_lm_setup,
+};
+
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
+{
+	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct fasync_struct *fa;
+	struct file_lock *fl;
+	void *owner = mds;
+	int rc = -ENOMEM;
+
+	if (!mds)
+		return ERR_PTR(-ENOMEM);
+
+	mds->mds_vma = vma;
+	atomic_set(&mds->mds_ref, 1);
+	atomic_set(&mds->mds_vmaref, 1);
+	set_bit(MAPDIRECT_VALID, &mds->mds_state);
+	mds->mds_inode = inode;
+	ihold(inode);
+	INIT_DELAYED_WORK(&mds->mds_work, map_direct_invalidate);
+
+	fa = fasync_alloc();
+	if (!fa)
+		goto err_fasync_alloc;
+	mds->mds_fa = fa;
+	fa->fa_fd = fd;
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &map_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = mds;
+	atomic_inc(&mds->mds_ref);
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = mds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	return mds;
+
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	/* if owner is NULL then the lease machinery is responsible for @fa */
+	if (owner)
+		fasync_free(fa);
+err_fasync_alloc:
+	iput(inode);
+	kfree(mds);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_register);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
new file mode 100644
index 000000000000..5491aa550e55
--- /dev/null
+++ b/include/linux/mapdirect.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __MAPDIRECT_H__
+#define __MAPDIRECT_H__
+#include <linux/err.h>
+
+struct inode;
+struct work_struct;
+struct vm_area_struct;
+struct map_direct_state;
+
+#if IS_ENABLED(CONFIG_FS_DAX)
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
+bool test_map_direct_valid(struct map_direct_state *mds);
+void generic_map_direct_open(struct vm_area_struct *vma);
+void generic_map_direct_close(struct vm_area_struct *vma);
+#else
+static inline struct map_direct_state *map_direct_register(int fd,
+		struct vm_area_struct *vma)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline bool test_map_direct_valid(struct map_direct_state *mds)
+{
+	return false;
+}
+#define generic_map_direct_open NULL
+#define generic_map_direct_close NULL
+#endif
+#endif /* __MAPDIRECT_H__ */


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v9 3/6] fs: MAP_DIRECT core
@ 2017-10-12  0:47   ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-xfs, Jan Kara, Darrick J. Wong, linux-api, Dave Chinner,
	Christoph Hellwig, J. Bruce Fields, linux-mm, Jeff Moyer,
	linux-fsdevel, Jeff Layton, Ross Zwisler

Introduce a set of helper apis for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pnfs case MAP_DIRECT does its own timeout of the lease since we
need to have a process context for running map_direct_invalidate().

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig                |    1 
 fs/Makefile               |    2 
 fs/mapdirect.c            |  237 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mapdirect.h |   40 ++++++++
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
+	depends on FILE_LOCKING
 	depends on !(ARM || MIPS || SPARC)
 	select FS_IOMAP
 	select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
-obj-$(CONFIG_FS_DAX)		+= dax.o
+obj-$(CONFIG_FS_DAX)		+= dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index 000000000000..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/mapdirect.h>
+#include <linux/workqueue.h>
+#include <linux/signal.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+	atomic_t mds_ref;
+	atomic_t mds_vmaref;
+	unsigned long mds_state;
+	struct inode *mds_inode;
+	struct delayed_work mds_work;
+	struct fasync_struct *mds_fa;
+	struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+	return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+	if (!atomic_dec_and_test(&mds->mds_ref))
+		return;
+	kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+	struct vm_area_struct *vma = mds->mds_vma;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	void *owner = mds;
+
+	if (!atomic_dec_and_test(&mds->mds_vmaref))
+		return;
+
+	/*
+	 * Flush in-flight+forced lm_break events that may be
+	 * referencing this dying vma.
+	 */
+	mds->mds_vma = NULL;
+	set_bit(MAPDIRECT_BREAK, &mds->mds_state);
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+	flush_delayed_work(&mds->mds_work);
+	iput(inode);
+
+	put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+	put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+	atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+	get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+	struct map_direct_state *mds;
+	struct vm_area_struct *vma;
+	struct inode *inode;
+	void *owner;
+
+	mds = container_of(work, typeof(*mds), mds_work.work);
+
+	clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+	vma = ACCESS_ONCE(mds->mds_vma);
+	inode = mds->mds_inode;
+	if (vma) {
+		unsigned long len = vma->vm_end - vma->vm_start;
+		loff_t start = (loff_t) vma->vm_pgoff * PAGE_SIZE;
+
+		unmap_mapping_range(inode->i_mapping, start, len, 1);
+	}
+	owner = mds;
+	vfs_setlease(vma->vm_file, F_UNLCK, NULL, &owner);
+
+	put_map_direct(mds);
+}
+
+static bool map_direct_lm_break(struct file_lock *fl)
+{
+	struct map_direct_state *mds = fl->fl_owner;
+
+	/*
+	 * Given that we need to take sleeping locks to invalidate the
+	 * mapping we schedule that work with the original timeout set
+	 * by the file-locks core. Then we tell the core to hold off on
+	 * continuing with the lease break until the delayed work
+	 * completes the invalidation and the lease unlock.
+	 *
+	 * Note that this assumes that i_mapdcount is protecting against
+	 * block-map modifying write-faults since we are unable to use
+	 * leases in that path due to locking constraints.
+	 */
+	if (!test_and_set_bit(MAPDIRECT_BREAK, &mds->mds_state)) {
+		schedule_delayed_work(&mds->mds_work, lease_break_time * HZ);
+		kill_fasync(&fl->fl_fasync, SIGIO, POLL_MSG);
+	}
+
+	/* Tell the core lease code to wait for delayed work completion */
+	fl->fl_break_time = 0;
+
+	return false;
+}
+
+static int map_direct_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	WARN_ON(!(arg & F_UNLCK));
+
+	return lease_modify(fl, arg, dispose);
+}
+
+static void map_direct_lm_setup(struct file_lock *fl, void **priv)
+{
+	struct file *file = fl->fl_file;
+	struct map_direct_state *mds = *priv;
+	struct fasync_struct *fa = mds->mds_fa;
+
+	/*
+	 * Comment copied from lease_setup():
+	 * fasync_insert_entry() returns the old entry if any. If there was no
+	 * old entry, then it used "priv" and inserted it into the fasync list.
+	 * Clear the pointer to indicate that it shouldn't be freed.
+	 */
+	if (!fasync_insert_entry(fa->fa_fd, file, &fl->fl_fasync, fa))
+		*priv = NULL;
+
+	__f_setown(file, task_pid(current), PIDTYPE_PID, 0);
+}
+
+static const struct lock_manager_operations map_direct_lm_ops = {
+	.lm_break = map_direct_lm_break,
+	.lm_change = map_direct_lm_change,
+	.lm_setup = map_direct_lm_setup,
+};
+
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma)
+{
+	struct map_direct_state *mds = kzalloc(sizeof(*mds), GFP_KERNEL);
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct fasync_struct *fa;
+	struct file_lock *fl;
+	void *owner = mds;
+	int rc = -ENOMEM;
+
+	if (!mds)
+		return ERR_PTR(-ENOMEM);
+
+	mds->mds_vma = vma;
+	atomic_set(&mds->mds_ref, 1);
+	atomic_set(&mds->mds_vmaref, 1);
+	set_bit(MAPDIRECT_VALID, &mds->mds_state);
+	mds->mds_inode = inode;
+	ihold(inode);
+	INIT_DELAYED_WORK(&mds->mds_work, map_direct_invalidate);
+
+	fa = fasync_alloc();
+	if (!fa)
+		goto err_fasync_alloc;
+	mds->mds_fa = fa;
+	fa->fa_fd = fd;
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &map_direct_lm_ops;
+	fl->fl_flags = FL_LAYOUT;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = mds;
+	atomic_inc(&mds->mds_ref);
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(file, fl->fl_type, &fl, &owner);
+	if (rc)
+		goto err_setlease;
+	if (fl) {
+		WARN_ON(1);
+		owner = mds;
+		vfs_setlease(file, F_UNLCK, NULL, &owner);
+		owner = NULL;
+		rc = -ENXIO;
+		goto err_setlease;
+	}
+
+	return mds;
+
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	/* if owner is NULL then the lease machinery is responsible for @fa */
+	if (owner)
+		fasync_free(fa);
+err_fasync_alloc:
+	iput(inode);
+	kfree(mds);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_GPL(map_direct_register);
diff --git a/include/linux/mapdirect.h b/include/linux/mapdirect.h
new file mode 100644
index 000000000000..5491aa550e55
--- /dev/null
+++ b/include/linux/mapdirect.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __MAPDIRECT_H__
+#define __MAPDIRECT_H__
+#include <linux/err.h>
+
+struct inode;
+struct work_struct;
+struct vm_area_struct;
+struct map_direct_state;
+
+#if IS_ENABLED(CONFIG_FS_DAX)
+struct map_direct_state *map_direct_register(int fd, struct vm_area_struct *vma);
+bool test_map_direct_valid(struct map_direct_state *mds);
+void generic_map_direct_open(struct vm_area_struct *vma);
+void generic_map_direct_close(struct vm_area_struct *vma);
+#else
+static inline struct map_direct_state *map_direct_register(int fd,
+		struct vm_area_struct *vma)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+static inline bool test_map_direct_valid(struct map_direct_state *mds)
+{
+	return false;
+}
+#define generic_map_direct_open NULL
+#define generic_map_direct_close NULL
+#endif
+#endif /* __MAPDIRECT_H__ */
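
[ Usage sketch, not part of the patch: a filesystem that supports
  MAP_DIRECT registers the state from its mmap path and points the vma
  at the generic open/close helpers. The names example_dax_vm_ops and
  example_mmap_direct below are hypothetical; the real XFS wiring
  arrives in "xfs: wire up MAP_DIRECT". ]

#include <linux/mapdirect.h>
#include <linux/mm.h>

/* hypothetical vm_operations for a MAP_DIRECT-capable DAX file */
static const struct vm_operations_struct example_dax_vm_ops = {
	.open	= generic_map_direct_open,
	.close	= generic_map_direct_close,
	/* .fault / .huge_fault handlers for the DAX mapping go here */
};

/* sketch of an ->mmap_validate()-style hook wiring up MAP_DIRECT */
static int example_mmap_direct(struct file *filp, struct vm_area_struct *vma,
		int fd)
{
	struct map_direct_state *mds = map_direct_register(fd, vma);

	if (IS_ERR(mds))
		return PTR_ERR(mds);

	/* fault handlers later consult test_map_direct_valid(mds) */
	vma->vm_private_data = mds;
	vma->vm_ops = &example_dax_vm_ops;
	return 0;
}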


^ permalink raw reply related	[flat|nested] 116+ messages in thread


* [PATCH v9 4/6] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
@ 2017-10-12  0:47   ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-api, Dave Chinner, linux-xfs,
	linux-mm, linux-fsdevel, Christoph Hellwig

Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.
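
[ Caller sketch, not part of the patch: with xfs_break_layouts() in its
  own object file, a non-pNFS user such as the MAP_DIRECT mmap path can
  drain FL_LAYOUT leases under the IOLOCK in the usual way. The function
  example_break_for_map_direct() is hypothetical; it assumes the usual
  XFS headers, as in xfs_layout.c below. ]

static int example_break_for_map_direct(struct xfs_inode *ip)
{
	uint	iolock = XFS_IOLOCK_EXCL;
	int	error;

	xfs_ilock(ip, iolock);
	/* may cycle the IOLOCK while waiting for lease holders */
	error = xfs_break_layouts(VFS_I(ip), &iolock);
	xfs_iunlock(ip, iolock);
	return error;
}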

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/Kconfig      |    4 ++++
 fs/xfs/Makefile     |    1 +
 fs/xfs/xfs_file.c   |    1 +
 fs/xfs/xfs_ioctl.c  |    1 +
 fs/xfs/xfs_iops.c   |    1 +
 fs/xfs/xfs_layout.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_layout.h |   13 +++++++++++++
 fs/xfs/xfs_pnfs.c   |   31 +------------------------------
 fs/xfs/xfs_pnfs.h   |    8 --------
 9 files changed, 64 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
 	  result in warnings.
 
 	  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+	def_bool y
+	depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)	+= xfs_layout.o
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..3cc7292b2e9f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -39,6 +39,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_layout.h"
 
 #include <linux/dcache.h>
 #include <linux/falloc.h>
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..8bfd6db4f06d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
 #include "xfs_fsmap.h"
+#include "xfs_layout.h"
 
 #include <linux/capability.h>
 #include <linux/cred.h>
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 17081c77ef86..4bc2e5ef1a3a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -39,6 +39,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
+#include "xfs_layout.h"
 
 #include <linux/capability.h>
 #include <linux/xattr.h>
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index 000000000000..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include <linux/fs.h>
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where we aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+	struct inode		*inode,
+	uint			*iolock)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+	while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
+		xfs_iunlock(ip, *iolock);
+		error = break_layout(inode, true);
+		*iolock = XFS_IOLOCK_EXCL;
+		xfs_ilock(ip, *iolock);
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index 000000000000..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+	return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..ee9de16d7672 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -18,36 +18,7 @@
 #include "xfs_shared.h"
 #include "xfs_bit.h"
 #include "xfs_pnfs.h"
-
-/*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it during the write operation, where aren't concerned
- * about exposing unallocated blocks but just want to provide basic
- * synchronization between a local writer and pNFS clients.  mmap writes would
- * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
- */
-int
-xfs_break_layouts(
-	struct inode		*inode,
-	uint			*iolock)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	int			error;
-
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
-
-	while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
-		xfs_iunlock(ip, *iolock);
-		error = break_layout(inode, true);
-		*iolock = XFS_IOLOCK_EXCL;
-		xfs_ilock(ip, *iolock);
-	}
-
-	return error;
-}
+#include "xfs_layout.h"
 
 /*
  * Get a unique ID including its location so that the client can identify
diff --git a/fs/xfs/xfs_pnfs.h b/fs/xfs/xfs_pnfs.h
index b587cb99b2b7..5a2710dd5478 100644
--- a/fs/xfs/xfs_pnfs.h
+++ b/fs/xfs/xfs_pnfs.h
@@ -7,13 +7,5 @@ int xfs_fs_map_blocks(struct inode *inode, loff_t offset, u64 length,
 		struct iomap *iomap, bool write, u32 *device_generation);
 int xfs_fs_commit_blocks(struct inode *inode, struct iomap *maps, int nr_maps,
 		struct iattr *iattr);
-
-int xfs_break_layouts(struct inode *inode, uint *iolock);
-#else
-static inline int
-xfs_break_layouts(struct inode *inode, uint *iolock)
-{
-	return 0;
-}
 #endif /* CONFIG_EXPORTFS_BLOCK_OPS */
 #endif /* _XFS_PNFS_H */


^ permalink raw reply related	[flat|nested] 116+ messages in thread


* [PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait()
@ 2017-10-12  0:47   ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-api, Dave Chinner, linux-xfs,
	linux-mm, Al Viro, linux-fsdevel, Christoph Hellwig

In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to start the lease break
process in contexts where we can not sleep waiting for the lease break
timeout.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.
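
[ Sketch, not part of the patch: roughly how a DAX filesystem's
  ->iomap_begin() would use the helper on a block-allocating write
  path. example_iomap_begin() is hypothetical; the error is typically
  -EWOULDBLOCK when a lease is outstanding, and it is returned to the
  caller while the lease holder is signalled to give up the lease. ]

#include <linux/fs.h>
#include <linux/iomap.h>

static int example_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned flags, struct iomap *iomap)
{
	int error;

	if ((flags & IOMAP_WRITE) && IS_DAX(inode)) {
		/* cannot sleep here, only kick off the lease break */
		error = break_layout_nowait(inode);
		if (error)
			return error;
	}

	/* ... normal block mapping / allocation continues here ... */
	return 0;
}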

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_iomap.c  |    3 +++
 fs/xfs/xfs_layout.c |    5 ++++-
 include/linux/fs.h  |    9 +++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..840e4080afb5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1055,6 +1055,9 @@ xfs_file_iomap_begin(
 			error = -EAGAIN;
 			goto out_unlock;
 		}
+		error = break_layout_nowait(inode);
+		if (error)
+			goto out_unlock;
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..7a633b6e9397 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of break_layout_nowait in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..2b030a2fccc7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2364,6 +2364,15 @@ static inline int break_layout(struct inode *inode, bool wait)
 
 #endif /* CONFIG_FILE_LOCKING */
 
+/*
+ * For use in paths where we can not wait for the layout to be recalled,
+ * for example when we are holding mmap_sem.
+ */
+static inline int break_layout_nowait(struct inode *inode)
+{
+	return break_layout(inode, false);
+}
+
 /* fs/open.c */
 struct audit_names;
 struct filename {


^ permalink raw reply related	[flat|nested] 116+ messages in thread


* [PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait()
@ 2017-10-12  0:47   ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Jan Kara, Darrick J. Wong, linux-api, Dave Chinner, linux-xfs,
	linux-mm, Jeff Moyer, Al Viro, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to start the lease break
process in contexts where we can not sleep waiting for the lease break
timeout.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_iomap.c  |    3 +++
 fs/xfs/xfs_layout.c |    5 ++++-
 include/linux/fs.h  |    9 +++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..840e4080afb5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1055,6 +1055,9 @@ xfs_file_iomap_begin(
 			error = -EAGAIN;
 			goto out_unlock;
 		}
+		error = break_layout_nowait(inode);
+		if (error)
+			goto out_unlock;
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..7a633b6e9397 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of break_layout_nowait in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..2b030a2fccc7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2364,6 +2364,15 @@ static inline int break_layout(struct inode *inode, bool wait)
 
 #endif /* CONFIG_FILE_LOCKING */
 
+/*
+ * For use in paths where we can not wait for the layout to be recalled,
+ * for example when we are holding mmap_sem.
+ */
+static inline int break_layout_nowait(struct inode *inode)
+{
+	return break_layout(inode, false);
+}
+
 /* fs/open.c */
 struct audit_names;
 struct filename {


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait()
@ 2017-10-12  0:47   ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw
  Cc: Jan Kara, Darrick J. Wong, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Dave Chinner, linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, Al Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Ross Zwisler,
	Christoph Hellwig

In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to start the lease break
process in contexts where we can not sleep waiting for the lease break
timeout.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and can not synchronously wait
for any FL_LAYOUT leases to be released. In particular an iomap mmap
fault handler running under mmap_sem can not unlock that semaphore and
wait for these leases to be unlocked. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Cc: Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Cc: "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Suggested-by: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/xfs/xfs_iomap.c  |    3 +++
 fs/xfs/xfs_layout.c |    5 ++++-
 include/linux/fs.h  |    9 +++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..840e4080afb5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1055,6 +1055,9 @@ xfs_file_iomap_begin(
 			error = -EAGAIN;
 			goto out_unlock;
 		}
+		error = break_layout_nowait(inode);
+		if (error)
+			goto out_unlock;
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..7a633b6e9397 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of break_layout_nowait in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..2b030a2fccc7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2364,6 +2364,15 @@ static inline int break_layout(struct inode *inode, bool wait)
 
 #endif /* CONFIG_FILE_LOCKING */
 
+/*
+ * For use in paths where we can not wait for the layout to be recalled,
+ * for example when we are holding mmap_sem.
+ */
+static inline int break_layout_nowait(struct inode *inode)
+{
+	return break_layout(inode, false);
+}
+
 /* fs/open.c */
 struct audit_names;
 struct filename {

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v9 6/6] xfs: wire up MAP_DIRECT
  2017-10-12  0:47 ` Dan Williams
  (?)
  (?)
@ 2017-10-12  0:47   ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  0:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: J. Bruce Fields, Jan Kara, Arnd Bergmann, Darrick J. Wong,
	linux-api, Dave Chinner, linux-xfs, linux-mm, Alexander Viro,
	linux-fsdevel, Jeff Layton, Christoph Hellwig

MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

      * The fault would trigger the filesystem to allocate blocks
      * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES,
  MAP_DIRECT also requires permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag.

  EPERM The file does not permit MAP_DIRECT mappings. Potential reasons
  are that DAX access is not available or the file has reflink extents.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
         might require block-map updates, or the lease timed out and the
         kernel invalidated the mapping.
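
As a rough userspace sketch (not part of this patch), establishing such
a mapping might look as follows. It assumes headers carrying the
MAP_SHARED_VALIDATE and MAP_DIRECT definitions from this series; the
path, mapping size, and the fallback #define values are hypothetical:

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <stdio.h>

	#ifndef MAP_SHARED_VALIDATE
	#define MAP_SHARED_VALIDATE 0x03	/* assumed value, see patch 1 of this series */
	#endif
	#ifndef MAP_DIRECT
	#define MAP_DIRECT 0x80000		/* value proposed by this patch (asm-generic) */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;			/* hypothetical mapping size */
		int fd = open("/mnt/dax/data", O_RDWR);	/* hypothetical DAX file */
		char *p;

		if (fd < 0) {
			perror("open");
			return 1;
		}

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");	/* EOPNOTSUPP, EPERM, or EACCES as above */
			close(fd);
			return 1;
		}

		/* stores are now flushed from userspace; a SIGBUS handler
		 * (not shown) would deal with lease-break invalidation */
		p[0] = 1;

		munmap(p, len);
		close(fd);
		return 0;
	}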

Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/Kconfig                  |    2 -
 fs/xfs/xfs_file.c               |  107 ++++++++++++++++++++++++++++++++++++++-
 include/linux/mman.h            |    3 +
 include/uapi/asm-generic/mman.h |    1 
 4 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
 	def_bool y
-	depends on EXPORTFS_BLOCK_OPS
+	depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3cc7292b2e9f..71dbe0307746 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -41,12 +41,22 @@
 #include "xfs_reflink.h"
 #include "xfs_layout.h"
 
+#include <linux/mman.h>
 #include <linux/dcache.h>
 #include <linux/falloc.h>
 #include <linux/pagevec.h>
+#include <linux/mapdirect.h>
 #include <linux/backing-dev.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static bool
+xfs_vma_is_direct(
+	struct vm_area_struct	*vma)
+{
+	return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1013,6 +1023,25 @@ xfs_file_llseek(
 }
 
 /*
+ * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
+ * valid. See map_direct_invalidate.
+ */
+static bool
+xfs_vma_has_direct_lease(
+	struct vm_area_struct	*vma)
+{
+	/* Non MAP_DIRECT vmas do not require layout leases */
+	if (!xfs_vma_is_direct(vma))
+		return true;
+
+	if (!test_map_direct_valid(vma->vm_private_data))
+		return false;
+
+	/* We have a valid lease */
+	return true;
+}
+
+/*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
  *
@@ -1028,7 +1057,8 @@ __xfs_filemap_fault(
 	enum page_entry_size	pe_size,
 	bool			write_fault)
 {
-	struct inode		*inode = file_inode(vmf->vma->vm_file);
+	struct vm_area_struct	*vma = vmf->vma;
+	struct inode		*inode = file_inode(vma->vm_file);
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			ret;
 
@@ -1036,10 +1066,15 @@ __xfs_filemap_fault(
 
 	if (write_fault) {
 		sb_start_pagefault(inode->i_sb);
-		file_update_time(vmf->vma->vm_file);
+		file_update_time(vma->vm_file);
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	if (!xfs_vma_has_direct_lease(vma)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_unlock;
+	}
+
 	if (IS_DAX(inode)) {
 		ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
 	} else {
@@ -1048,6 +1083,8 @@ __xfs_filemap_fault(
 		else
 			ret = filemap_fault(vmf);
 	}
+
+out_unlock:
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (write_fault)
@@ -1119,6 +1156,17 @@ xfs_filemap_pfn_mkwrite(
 
 }
 
+static const struct vm_operations_struct xfs_file_vm_direct_ops = {
+	.fault		= xfs_filemap_fault,
+	.huge_fault	= xfs_filemap_huge_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= xfs_filemap_pfn_mkwrite,
+
+	.open		= generic_map_direct_open,
+	.close		= generic_map_direct_close,
+};
+
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.huge_fault	= xfs_filemap_huge_fault,
@@ -1139,6 +1187,60 @@ xfs_file_mmap(
 	return 0;
 }
 
+static int
+xfs_file_mmap_direct(
+	struct file		*filp,
+	struct vm_area_struct	*vma,
+	int			fd)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct map_direct_state	*mds;
+
+	/*
+	 * Not permitted to set up MAP_DIRECT mapping over reflinked or
+	 * non-DAX extents since reflink may cause block moves /
+	 * copy-on-write, and non-DAX is by definition always indirect
+	 * through the page cache.
+	 */
+	if (xfs_is_reflink_inode(ip))
+		return -EPERM;
+	if (!IS_DAX(inode))
+		return -EPERM;
+
+	mds = map_direct_register(fd, vma);
+	if (IS_ERR(mds))
+		return PTR_ERR(mds);
+
+	file_accessed(filp);
+	vma->vm_ops = &xfs_file_vm_direct_ops;
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+
+	/*
+	 * generic_map_direct_{open,close} expect ->vm_private_data is
+	 * set to the result of map_direct_register
+	 */
+	vma->vm_private_data = mds;
+	return 0;
+}
+
+#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT)
+
+static int
+xfs_file_mmap_validate(
+	struct file		*filp,
+	struct vm_area_struct	*vma,
+	unsigned long		map_flags,
+	int			fd)
+{
+	if (map_flags & ~(XFS_MAP_SUPPORTED))
+		return -EOPNOTSUPP;
+
+	if ((map_flags & MAP_DIRECT) == 0)
+		return xfs_file_mmap(filp, vma);
+	return xfs_file_mmap_direct(filp, vma, fd);
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
@@ -1150,6 +1252,7 @@ const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.mmap_validate	= xfs_file_mmap_validate,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 94b63b4d71ff..fab393a9dda9 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -20,6 +20,9 @@
 #ifndef MAP_HUGE_1GB
 #define MAP_HUGE_1GB 0
 #endif
+#ifndef MAP_DIRECT
+#define MAP_DIRECT 0
+#endif
 #ifndef MAP_UNINITIALIZED
 #define MAP_UNINITIALIZED 0
 #endif
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 7162cd4cca73..c916f22008e0 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NONBLOCK	0x10000		/* do not block on IO */
 #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
+#define MAP_DIRECT	0x80000		/* leased block map (layout) for DAX */
 
 /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
 


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
  2017-10-12  0:47   ` Dan Williams
  (?)
  (?)
@ 2017-10-12  1:21     ` Al Viro
  -1 siblings, 0 replies; 116+ messages in thread
From: Al Viro @ 2017-10-12  1:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Darrick J. Wong, linux-api, linux-nvdimm, Dave Chinner,
	linux-xfs, linux-mm, linux-fsdevel, Andrew Morton,
	Christoph Hellwig

On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
> block map changes while the file is mapped. It requires the fd to setup
> an fasync_struct for signalling lease break events to the lease holder.

*UGH*

That looks like one hell of a bad API.  You are not even guaranteed that
the descriptor will still be open by the time you pass it down to your
helper, never mind the moment when the event actually happens...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
  2017-10-12  1:21     ` Al Viro
  (?)
  (?)
@ 2017-10-12  1:28       ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  1:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, linux-fsdevel, Andrew Morton,
	Christoph Hellwig

On Wed, Oct 11, 2017 at 6:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>> block map changes while the file is mapped. It requires the fd to setup
>> an fasync_struct for signalling lease break events to the lease holder.
>
> *UGH*
>
> That looks like one hell of a bad API.  You are not even guaranteed that
> the descriptor will still be open by the time you pass it down to your
> helper, never mind the moment when the event actually happens...

What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?
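
For reference, the existing F_SETLEASE pattern I have in mind looks
roughly like this (a minimal sketch; the path is hypothetical, most
error handling is omitted, and the lease ends on F_UNLCK or when the
descriptor is closed):

	#define _GNU_SOURCE		/* for F_SETLEASE */
	#include <fcntl.h>
	#include <signal.h>
	#include <unistd.h>

	static void lease_break(int sig)
	{
		(void)sig;
		/* relinquish access, then F_UNLCK within lease-break-time */
	}

	int main(void)
	{
		int fd = open("/tmp/leased-file", O_RDONLY);	/* hypothetical */

		if (fd < 0)
			return 1;

		signal(SIGIO, lease_break);	/* SIGIO is the default lease-break signal */
		fcntl(fd, F_SETLEASE, F_RDLCK);	/* read lease on a read-only descriptor */

		pause();			/* work while holding the lease */

		fcntl(fd, F_SETLEASE, F_UNLCK);
		close(fd);
		return 0;
	}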

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
  2017-10-12  1:28       ` Dan Williams
  (?)
  (?)
@ 2017-10-12  2:17         ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  2:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, linux-fsdevel, Andrew Morton,
	Christoph Hellwig

On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>> block map changes while the file is mapped. It requires the fd to setup
>>> an fasync_struct for signalling lease break events to the lease holder.
>>
>> *UGH*
>>
>> That looks like one hell of a bad API.  You are not even guaranteed that
>> the descriptor will still be open by the time you pass it down to your
>> helper, never mind the moment when the event actually happens...
>
> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?

Ugh, so I think the difference with F_SETLEASE is that the lease ends
when the fd is closed. In the mmap case the lease follows the lifetime
of the vma. I'll rethink this interface...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
@ 2017-10-12  2:17         ` Dan Williams
  0 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  2:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Darrick J. Wong, Linux API,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, linux-fsdevel,
	Andrew Morton, Christoph Hellwig

On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>> block map changes while the file is mapped. It requires the fd to setup
>>> an fasync_struct for signalling lease break events to the lease holder.
>>
>> *UGH*
>>
>> That looks like one hell of a bad API.  You are not even guaranteed that
>> descriptor will remain be still open by the time you pass it down to your
>> helper, nevermind the moment when event actually happens...
>
> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?

Ugh, so I think the difference with F_SETLEASE is that the lease ends
when the fd is closed. In the mmap case the lease follows the lifetime
of the vma. I'll rethink this interface...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()
  2017-10-12  2:17         ` Dan Williams
  (?)
  (?)
@ 2017-10-12  3:44           ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12  3:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, linux-fsdevel, Andrew Morton,
	Christoph Hellwig

On Wed, Oct 11, 2017 at 7:17 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>>> block map changes while the file is mapped. It requires the fd to setup
>>>> an fasync_struct for signalling lease break events to the lease holder.
>>>
>>> *UGH*
>>>
>>> That looks like one hell of a bad API.  You are not even guaranteed that
>>> descriptor will still be open by the time you pass it down to your
>>> helper, never mind the moment when the event actually happens...
>>
>> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?
>
> Ugh, so I think the difference with F_SETLEASE is that the lease ends
> when the fd is closed. In the mmap case the lease follows the lifetime
> of the vma. I'll rethink this interface...

I'm not seeing a lot of good options outside of documenting that if
you close the fd that is registered with MAP_DIRECT you may still get
SIGIO notifications with si_fd set to the stale fd.
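
For context, the existing fcntl lease pattern being compared against looks
roughly like this (standard F_SETLEASE / F_SETSIG API, error handling
omitted); with a non-zero F_SETSIG the SIGIO handler sees si_fd, which is
exactly where a stale descriptor would show up if the registered fd had
already been closed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t broken_fd = -1;

static void lease_break(int sig, siginfo_t *si, void *ctx)
{
	broken_fd = si->si_fd;	/* the fd the lease was registered on */
}

int main(int argc, char **argv)
{
	struct sigaction sa = { .sa_sigaction = lease_break,
				.sa_flags = SA_SIGINFO };
	int fd = open(argv[1], O_RDONLY);

	sigaction(SIGIO, &sa, NULL);
	fcntl(fd, F_SETOWN, getpid());	/* route the signal to this process */
	fcntl(fd, F_SETSIG, SIGIO);	/* non-zero F_SETSIG => siginfo carries si_fd */
	fcntl(fd, F_SETLEASE, F_RDLCK);	/* read lease, broken by an open for write */

	pause();			/* wait for the lease break notification */
	printf("lease broken on fd %d\n", (int)broken_fd);

	fcntl(fd, F_SETLEASE, F_UNLCK);	/* closing fd would also drop the lease */
	close(fd);
	return 0;
}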

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  2017-10-12  0:47   ` Dan Williams
  (?)
  (?)
@ 2017-10-12 13:51     ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2017-10-12 13:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Arnd Bergmann, linux-nvdimm, linux-api, linux-xfs,
	linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton,
	Linus Torvalds, Christoph Hellwig

Hi,

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506faceae..2649c00581a0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		struct inode *inode = file_inode(file);
>  
>  		switch (flags & MAP_TYPE) {
> +		case MAP_SHARED_VALIDATE:
> +			if ((flags & ~LEGACY_MAP_MASK) == 0) {
> +				/*
> +				 * If all legacy mmap flags, downgrade
> +				 * to MAP_SHARED, i.e. invoke ->mmap()
> +				 * instead of ->mmap_validate()
> +				 */
> +				flags &= ~MAP_TYPE;
> +				flags |= MAP_SHARED;
> +			} else if (!file->f_op->mmap_validate)
> +				return -EOPNOTSUPP;
> +			/* fall through */
>  		case MAP_SHARED:
>  			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
>  				return -EACCES;

When thinking a bit more about this I've realized one problem: currently a
user can call mmap() with the MAP_SHARED type and MAP_SYNC or MAP_DIRECT
flags and they will get the new semantics (if the kernel happens to support
them). I think that is undesirable and we should force usage of
MAP_SHARED_VALIDATE when you want to use flags outside of LEGACY_MAP_MASK.
So I'd just mask off non-legacy flags for MAP_SHARED mappings (so they would
be silently ignored, as they have been until now).
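
To make the difference concrete, a rough userspace sketch; MAP_SHARED_VALIDATE
and MAP_SYNC come from this proposed series, so the values below are
assumptions rather than released uapi constants, and the behaviour shown
assumes a kernel with the series applied:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Assumed values from the proposed series; not yet in uapi headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x3
#endif
#ifndef MAP_SYNC
#define MAP_SYNC		0x80000
#endif

void *map_sync(int fd, size_t len, int enforce)
{
	/*
	 * With plain MAP_SHARED an unknown flag is silently ignored, so
	 * MAP_SYNC becomes a best-effort hint.  With MAP_SHARED_VALIDATE
	 * every flag is checked and an unsupported one fails with
	 * EOPNOTSUPP instead of being dropped.
	 */
	int type = enforce ? MAP_SHARED_VALIDATE : MAP_SHARED;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       type | MAP_SYNC, fd, 0);

	if (p == MAP_FAILED && errno == EOPNOTSUPP)
		fprintf(stderr, "MAP_SYNC not supported on this file\n");
	return p;
}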

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-12  0:47 ` Dan Williams
  (?)
@ 2017-10-12 14:23   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-12 14:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, linux-xfs, Jan Kara, Arnd Bergmann,
	Darrick J. Wong, linux-api, Dave Chinner, Christoph Hellwig,
	J. Bruce Fields, linux-mm, Jeff Moyer, Al Viro, Andy Lutomirski,
	Ross Zwisler, linux-fsdevel, Jeff Layton, Linus Torvalds,
	Andrew Morton

Sorry for chiming in so late, been extremely busy lately.

From quickly glancing over what the now finally described use case is
(which contradicts the subject btw - it's not about flushing, it's
about not removing block mapping under a MR) and the previous comments
I think that mmap is simply the wrong kind of interface for this.

What we want is support for a new kind of userspace memory registration in the
RDMA code that uses the pnfs export interface, both getting the block (or
rather byte in this case) mapping, and also getting the FL_LAYOUT lease for the
memory registration.

That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  2017-10-12 13:51     ` Jan Kara
  (?)
@ 2017-10-12 16:32       ` Linus Torvalds
  -1 siblings, 0 replies; 116+ messages in thread
From: Linus Torvalds @ 2017-10-12 16:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Arnd Bergmann, linux-nvdimm, Linux API, Christoph Hellwig,
	linux-xfs, linux-mm, Andy Lutomirski, linux-fsdevel,
	Andrew Morton

On Thu, Oct 12, 2017 at 6:51 AM, Jan Kara <jack@suse.cz> wrote:
>
> When thinking a bit more about this I've realized one problem: currently a
> user can call mmap() with the MAP_SHARED type and MAP_SYNC or MAP_DIRECT
> flags and they will get the new semantics (if the kernel happens to support
> them). I think that is undesirable [..]

Why?

If you have a performance preference for MAP_DIRECT or something like
that, but you don't want to *enforce* it, you'd use just plain
MAP_SHARED with it.

I.e. there may well be "I want this to work, possibly with downsides" issues.

So it seems to be a reasonable model, and disallowing it seems to
limit people and not really help anything.

                 Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-12 14:23   ` Christoph Hellwig
  (?)
@ 2017-10-12 17:41     ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-12 17:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, linux-fsdevel,
	Linus Torvalds, Jeff Layton, Jason Gunthorpe

On Thu, Oct 12, 2017 at 7:23 AM, Christoph Hellwig <hch@lst.de> wrote:
> Sorry for chiming in so late, been extremely busy lately.
>
> From quickly glancing over what the now finally described use case is
> (which contradicts the subject btw - it's not about flushing, it's
> about not removing block mapping under a MR) and the previous comments
> I think that mmap is simply the wrong kind of interface for this.
>
> What we want is support for a new kind of userspace memory registration in the
> RDMA code that uses the pnfs export interface, both getting the block (or
> rather byte in this case) mapping, and also getting the FL_LAYOUT lease for the
> memory registration.
>
> That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.

...and this is exactly my plan.

So, you're jumping into this review at v9 where I've split the patches
that take an initial MAP_DIRECT lease out from the patches that take
FL_LAYOUT leases at memory registration time. You can see a previous
attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
flush" which should be in your inbox.

I'm not proposing mmap as the memory registration interface; it's the
"register for notification of lease break" interface. Here's my
proposed sequence:

addr = mmap(..., MAP_DIRECT.., fd); <- register a vma for "direct"
memory registrations with an FL_LAYOUT lease that at a lease break
event sends SIGIO on the fd used for mmap.

ibv_reg_mr(..., addr, ...); <- check for a valid MAP_DIRECT vma, and
take out another FL_LAYOUT lease. This lease force revokes the RDMA
mapping when it expires, and it relies on the process receiving SIGIO
as the 'break' notification.

fallocate(fd, PUNCH_HOLE...) <- breaks all the FL_LAYOUT leases, the
vma owner gets notified by fd.
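
A hedged userspace sketch of that sequence; MAP_DIRECT and MAP_SHARED_VALIDATE
are the flags proposed by this series (the numeric values below are
illustrative placeholders, not real uapi values), while the ibv_* calls are
the standard libibverbs API. Error handling is omitted:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Flags proposed by this series; values are placeholders for illustration,
 * they are not in any released uapi header. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x3
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT		0x08
#endif

struct ibv_mr *register_dax_mr(struct ibv_pd *pd, const char *path, size_t len)
{
	/* fd is deliberately kept open: per the discussion above, closing it
	 * before munmap() can leave SIGIO notifications carrying a stale fd */
	int fd = open(path, O_RDWR);

	/* 1) MAP_DIRECT registration: the vma gets an FL_LAYOUT lease, and
	 *    lease break events are signalled as SIGIO on this fd */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);

	/* 2) memory registration: per the proposal this checks for a valid
	 *    MAP_DIRECT vma and takes another FL_LAYOUT lease */
	struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);

	/* 3) a later fallocate(fd, FALLOC_FL_PUNCH_HOLE, ...) or truncate
	 *    elsewhere breaks the leases and the vma owner gets SIGIO */
	return mr;
}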

Al rightly points out that the fd may be closed by the time the event
fires, since the lease follows the vma lifetime. I see two ways to
solve this: document that the process may get notifications on a stale
fd if close() happens before munmap(), or, similar to how we call
locks_remove_posix() in filp_close(), add a routine to disable any
lease notifiers on close(). I'll investigate the second option because
this seems to be a general problem with leases.

For RDMA I am presently re-working the implementation [1]. Inspired by
a discussion with Jason [2], I am going to add something like
ib_umem_ops to allow drivers to override the default policy of what
happens on a lease that expires. The default action is to invalidate
device access to the memory with iommu_unmap(), but I want to allow
for drivers to do something smarter or choose to not support DAX
mappings at all.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012785.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012793.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-12 17:41     ` Dan Williams
  (?)
@ 2017-10-13  6:57       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-13  6:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Jason Gunthorpe

On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
> So, you're jumping into this review at v9 where I've split the patches
> that take an initial MAP_DIRECT lease out from the patches that take
> FL_LAYOUT leases at memory registration time. You can see a previous
> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
> flush" which should be in your inbox.

The point is that your problem has absolutely nothing to do with mmap,
and all with get_user_pages.

get_user_pages on DAX doesn't give the same guarantees as on pagecache
or anonymous memory, and that is the problem we need to fix.  In fact
I'm pretty sure if we try hard enough (and we might have to try
very hard) we can see the same problem with plain direct I/O and without
any RDMA involved, e.g. do a larger direct I/O write to memory that is
mmap()ed from a DAX file, then truncate the DAX file and reallocate
the blocks, and we might corrupt that new file.  We'll probably need
a special setup where there is little other chance but to reallocate
those used blocks.
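
A rough, timing-dependent sketch of that scenario; the paths are assumptions
(an fsdax mount at /mnt/dax and a regular filesystem at /mnt/scratch with a
pre-existing source file), and whether corruption actually triggers depends
on the filesystem reallocating the freed blocks while the I/O is in flight,
so treat it purely as an illustration:

/* cc -O2 -pthread dax-gup-race.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(16UL << 20)

static int dax_fd;

static void *punch(void *arg)
{
	/* free the DAX blocks while the direct I/O may still be in flight,
	 * then let the filesystem hand them out to someone else */
	ftruncate(dax_fd, 0);
	ftruncate(dax_fd, LEN);
	return NULL;
}

int main(void)
{
	int src = open("/mnt/scratch/src", O_RDONLY | O_DIRECT);	/* assumed path */
	pthread_t t;

	dax_fd = open("/mnt/dax/buf", O_RDWR | O_CREAT, 0644);		/* assumed fsdax mount */
	ftruncate(dax_fd, LEN);

	/* buffer backed directly by the DAX file's blocks */
	void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED,
			 dax_fd, 0);

	pthread_create(&t, NULL, punch, NULL);

	/* "direct I/O write to memory": get_user_pages pins the DAX pages and
	 * the device DMAs file data into blocks that may no longer belong to
	 * dax_fd's file by the time the I/O completes */
	pread(src, buf, LEN, 0);

	pthread_join(t, NULL);
	return 0;
}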

So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.

Then we need to look into the special case of a long-living non-transient
get_user_pages that RDMA does - we can't just reject any truncate or
other operation for that, so that's where something like my layout
lease suggestion comes into play - but the call that should get the
lease is not the mmap - it's the memory registration call that does
the get_user_pages.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13  6:57       ` Christoph Hellwig
  (?)
@ 2017-10-13 15:14         ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-13 15:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-xfs, Jan Kara, Arnd Bergmann,
	Darrick J. Wong, Linux API, Dave Chinner, J. Bruce Fields,
	Linux MM, Jeff Moyer, Al Viro, Andy Lutomirski, Ross Zwisler,
	linux-fsdevel, Jeff Layton, Linus Torvalds, Andrew Morton,
	Jason Gunthorpe

On Thu, Oct 12, 2017 at 11:57 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
>> So, you're jumping into this review at v9 where I've split the patches
>> that take an initial MAP_DIRECT lease out from the patches that take
>> FL_LAYOUT leases at memory registration time. You can see a previous
>> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
>> flush" which should be in your inbox.
>
> The point is that your problem has absolutely nothing to do with mmap,
> and all with get_user_pages.
>
> get_user_pages on DAX doesn't give the same guarantees as on pagecache
> or anonymous memory, and that is the problem we need to fix.  In fact
> I'm pretty sure if we try hard enough (and we might have to try
> very hard) we can see the same problem with plain direct I/O and without
> any RDMA involved, e.g. do a larger direct I/O write to memory that is
> mmap()ed from a DAX file, then truncate the DAX file and reallocate
> the blocks, and we might corrupt that new file.  We'll probably need
> a special setup where there is little other chance but to reallocate
> those used blocks.

I'll take a harder look at this...

> So what we need to do first is to fix get_user_pages vs unmapping
> DAX mmap()ed blocks, be that from a hole punch, truncate, COW
> operation, etc.
>
> Then we need to look into the special case of a long-living non-transient
> get_user_pages that RDMA does - we can't just reject any truncate or
> other operation for that, so that's where something like my layout
> lease suggestion comes into play - but the call that should get the
> lease is not the mmap - it's the memory registration call that does
> the get_user_pages.

Yes, mmap is not the place to get the lease for a later
get_user_pages, and my patches do take an additional lease at
get_user_pages / MR init time. However, the mmap call has the
file descriptor for SIGIO; the MR-init call does not. If we delay all
of the setup to MR time then we need to invent a notification
scheme specific to RDMA, which seems like a waste to me when we can
generically signal an event on the fd for any event that affects any
of the vmas on the file. The FL_LAYOUT lease impacts the entire file,
so as far as I can see delaying the notification until MR-init is too
late, too granular, and too RDMA specific.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 15:14         ` Dan Williams
  (?)
  (?)
@ 2017-10-13 16:38           ` Jason Gunthorpe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jason Gunthorpe @ 2017-10-13 16:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:

> scheme specific to RDMA which seems like a waste to me when we can
> generically signal an event on the fd for any event that effects any
> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> so as far as I can see delaying the notification until MR-init is too
> late, too granular, and too RDMA specific.

But for RDMA a FD is not what we care about - we want the MR handle so
the app knows which MR needs fixing.

Jason

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 16:38           ` Jason Gunthorpe
@ 2017-10-13 17:01             ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-13 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>
>> scheme specific to RDMA which seems like a waste to me when we can
>> generically signal an event on the fd for any event that effects any
>> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> so as far as I can see delaying the notification until MR-init is too
>> late, too granular, and too RDMA specific.
>
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

I'd rather put the onus on userspace to remember where it used a
MAP_DIRECT mapping and be aware that all the mappings of that file are
subject to a lease break. Sure, we could build up a pile of kernel
infrastructure to notify on a per-MR basis, but I think that would
only be worth it if leases were range based. As it is, the entire file
is covered by a lease instance and all MRs that might reference that
file get one notification. That said, we can always arrange for a
per-driver callback at lease-break time so that it can do something
above and beyond the default notification.
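
To make that model concrete, here is a minimal userspace sketch. It assumes
MAP_DIRECT is requested together with MAP_SHARED_VALIDATE and that the lease
break is delivered as a signal on the mapped file's fd; the flag value and
the delivery mechanism shown are illustrative assumptions, not definitions
from this series:

/*
 * Sketch only: userspace remembers which file it mapped with MAP_DIRECT
 * and treats one notification on that fd as "every mapping of this file
 * may have lost its block-map guarantee".
 */
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x08     /* hypothetical value, for illustration only */
#endif

static volatile sig_atomic_t lease_broken;

static void on_lease_break(int sig)
{
    lease_broken = 1;
}

int main(void)
{
    size_t len = 2UL << 20;
    int fd = open("/mnt/dax/data", O_RDWR);

    if (fd < 0)
        return 1;

    signal(SIGIO, on_lease_break);  /* assumed notification channel */
    fcntl(fd, F_SETOWN, getpid());

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* ... flush from userspace, or hand p to an RDMA library ... */

    if (lease_broken) {
        /* all mappings of this fd are suspect; fall back to msync */
        msync(p, len, MS_SYNC);
    }
    munmap(p, len);
    close(fd);
    return 0;
}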

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 17:01             ` Dan Williams
@ 2017-10-13 17:31               ` Jason Gunthorpe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jason Gunthorpe @ 2017-10-13 17:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
> >
> >> scheme specific to RDMA which seems like a waste to me when we can
> >> generically signal an event on the fd for any event that effects any
> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> >> so as far as I can see delaying the notification until MR-init is too
> >> late, too granular, and too RDMA specific.
> >
> > But for RDMA a FD is not what we care about - we want the MR handle so
> > the app knows which MR needs fixing.
> 
> I'd rather put the onus on userspace to remember where it used a
> MAP_DIRECT mapping and be aware that all the mappings of that file are
> subject to a lease break. Sure, we could build up a pile of kernel
> infrastructure to notify on a per-MR basis, but I think that would
> only be worth it if leases were range based. As it is, the entire file
> is covered by a lease instance and all MRs that might reference that
> file get one notification. That said, we can always arrange for a
> per-driver callback at lease-break time so that it can do something
> above and beyond the default notification.

I don't think that really represents how lots of apps actually use
RDMA.

RDMA is often buried down in the software stack (eg in a MPI), and by
the time a mapping gets used for RDMA transfer the link between the
FD, mmap and the MR is totally opaque.

Having a MR specific notification means the low level RDMA libraries
have a chance to deal with everything for the app.

Eg consider a HPC app using MPI that uses some DAX aware library to
get DAX backed mmap's. It then passes memory in those mmaps to the
MPI library to do transfers. The MPI creates the MR on demand.

So, who should be responsible for MR coherency? Today we say the MPI
is responsible. But we can't really expect the MPI
to hook SIGIO and somehow try to reverse engineer what MRs are
impacted from a FD that may not even still be open.

I think, if you want to build a uAPI for notification of MR lease
break, then you need to show how it fits into the above software model:
 - How it can be hidden in a RDMA specific library
 - How lease break can be done hitlessly, so the library user never
   needs to know it is happening or see failed/missed transfers
 - Whatever fast path checking is needed does not kill performance

Jason

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 17:31               ` Jason Gunthorpe
@ 2017-10-13 18:22                 ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-13 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 10:31 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
>> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>> >
>> >> scheme specific to RDMA which seems like a waste to me when we can
>> >> generically signal an event on the fd for any event that effects any
>> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> >> so as far as I can see delaying the notification until MR-init is too
>> >> late, too granular, and too RDMA specific.
>> >
>> > But for RDMA a FD is not what we care about - we want the MR handle so
>> > the app knows which MR needs fixing.
>>
>> I'd rather put the onus on userspace to remember where it used a
>> MAP_DIRECT mapping and be aware that all the mappings of that file are
>> subject to a lease break. Sure, we could build up a pile of kernel
>> infrastructure to notify on a per-MR basis, but I think that would
>> only be worth it if leases were range based. As it is, the entire file
>> is covered by a lease instance and all MRs that might reference that
>> file get one notification. That said, we can always arrange for a
>> per-driver callback at lease-break time so that it can do something
>> above and beyond the default notification.
>
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>
> So, who should be responsible for MR coherency? Today we say the MPI
> is responsible. But we can't really expect the MPI
> to hook SIGIO and somehow try to reverse engineer what MRs are
> impacted from a FD that may not even still be open.

Ok, that's good insight that I didn't have. Userspace needs more help
than just an fd notification.

> I think, if you want to build a uAPI for notification of MR lease
> break, then you need to show how it fits into the above software model:
>  - How it can be hidden in a RDMA specific library

So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?
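
To sketch what that would look like from the application (or low level
library) side, the completion loop below uses only existing libibverbs
calls; the single assumption is that outstanding work requests against
the de-coherent MR would be flushed with that status:

#include <infiniband/verbs.h>
#include <stdio.h>

/*
 * Existing libibverbs API; the only assumed new behavior is that work
 * requests touching a de-coherent MR complete with IBV_WC_LOC_PROT_ERR.
 */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_LOC_PROT_ERR) {
            /* file coherency lost: the MR behind this WR needs
             * to be re-registered before the transfer is retried */
            fprintf(stderr, "wr %llu: %s\n",
                    (unsigned long long)wc.wr_id,
                    ibv_wc_status_str(wc.status));
            continue;
        }
        /* normal completion handling ... */
    }
}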

>  - How lease break can be done hitlessly, so the library user never
>    needs to know it is happening or see failed/missed transfers

iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.

>  - Whatever fast path checking is needed does not kill performance

What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 18:22                 ` Dan Williams
@ 2017-10-14  1:57                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jason Gunthorpe @ 2017-10-14  1:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> > So, who should be responsible for MR coherency? Today we say the MPI
> > is responsible. But we can't really expect the MPI
> > to hook SIGIO and somehow try to reverse engineer what MRs are
> > impacted from a FD that may not even still be open.
> 
> Ok, that's good insight that I didn't have. Userspace needs more help
> than just an fd notification.

Glad to help!

> > I think, if you want to build a uAPI for notification of MR lease
> > break, then you need to show how it fits into the above software model:
> >  - How it can be hidden in a RDMA specific library
> 
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.

So, you need a side channel of some kind, either in certain drivers or
generically..

> >  - How lease break can be done hitlessly, so the library user never
> >    needs to know it is happening or see failed/missed transfers
> 
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

Yes, if the iommu can be fenced properly it sounds doable.

> >  - Whatever fast path checking is needed does not kill performance
> 
> What do you consider a fast path? I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.

ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is that once you create a side channel you
also now have to check that side channel, and checking it at high
performance is quite hard.. Even quiescing things to be able to tear
down the MR
has performance implications on post send...

Now that I see this whole thing in this light it seems very similar
to the MPI-driven user space mmu notification ideas and has similar
challenges. FWIW, RDMA banged its head on this issue for 10 years and
it was ODP that emerged as the solution.

One option might be to use an async event notification 'MR
de-coherence' and rely on a main polling loop to catch it.

This is good enough for DAX because the lease requestor would wait
until the async event was processed. It would also be acceptable for
the general MPI case too, but only if this lease concept was wider
than just DAX, eg an MR leases a piece of a VMA, and if anything
changes that VMA (eg munmap, mmap, mremap, etc) then it has to wait
for the MR to release the lease. ie munmap would block until the
async event is processed. ODP-light in userspace, essentially.

IIRC this sort of suggestion was never explored, something like:

poll(fd)
event = ibv_read_async_event(fd)
if (event == MR_DECOHERENCE) {
    quiesce_network();   /* app/library stops posting to QPs that use the MR */
    ibv_restore_mr(mr);  /* proposed new verb: re-pin the MR over the new pages */
    restore_network();
}

The implementation of ibv_restore_mr would have to make a new MR that
pointed to the same virtual memory addresses, but was backed by the
*new* physical pages. This means it has to unblock the lease, and wait
for the lease requestor to complete executing.
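
Absent such a verb, the closest a library could get with today's verbs
is to drop the stale registration and re-register the same virtual
range once the new pages are in place, as sketched below. Note the new
MR comes back with new lkey/rkey values that would have to be
redistributed, which is part of why a dedicated restore verb matters
for doing this hitlessly:

#include <infiniband/verbs.h>

/*
 * Rough stand-in for the proposed ibv_restore_mr() using existing
 * verbs.  Caller must already have quiesced every QP that references
 * 'mr', and must push the new lkey/rkey back out afterwards.
 */
static struct ibv_mr *restore_mr(struct ibv_pd *pd, struct ibv_mr *mr)
{
    void *addr = mr->addr;
    size_t length = mr->length;

    if (ibv_dereg_mr(mr))
        return NULL;

    return ibv_reg_mr(pd, addr, length,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}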

Jason

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 16:38           ` Jason Gunthorpe
@ 2017-10-16  7:22             ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-16  7:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-xfs, Linux MM, Jeff Layton, Al Viro, Andy Lutomirski,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig

On Fri, Oct 13, 2017 at 10:38:22AM -0600, Jason Gunthorpe wrote:
> > scheme specific to RDMA which seems like a waste to me when we can
> > generically signal an event on the fd for any event that effects any
> > of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> > so as far as I can see delaying the notification until MR-init is too
> > late, too granular, and too RDMA specific.
> 
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

Yes.  Although the fd for the ibX device might be a good handle to
transport that information, unlike the fd for the mapped file.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 17:31               ` Jason Gunthorpe
  (?)
@ 2017-10-16  7:26                 ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-16  7:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Christoph Hellwig, linux-nvdimm, linux-xfs,
	Jan Kara, Arnd Bergmann, Darrick J. Wong, Linux API,
	Dave Chinner, J. Bruce Fields, Linux MM, Jeff Moyer, Al Viro,
	Andy Lutomirski, Ross Zwisler, linux-fsdevel, Jeff Layton,
	Linus Torvalds, Andrew Morton

On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
> I don't think that really represents how lots of apps actually use
> RDMA.
> 
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
> 
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
> 
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
> 

I suspect one of the more interesting use cases might be a file server,
for which that's not the case.  But otherwise I agree with the above,
and also think that notifying the MR handle is the only way to go for
another very important reason:  fencing.  What if the application/library
does not react to the notification?  With a per-MR notification we
can unregister the MR in kernel space and have a rock solid fencing
mechanism.  And that is the most important bit here.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-13 18:22                 ` Dan Williams
  (?)
@ 2017-10-16  7:30                   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-16  7:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-nvdimm, linux-xfs,
	Jan Kara, Arnd Bergmann, Darrick J. Wong, Linux API,
	Dave Chinner, J. Bruce Fields, Linux MM, Jeff Moyer, Al Viro,
	Andy Lutomirski, Ross Zwisler, linux-fsdevel, Jeff Layton,
	Linus Torvalds, Andrew Morton

On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

The problem isn't local protection errors, but remote protection errors
when we modify an MR with an rkey that the remote side accesses.

> >  - How lease break can be done hitlessly, so the library user never
> >    needs to know it is happening or see failed/missed transfers
> 
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

But systems that care about performance (e.g. the usual RDMA users) usually
don't use an IOMMU due to the performance impact.  Especially as HCAs
already have their own built-in IOMMUs (aka the MR mechanism).

Note that file systems already have a mechanism like the one you mention
above to keep extents that are busy from being reallocated.  E.g. take a
look at fs/xfs/xfs_extent_busy.c.  The downside is that this could lock down
a massive amount of space in the busy list if we for example have an MR
covering a huge file that is truncated down.  So even if we wanted that
scheme we'd need some sort of ulimit for the number of DAX pages locked
down in get_user_pages.
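
For what it's worth, a minimal sketch of what Dan's strawman would look
like to an application polling its completion queue.  Everything below is
existing libibverbs API; the only new behaviour would be the kernel
forcing such completions when file coherency is lost:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain the CQ and report failed work requests; under the strawman a
 * revoked file mapping would surface here as a protection error on the
 * work request that touched the affected MR. */
static void drain_cq(struct ibv_cq *cq)
{
	struct ibv_wc wc;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status == IBV_WC_SUCCESS)
			continue;
		fprintf(stderr, "wr %llu failed: %s\n",
			(unsigned long long)wc.wr_id,
			ibv_wc_status_str(wc.status));
		if (wc.status == IBV_WC_LOC_PROT_ERR ||
		    wc.status == IBV_WC_REM_ACCESS_ERR) {
			/* registration no longer valid: kick off whatever
			 * re-registration / reconnect recovery the
			 * application implements */
		}
	}
}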


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  2017-10-12 16:32       ` Linus Torvalds
  (?)
@ 2017-10-16  7:38         ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-16  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Arnd Bergmann, linux-nvdimm, Linux API,
	Christoph Hellwig, linux-xfs, linux-mm, Andy Lutomirski,
	linux-fsdevel, Andrew Morton

On Thu, Oct 12, 2017 at 09:32:17AM -0700, Linus Torvalds wrote:
> On Thu, Oct 12, 2017 at 6:51 AM, Jan Kara <jack@suse.cz> wrote:
> >
> > When thinking a bit more about this I've realized one problem: Currently
> > user can call mmap() with MAP_SHARED type and MAP_SYNC or MAP_DIRECT flags
> > and he will get the new semantics (if the kernel happens to support it).  I
> > think that is undesirable [..]
> 
> Why?
> 
> If you have a performance preference for MAP_DIRECT or something like
> that, but you don't want to *enforce* it, you'd use just plain
> MAP_SHARED with it.
> 
> Ie there may well be "I want this to work, possibly with downsides" issues.
> 
> So it seems to be a reasonable model, and disallowing it seems to
> limit people and not really help anything.

I don't think it matters for MAP_DIRECT (and I think we shouldn't have
MAP_DIRECT to start with, see the discussions later in the thread).

But for the main use case, MAP_SYNC, you really want a hard error when you
don't get it.  And while we could tell people that they should only use
MAP_SYNC with MAP_SHARED_VALIDATE instead of MAP_SHARED, the chances that
they get it wrong are extremely high.  On the other hand, if you really only
want the flag as an optimization, calling mmap twice is very little overhead
and a very good documentation of your intent:

	addr = mmap(...., MAP_SHARED_VALIDATE | MAP_DIRECT, ...);
	if (addr == MAP_FAILED && errno == EOPNOTSUPP) {
		/* MAP_DIRECT didn't work, we'll just cope using blah, blah */
		addr = mmap(...., MAP_SHARED, ...);
	}
	if (addr == MAP_FAILED)
		goto handle_error;

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  2017-10-12 16:32       ` Linus Torvalds
  (?)
@ 2017-10-16  7:56         ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2017-10-16  7:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, Arnd Bergmann, linux-nvdimm, Linux API,
	Christoph Hellwig, linux-xfs, linux-mm, Andy Lutomirski,
	linux-fsdevel, Andrew Morton

On Thu 12-10-17 09:32:17, Linus Torvalds wrote:
> On Thu, Oct 12, 2017 at 6:51 AM, Jan Kara <jack@suse.cz> wrote:
> >
> > When thinking a bit more about this I've realized one problem: Currently
> > user can call mmap() with MAP_SHARED type and MAP_SYNC or MAP_DIRECT flags
> > and he will get the new semantics (if the kernel happens to support it).  I
> > think that is undesirable [..]
> 
> Why?
> 
> If you have a performance preference for MAP_DIRECT or something like
> that, but you don't want to *enforce* it, you'd use just plain
> MAP_SHARED with it.
> 
> Ie there may well be "I want this to work, possibly with downsides" issues.
> 
> So it seems to be a reasonable model, and disallowing it seems to
> limit people and not really help anything.

I have two concerns:

1) IMHO it supports sloppy programming from userspace - if an application
asks e.g. for MAP_DIRECT and doesn't know whether it gets it or not, it
has to be very careful not to assume anything about that in its code. And
frankly I think the most likely scenario is that a programmer will just use
MAP_SHARED | MAP_DIRECT, *assume* he will get the MAP_DIRECT semantics if
the call does not fail, and then complain when his application breaks.

2) In theory there could be an application that inadvertently sets some high
flag bits and would now get confused by getting different mmap(2)
semantics. But I agree this is mostly theoretical.

Overall I think the benefit of being able to say "do MAP_DIRECT if you can"
does not outweigh the risk of bugs in userspace applications. Especially
since userspace can easily implement the same semantics by retrying the
mmap(2) call without MAP_SHARED_VALIDATE | MAP_DIRECT.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-14  1:57                   ` Jason Gunthorpe
@ 2017-10-16 12:02                     ` Sagi Grimberg
  -1 siblings, 0 replies; 116+ messages in thread
From: Sagi Grimberg @ 2017-10-16 12:02 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux MM, Al Viro, Andy Lutomirski, Jeff Layton,
	linux-fsdevel, Linus Torvalds, Christoph Hellwig


Hey folks, (chiming in very late here...)

>>> I think, if you want to build a uAPI for notification of MR lease
>>> break, then you need to show how it fits into the above software model:
>>>   - How it can be hidden in a RDMA specific library
>>
>> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
>> == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
>> the solution generic across DAX and non-DAX. What's your feeling for
>> how well applications are prepared to deal with that status return?
> 
> Stuffing an entry into the CQ is difficult. The CQ is in user memory
> and it is DMA'd from the HCA for several pieces of hardware, so the
> kernel can't just stuff something in there. It can be done
> with HW support by having the HCA DMA it via an exception path or
> something, but even then, you run into questions like CQ overflow and
> accounting issues since it is not meant for this.

But why should the kernel ever need to mangle the CQ? If a lease break
deregisters the MR, the device is expected to generate remote
protection errors on its own.

And in that case, I think we need a query mechanism rather than an event
mechanism, so that when the application starts seeing protection errors
it can query the relevant MR (I think most if not all devices have that
information in their internal completion queue entries).

> 
> So, you need a side channel of some kind, either in certain drivers or
> generically..
> 
>>>   - How lease break can be done hitlessly, so the library user never
>>>     needs to know it is happening or see failed/missed transfers

I agree that the application should not be aware of lease breakage, but
seeing failed transfers is perfectly acceptable given that an access
violation is happening (my assumption is that failed transfers are error
completions reported in the user completion queue). What we need is
a framework to help user-space recover sanely, which is to query
which MR had the access violation, restore it, and re-establish the queue
pair.

>>
>> iommu redirect should be hitless and behave like the page cache case
>> where RDMA targets pages that are no longer part of the file.
> 
> Yes, if the iommu can be fenced properly it sounds doable.
> 
>>>   - Whatever fast path checking is needed does not kill performance
>>
>> What do you consider a fast path? I was assuming that memory
>> registration is a slow path, and iommu operations are asynchronous so
>> should not impact performance of ongoing operations beyond typical
>> iommu overhead.
> 
> ibv_poll_cq() and ibv_post_send() would be a fast path.
> 
> Where this struggled before is in creating a side channel you also now
> have to check that side channel, and checking it at high performance
> is quite hard.. Even quiescing things to be able to tear down the MR
> has performance implications on post send...

This is exactly why I think we should not have it, but instead give
building blocks to recover sanely from error completions...
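
A rough sketch of the re-registration building block being argued for,
using only existing verbs calls.  The per-MR "which registration was
revoked" query does not exist today, so this assumes the application can
map a failed wr_id back to the MR it posted with; re-establishing the
queue pair afterwards (disconnect/reconnect, or driving the QP back
through INIT/RTR/RTS) is elided:

#include <infiniband/verbs.h>

/* Drop a revoked registration and register the same virtual range
 * again; after a lease break that range is backed by whatever pages
 * the filesystem left in place. */
static struct ibv_mr *rebuild_mr(struct ibv_pd *pd, struct ibv_mr *old,
				 void *addr, size_t len)
{
	if (old)
		ibv_dereg_mr(old);

	return ibv_reg_mr(pd, addr, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_READ |
			  IBV_ACCESS_REMOTE_WRITE);
}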


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-16  7:26                 ` Christoph Hellwig
  (?)
@ 2017-10-16 12:07                   ` Sagi Grimberg
  -1 siblings, 0 replies; 116+ messages in thread
From: Sagi Grimberg @ 2017-10-16 12:07 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: J. Bruce Fields, Jan Kara, Andrew Morton, Arnd Bergmann,
	linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-xfs, Linux MM, Jeff Layton, Al Viro, Andy Lutomirski,
	linux-fsdevel, Linus Torvalds


>> I don't think that really represents how lots of apps actually use
>> RDMA.
>>
>> RDMA is often buried down in the software stack (eg in a MPI), and by
>> the time a mapping gets used for RDMA transfer the link between the
>> FD, mmap and the MR is totally opaque.
>>
>> Having a MR specific notification means the low level RDMA libraries
>> have a chance to deal with everything for the app.
>>
>> Eg consider a HPC app using MPI that uses some DAX aware library to
>> get DAX backed mmap's. It then passes memory in those mmaps to the
>> MPI library to do transfers. The MPI creates the MR on demand.
>>
> 
> I suspect one of the more interesting use cases might be a file server,
> for which that's not the case.  But otherwise I agree with the above,
> and also think that notifying the MR handle is the only way to go for
> another very important reason:  fencing.  What if the application/library
> does not react to the notification?  With a per-MR notification we
> can unregister the MR in kernel space and have a rock solid fencing
> mechanism.  And that is the most important bit here.

I agree we must deregister the MR in kernel space. As said, I think
it's perfectly reasonable to let user-space see error completions and
provide a query mechanism at MR granularity (unfortunately this will
probably need driver assistance, as they know how their device reports
access violations at MR granularity).


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-16  7:26                 ` Christoph Hellwig
  (?)
@ 2017-10-16 17:43                   ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-16 17:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, linux-nvdimm, linux-xfs, Jan Kara,
	Arnd Bergmann, Darrick J. Wong, Linux API, Dave Chinner,
	J. Bruce Fields, Linux MM, Jeff Moyer, Al Viro, Andy Lutomirski,
	Ross Zwisler, linux-fsdevel, Jeff Layton, Linus Torvalds,
	Andrew Morton

On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>> I don't think that really represents how lots of apps actually use
>> RDMA.
>>
>> RDMA is often buried down in the software stack (eg in a MPI), and by
>> the time a mapping gets used for RDMA transfer the link between the
>> FD, mmap and the MR is totally opaque.
>>
>> Having a MR specific notification means the low level RDMA libraries
>> have a chance to deal with everything for the app.
>>
>> Eg consider a HPC app using MPI that uses some DAX aware library to
>> get DAX backed mmap's. It then passes memory in those mmaps to the
>> MPI library to do transfers. The MPI creates the MR on demand.
>>
>
> I suspect one of the more interesting use cases might be a file server,
> for which that's not the case.  But otherwise I agree with the above,
> and also think that notifying the MR handle is the only way to go for
> another very important reason:  fencing.  What if the application/library
> does not react to the notification?  With a per-MR notification we
> can unregister the MR in kernel space and have a rock solid fencing
> mechanism.  And that is the most important bit here.

While I agree with the need for a per-MR notification mechanism, one
thing we lose by walking away from MAP_DIRECT is a way for a
hypervisor to coordinate pass-through of a DAX mapping to an RDMA
device in a guest. That will remain a case where we will still need to
use device-dax. I'm fine if that's the answer, but I just want to be
clear about all the places where we need to protect a DAX mapping against
RDMA from a non-ODP device.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-16 17:43                   ` Dan Williams
  (?)
@ 2017-10-16 19:44                     ` Dan Williams
  -1 siblings, 0 replies; 116+ messages in thread
From: Dan Williams @ 2017-10-16 19:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, Jan Kara, Andy Lutomirski, Arnd Bergmann,
	Darrick J. Wong, Linux API, linux-nvdimm, Dave Chinner,
	Andrew Morton, Jason Gunthorpe, Linux MM, Al Viro,
	J. Bruce Fields, linux-fsdevel, Linus Torvalds, Jeff Layton

On Mon, Oct 16, 2017 at 10:43 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig <hch@lst.de> wrote:
>> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>>> I don't think that really represents how lots of apps actually use
>>> RDMA.
>>>
>>> RDMA is often buried down in the software stack (eg in a MPI), and by
>>> the time a mapping gets used for RDMA transfer the link between the
>>> FD, mmap and the MR is totally opaque.
>>>
>>> Having a MR specific notification means the low level RDMA libraries
>>> have a chance to deal with everything for the app.
>>>
>>> Eg consider a HPC app using MPI that uses some DAX aware library to
>>> get DAX backed mmap's. It then passes memory in those mmaps to the
>>> MPI library to do transfers. The MPI creates the MR on demand.
>>>
>>
>> I suspect one of the more interesting use cases might be a file server,
>> for which that's not the case.  But otherwise I agree with the above,
>> and also think that notifying the MR handle is the only way to go for
>> another very important reason:  fencing.  What if the application/library
>> does not react to the notification?  With a per-MR notification we
>> can unregister the MR in kernel space and have a rock solid fencing
>> mechanism.  And that is the most important bit here.
>
> While I agree with the need for a per-MR notification mechanism, one
> thing we lose by walking away from MAP_DIRECT is a way for a
> hypervisor to coordinate pass through of a DAX mapping to an RDMA
> device in a guest. That will remain a case where we will still need to
> use device-dax. I'm fine if that's the answer, but just want to be
> clear about all the places we need to protect a DAX mapping against
> RDMA from a non-ODP device.

For this specific issue perhaps we could promote FL_LAYOUT to a lease
type that can be set by fcntl().

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-16 19:44                     ` Dan Williams
@ 2017-10-17  6:46                       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-17  6:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, Andy Lutomirski, Arnd Bergmann,
	linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	Andrew Morton, Jason Gunthorpe, Linux MM, Al Viro,
	J. Bruce Fields, Jeff Layton, linux-fsdevel, Linus Torvalds,
	Christoph Hellwig

On Mon, Oct 16, 2017 at 12:44:31PM -0700, Dan Williams wrote:
> > While I agree with the need for a per-MR notification mechanism, one
> > thing we lose by walking away from MAP_DIRECT is a way for a
> > hypervisor to coordinate pass through of a DAX mapping to an RDMA
> > device in a guest. That will remain a case where we will still need to
> > use device-dax. I'm fine if that's the answer, but just want to be
> > clear about all the places we need to protect a DAX mapping against
> > RDMA from a non-ODP device.
> 
> For this specific issue perhaps we promote FL_LAYOUT as a lease-type
> that can be set by fcntl().

I don't think it is a good userspace interface, mostly because it
is about things that don't matter for userspace (block mappings).

It makes sense as a kernel interface for callers that want to pin
down memory long-term, but for userspace the fact that the block
mapping changes doesn't matter - what matters is that their long-term
pin is broken by something.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
  2017-10-16 12:02                     ` Sagi Grimberg
@ 2017-10-19  6:02                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 116+ messages in thread
From: Jason Gunthorpe @ 2017-10-19  6:02 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-xfs, Jan Kara, Arnd Bergmann, Darrick J. Wong, Linux API,
	linux-nvdimm, Dave Chinner, Christoph Hellwig, J. Bruce Fields,
	Linux MM, Jeff Layton, Al Viro, Andy Lutomirski, linux-fsdevel,
	Linus Torvalds, Andrew Morton

On Mon, Oct 16, 2017 at 03:02:52PM +0300, Sagi Grimberg wrote:
> But why should the kernel ever need to mangle the CQ? If a lease break
> would deregister the MR, the device is expected to generate remote
> protection errors on its own.

The point is to avoid protection errors - a hitless changeover when the
DAX mapping changes, like ODP does.

The only way to get there is to notify the app before the mappings
change. Dan suggested having ibv_poll_cq return this indication.
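
Roughly, that flow might look like the sketch below. This is purely
illustrative: IBV_WC_LEASE_BREAK is a made-up status standing in for
whatever indication ibv_poll_cq() would actually return before the
mapping changes; only ibv_poll_cq() and ibv_dereg_mr() are real verbs
calls here.

#include <infiniband/verbs.h>

/* hypothetical status value, not part of the verbs API today */
#define IBV_WC_LEASE_BREAK	0x7f

static void poll_loop(struct ibv_cq *cq, struct ibv_mr *mr)
{
	struct ibv_wc wc;
	int n;

	for (;;) {
		n = ibv_poll_cq(cq, 1, &wc);
		if (n < 0)
			break;		/* CQ error */
		if (n == 0)
			continue;	/* nothing completed yet */

		if ((int)wc.status == IBV_WC_LEASE_BREAK) {
			/* quiesce outstanding I/O, then drop the pin */
			ibv_dereg_mr(mr);
			break;
		}

		/* normal completion handling ... */
	}
}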

Jason

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2017-10-19  6:03 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-12  0:47 [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush Dan Williams
2017-10-12  0:47 ` [PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags Dan Williams
2017-10-12 13:51   ` Jan Kara
2017-10-12 16:32     ` Linus Torvalds
2017-10-16  7:38       ` Christoph Hellwig
2017-10-16  7:56       ` Jan Kara
2017-10-12  0:47 ` [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate() Dan Williams
2017-10-12  1:21   ` Al Viro
2017-10-12  1:28     ` Dan Williams
2017-10-12  2:17       ` Dan Williams
2017-10-12  3:44         ` Dan Williams
2017-10-12  0:47 ` [PATCH v9 3/6] fs: MAP_DIRECT core Dan Williams
2017-10-12  0:47 ` [PATCH v9 4/6] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT Dan Williams
2017-10-12  0:47 ` [PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait() Dan Williams
2017-10-12  0:47 ` [PATCH v9 6/6] xfs: wire up MAP_DIRECT Dan Williams
2017-10-12 14:23 ` [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush Christoph Hellwig
2017-10-12 17:41   ` Dan Williams
2017-10-13  6:57     ` Christoph Hellwig
2017-10-13 15:14       ` Dan Williams
2017-10-13 16:38         ` Jason Gunthorpe
2017-10-13 17:01           ` Dan Williams
2017-10-13 17:31             ` Jason Gunthorpe
2017-10-13 18:22               ` Dan Williams
2017-10-14  1:57                 ` Jason Gunthorpe
2017-10-16 12:02                   ` Sagi Grimberg
2017-10-19  6:02                     ` Jason Gunthorpe
2017-10-16  7:30                 ` Christoph Hellwig
2017-10-16  7:26               ` Christoph Hellwig
2017-10-16 12:07                 ` Sagi Grimberg
2017-10-16 17:43                 ` Dan Williams
2017-10-16 19:44                   ` Dan Williams
2017-10-17  6:46                     ` Christoph Hellwig
2017-10-16  7:22           ` Christoph Hellwig
