Linux-RDMA Archive on lore.kernel.org
* [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002   ;-)
@ 2019-08-09 22:58 ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space ira.weiny
                   ` (19 more replies)
  0 siblings, 20 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Pre-requisites
==============
	Based on mmotm tree.

This series is based on the feedback from LSFmm, the LWN article, the RFC
series since then, and a ton of scenarios I've worked through in my mind
and/or tested. [1]

Solution summary
================

The real issue is that there is no use case for a user to have RDMA pinned
memory which is then truncated.  So really any solution we present which:

A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong

Should be an acceptable solution.

Because this is slightly new behavior, and because it is going to be specific
to DAX (due to the lack of a page cache), we have made the user "opt in" to
this behavior.

The patches in this series implement the following solution.

0) Registrations to Device DAX char devs are not affected

1) The user has to opt in to allowing page pins on a file with an exclusive
   layout lease.  Both exclusive and layout lease flags are user visible now.

2) Page pins will fail if the lease is not active when the file-backed page
   is encountered.

3) Any truncate or hole punch operation on a pinned DAX page will fail.

4) The user has the option of holding the lease or releasing it.  If they
   release it no other pin calls will work on the file.

5) Closing the file is ok.

6) Unmapping the file is ok.

7) Pins against the files are tracked back to an owning file or an owning mm
   depending on the internal subsystem needs.  With RDMA there is an owning
   file which is related to the pinned file.

8) Only RDMA is currently supported.

9) Truncation of pages which are not actively pinned nor covered by a lease
   will succeed.
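
A minimal user-space sketch of the opt-in in rule 1.  It assumes the lease is
requested through fcntl(F_SETLEASE), the path that do_fcntl_add_lease() serves,
and it uses the F_LAYOUT and F_EXCLUSIVE values proposed in patches 01 and 02.
Those values are not in any released uapi header, and the helper names are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* F_SETLEASE is a Linux extension */
#endif
#include <fcntl.h>

/* Fallback in case the libc headers were pulled in without _GNU_SOURCE. */
#ifndef F_SETLEASE
#define F_SETLEASE	1024
#endif

/* Values proposed by this RFC; not part of any released uapi header. */
#ifndef F_LAYOUT
#define F_LAYOUT	16
#endif
#ifndef F_EXCLUSIVE
#define F_EXCLUSIVE	32
#endif

/* Build the F_SETLEASE argument for the opt-in lease (rule 1). */
static long layout_lease_arg(int exclusive)
{
	return exclusive ? (F_LAYOUT | F_EXCLUSIVE) : F_LAYOUT;
}

/*
 * Hypothetical helper: take the exclusive layout lease on @fd before
 * registering memory backed by that file.  Returns 0 on success or -1 with
 * errno set; a kernel without this series is expected to reject the argument.
 */
static int take_exclusive_layout_lease(int fd)
{
	return fcntl(fd, F_SETLEASE, layout_lease_arg(1));
}
```

An application would call take_exclusive_layout_lease() on the DAX file before
asking the RDMA stack to register a mapping of it, and treat a failure as "long
term pins are not available on this file".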


Reporting of pinned files in procfs
===================================

A number of alternatives were explored for how to report the file pins within
procfs.  The following incorporates ideas from Jan Kara, Jason Gunthorpe, Dave
Chinner, Dan Williams and myself.

A new entry is added to procfs

/proc/<pid>/file_pins

For processes which have pinned DAX file memory, file_pins references come in 2
flavors: those which are attached to another open file descriptor (for example,
what is done in the RDMA subsystem) and those which are attached to a process
mm.

For those which are attached to another open file descriptor (such as RDMA)
the file pin references go through the 'struct file' associated with that pin.
In RDMA this is the RDMA context struct file.

The resulting output from procfs is something like:

$ cat /proc/<pid>/file_pins
3: /dev/infiniband/uverbs0
	/mnt/pmem/foo

Where '3' is the file descriptor (and file path) of the RDMA context within the
process.  The paths of the files pinned using that context are then listed.

RDMA contexts may have multiple MRs, each of which may have multiple files
pinned within them.  So an output like the following is possible.

$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
	/mnt/pmem/foo
	/mnt/pmem/bar
	/mnt/pmem/another
	/mnt/pmem/one

The actual memory regions associated with the file pins are not reported.

For processes which are pinning memory which is not associated with a specific
file descriptor, memory pins are reported directly as paths to the file.

$ cat /proc/<pid>/file_pins
/mnt/pmem/foo

Putting the above together, if a process was using RDMA and another subsystem,
the output could be something like:


$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
	/mnt/pmem/foo
	/mnt/pmem/bar
	/mnt/pmem/another
	/mnt/pmem/one
/mnt/pmem/foo
/mnt/pmem/another
/mnt/pmem/mm_mapped_file
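
A consumer of this file could tell the three kinds of lines apart as sketched
below.  This is only an illustration based on the sample output above: the
format is still "something like" that output, and the enum and function names
are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

enum pin_line {
	PIN_LINE_CONTEXT,	/* "<fd>: <path>", the owning file (RDMA context) */
	PIN_LINE_FILE_PIN,	/* "\t<path>", a file pinned through that context */
	PIN_LINE_MM_PIN,	/* "<path>", a file pinned on behalf of the mm */
	PIN_LINE_EMPTY,
};

/* Classify one line of /proc/<pid>/file_pins as shown in the examples. */
static enum pin_line classify_pin_line(const char *line)
{
	size_t i = 0;

	if (line[0] == '\0' || line[0] == '\n')
		return PIN_LINE_EMPTY;
	if (line[0] == '\t')
		return PIN_LINE_FILE_PIN;

	/* A context line starts with one or more digits followed by ':'. */
	while (isdigit((unsigned char)line[i]))
		i++;
	if (i > 0 && line[i] == ':')
		return PIN_LINE_CONTEXT;

	return PIN_LINE_MM_PIN;
}
```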


[1] https://lkml.org/lkml/2019/6/5/1046


Background
==========

It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP).  There are 2 big reasons this may not work.

	1) The hardware being used for RDMA may not support ODP
	2) ODP may be detrimental to the overall network (cluster or cloud)
	   performance

Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.

1) "longterm" GUP users need to inform other subsystems that they have taken a
   pin on a page which may remain pinned for a very "long time".  The
   definition of long time is debatable but it has been established that RDMA's
   use of pages for minutes, hours, or even days after the pin is the extreme
   case which makes this problem most severe.

2) Any page which is "controlled" by a file system needs to have special
   handling.  The details of the handling depend on whether the page is page
   cache fronted or not.

   2a) A page cache fronted page which has been pinned by GUP long term can use a
   bounce buffer to allow the file system to write back snap shots of the page.
   This is handled by the FS recognizing the GUP long term pin and making a copy
   of the page to be written back.
	NOTE: this patch set does not address this path.

   2b) A FS "controlled" page which is not page cache fronted is either easier
   to deal with or harder depending on the operation the filesystem is trying
   to do.

	2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
	FS can no longer use the pages in question until the pin has been
	removed.  This patch set presents a solution to this by introducing
	some reasonable restrictions on user space applications.

	2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
	then there is nothing which need be done.  Data is Read or Written
	directly to the page.  This is an easy case which would currently work
	if not for GUP long term pins being disabled.  Therefore this patch set
	need not change access to the file data but does allow for GUP pins
	after 2ba above is dealt with.


This patch series presents a solution for problem 2ba).

Ira Weiny (19):
  fs/locks: Export F_LAYOUT lease to user space
  fs/locks: Add Exclusive flag to user Layout lease
  mm/gup: Pass flags down to __gup_device_huge* calls
  mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
  fs/ext4: Teach ext4 to break layout leases
  fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  fs/xfs: Teach xfs to use new dax_layout_busy_page()
  fs/xfs: Fail truncate if page lease can't be broken
  mm/gup: Introduce vaddr_pin structure
  mm/gup: Pass a NULL vaddr_pin through GUP fast
  mm/gup: Pass follow_page_context further down the call stack
  mm/gup: Prep put_user_pages() to take an vaddr_pin struct
  {mm,file}: Add file_pins objects
  fs/locks: Associate file pins while performing GUP
  mm/gup: Introduce vaddr_pin_pages()
  RDMA/uverbs: Add back pointer to system file object
  RDMA/umem: Convert to vaddr_[pin|unpin]* operations.
  {mm,procfs}: Add display file_pins proc
  mm/gup: Remove FOLL_LONGTERM DAX exclusion

 drivers/infiniband/core/umem.c        |  26 +-
 drivers/infiniband/core/umem_odp.c    |  16 +-
 drivers/infiniband/core/uverbs.h      |   1 +
 drivers/infiniband/core/uverbs_main.c |   1 +
 fs/Kconfig                            |   1 +
 fs/dax.c                              |  38 ++-
 fs/ext4/ext4.h                        |   2 +-
 fs/ext4/extents.c                     |   6 +-
 fs/ext4/inode.c                       |  26 +-
 fs/file_table.c                       |   4 +
 fs/locks.c                            | 291 +++++++++++++++++-
 fs/proc/base.c                        | 214 +++++++++++++
 fs/xfs/xfs_file.c                     |  21 +-
 fs/xfs/xfs_inode.h                    |   5 +-
 fs/xfs/xfs_ioctl.c                    |  15 +-
 fs/xfs/xfs_iops.c                     |  14 +-
 include/linux/dax.h                   |  12 +-
 include/linux/file.h                  |  49 +++
 include/linux/fs.h                    |   5 +-
 include/linux/huge_mm.h               |  17 --
 include/linux/mm.h                    |  69 +++--
 include/linux/mm_types.h              |   2 +
 include/rdma/ib_umem.h                |   2 +-
 include/uapi/asm-generic/fcntl.h      |   5 +
 kernel/fork.c                         |   3 +
 mm/gup.c                              | 418 ++++++++++++++++----------
 mm/huge_memory.c                      |  18 +-
 mm/internal.h                         |  28 ++
 28 files changed, 1048 insertions(+), 261 deletions(-)

-- 
2.20.1



* [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 23:52   ` Dave Chinner
  2019-08-09 22:58 ` [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease ira.weiny
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

In order to support an opt-in policy for users to allow long term pins
of FS DAX pages we need to export the LAYOUT lease to user space.

This is the first of 2 new lease flags which must be used to allow a
long term pin to be made on a file.

After the complete series:

0) Registrations to Device DAX char devs are not affected

1) The user has to opt in to allowing page pins on a file with an exclusive
   layout lease.  Both exclusive and layout lease flags are user visible now.

2) Page pins will fail if the lease is not active when the file-backed page
   is encountered.

3) Any truncate or hole punch operation on a pinned DAX page will fail.

4) The user has the option of holding the lease or releasing it.  If they
   release it no other pin calls will work on the file.

5) Closing the file is ok.

6) Unmapping the file is ok.

7) Pins against the files are tracked back to an owning file or an owning mm
   depending on the internal subsystem needs.  With RDMA there is an owning
   file which is related to the pinned file.

8) Only RDMA is currently supported.

9) Truncation of pages which are not actively pinned nor covered by a lease
   will succeed.
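
As a reading aid, the effect of this patch on its own can be modelled in plain
C as below.  This is a stand-alone sketch, not kernel code: the M_ prefixed
constants and the model_* names are hypothetical, and the numeric values are
illustrative apart from F_LAYOUT (16) and FL_LAYOUT (2048), which come from the
hunks in this patch.

```c
/* User-visible lease arguments; M_F_LAYOUT is the value added here. */
enum { M_F_RDLCK = 0, M_F_WRLCK = 1, M_F_UNLCK = 2, M_F_LAYOUT = 16 };

/* Kernel-internal lock flags; the M_FL_LEASE value is illustrative. */
enum { M_FL_LEASE = 32, M_FL_LAYOUT = 2048 };

struct model_lease {
	long type;		/* stands in for fl_type */
	unsigned int flags;	/* stands in for fl_flags */
};

/* Mirrors the argument rewrite in do_fcntl_add_lease() and lease_init(). */
static struct model_lease model_add_lease(long arg)
{
	struct model_lease fl = { .type = arg, .flags = M_FL_LEASE };

	if (arg == M_F_LAYOUT) {
		fl.type = M_F_RDLCK;	/* layout leases are read-level leases */
		fl.flags |= M_FL_LAYOUT;
	}
	return fl;
}

/* Mirrors the new branch in target_leasetype(). */
static long model_target_leasetype(const struct model_lease *fl)
{
	if (fl->flags & M_FL_LAYOUT)
		return M_F_LAYOUT;
	return fl->type;
}
```

The point of the model is that a layout lease is stored as a read lease plus a
flag, yet is reported back to the holder as F_LAYOUT.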

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/locks.c                       | 36 +++++++++++++++++++++++++++-----
 include/linux/fs.h               |  2 +-
 include/uapi/asm-generic/fcntl.h |  3 +++
 3 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 24d1db632f6c..ad17c6ffca06 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -191,6 +191,8 @@ static int target_leasetype(struct file_lock *fl)
 		return F_UNLCK;
 	if (fl->fl_flags & FL_DOWNGRADE_PENDING)
 		return F_RDLCK;
+	if (fl->fl_flags & FL_LAYOUT)
+		return F_LAYOUT;
 	return fl->fl_type;
 }
 
@@ -611,7 +613,8 @@ static const struct lock_manager_operations lease_manager_ops = {
 /*
  * Initialize a lease, use the default lock manager operations
  */
-static int lease_init(struct file *filp, long type, struct file_lock *fl)
+static int lease_init(struct file *filp, long type, unsigned int flags,
+		      struct file_lock *fl)
 {
 	if (assign_type(fl, type) != 0)
 		return -EINVAL;
@@ -621,6 +624,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
 
 	fl->fl_file = filp;
 	fl->fl_flags = FL_LEASE;
+	if (flags & FL_LAYOUT)
+		fl->fl_flags |= FL_LAYOUT;
 	fl->fl_start = 0;
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
@@ -629,7 +634,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
 }
 
 /* Allocate a file_lock initialised to this type of lease */
-static struct file_lock *lease_alloc(struct file *filp, long type)
+static struct file_lock *lease_alloc(struct file *filp, long type,
+				     unsigned int flags)
 {
 	struct file_lock *fl = locks_alloc_lock();
 	int error = -ENOMEM;
@@ -637,7 +643,7 @@ static struct file_lock *lease_alloc(struct file *filp, long type)
 	if (fl == NULL)
 		return ERR_PTR(error);
 
-	error = lease_init(filp, type, fl);
+	error = lease_init(filp, type, flags, fl);
 	if (error) {
 		locks_free_lock(fl);
 		return ERR_PTR(error);
@@ -1583,7 +1589,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 	int want_write = (mode & O_ACCMODE) != O_RDONLY;
 	LIST_HEAD(dispose);
 
-	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK);
+	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK, 0);
 	if (IS_ERR(new_fl))
 		return PTR_ERR(new_fl);
 	new_fl->fl_flags = type;
@@ -1720,6 +1726,8 @@ EXPORT_SYMBOL(lease_get_mtime);
  *
  *	%F_UNLCK to indicate no lease is held.
  *
+ *	%F_LAYOUT to indicate a layout lease is held.
+ *
  *	(if a lease break is pending):
  *
  *	%F_RDLCK to indicate an exclusive lease needs to be
@@ -2022,8 +2030,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
 	struct file_lock *fl;
 	struct fasync_struct *new;
 	int error;
+	unsigned int flags = 0;
+
+	/*
+	 * NOTE on F_LAYOUT lease
+	 *
+	 * LAYOUT lease types are taken on files which the user knows that
+	 * they will be pinning in memory for some indeterminate amount of
+	 * time.  Such as for use with RDMA.  While we don't know what user
+	 * space is going to do with the file we still use a F_RDLCK level of
+	 * lease.  This ensures that there are no conflicts between
+	 * 2 users.  The conflict should only come from the File system wanting
+	 * to revoke the lease in break_layout()  And this is done by using
+	 * F_WRLCK in the break code.
+	 */
+	if (arg == F_LAYOUT) {
+		arg = F_RDLCK;
+		flags = FL_LAYOUT;
+	}
 
-	fl = lease_alloc(filp, arg);
+	fl = lease_alloc(filp, arg, flags);
 	if (IS_ERR(fl))
 		return PTR_ERR(fl);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 046108cd4ed9..dd60d5be9886 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1004,7 +1004,7 @@ static inline struct file *get_file(struct file *f)
 #define FL_DOWNGRADE_PENDING	256 /* Lease is being downgraded */
 #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
-#define FL_LAYOUT	2048	/* outstanding pNFS layout */
+#define FL_LAYOUT	2048	/* outstanding pNFS layout or user held pin */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..baddd54f3031 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -174,6 +174,9 @@ struct f_owner_ex {
 #define F_SHLCK		8	/* or 4 */
 #endif
 
+#define F_LAYOUT	16      /* layout lease to allow longterm pins such as
+				   RDMA */
+
 /* operations for bsd flock(), also used by the kernel implementation */
 #define LOCK_SH		1	/* shared lock */
 #define LOCK_EX		2	/* exclusive lock */
-- 
2.20.1



* [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-14 14:15   ` Jeff Layton
  2019-09-04 23:12   ` John Hubbard
  2019-08-09 22:58 ` [RFC PATCH v2 03/19] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Add an exclusive lease flag which indicates that the layout lease can not be
broken.

Exclusive layout leases allow the file system to know that pages may be
GUP pinned and that attempts to change the layout, i.e. truncate, should
fail.

A process which attempts to break its own exclusive lease gets an
EDEADLOCK return to help determine that this is likely a programming bug
rather than someone else holding a resource.
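
The error selection described above can be modelled in user-space C as below.
This is a simplified sketch of the check added to __break_lease(): every lease
in the array is assumed to conflict with the break request, and struct
brk_lease and model_break_exclusive() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

#ifndef EDEADLOCK
#define EDEADLOCK EDEADLK	/* fallback for libcs without EDEADLOCK */
#endif

#define M_FL_EXCLUSIVE	4096	/* FL_EXCLUSIVE value from this patch */

struct brk_lease {
	int pid;		/* stands in for fl_pid */
	unsigned int flags;	/* stands in for fl_flags */
};

/* Returns 0 if no exclusive lease blocks the break, else a negative errno. */
static int model_break_exclusive(const struct brk_lease *leases, size_t n,
				 int breaker_pid)
{
	int error = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(leases[i].flags & M_FL_EXCLUSIVE))
			continue;
		/* Someone holds an exclusive lease: the break cannot succeed. */
		error = -ETXTBSY;
		/* The breaker holds it itself: report the likely bug at once. */
		if (leases[i].pid == breaker_pid)
			return -EDEADLOCK;
	}
	return error;
}
```

As in the patch, the loop keeps scanning after the first -ETXTBSY so that a
self-held lease later in the list is still reported as -EDEADLOCK.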

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/locks.c                       | 23 +++++++++++++++++++++--
 include/linux/fs.h               |  1 +
 include/uapi/asm-generic/fcntl.h |  2 ++
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index ad17c6ffca06..0c7359cdab92 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -626,6 +626,8 @@ static int lease_init(struct file *filp, long type, unsigned int flags,
 	fl->fl_flags = FL_LEASE;
 	if (flags & FL_LAYOUT)
 		fl->fl_flags |= FL_LAYOUT;
+	if (flags & FL_EXCLUSIVE)
+		fl->fl_flags |= FL_EXCLUSIVE;
 	fl->fl_start = 0;
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
@@ -1619,6 +1621,14 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 	list_for_each_entry_safe(fl, tmp, &ctx->flc_lease, fl_list) {
 		if (!leases_conflict(fl, new_fl))
 			continue;
+		if (fl->fl_flags & FL_EXCLUSIVE) {
+			error = -ETXTBSY;
+			if (new_fl->fl_pid == fl->fl_pid) {
+				error = -EDEADLOCK;
+				goto out;
+			}
+			continue;
+		}
 		if (want_write) {
 			if (fl->fl_flags & FL_UNLOCK_PENDING)
 				continue;
@@ -1634,6 +1644,13 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 			locks_delete_lock_ctx(fl, &dispose);
 	}
 
+	/* We differentiate between -EDEADLOCK and -ETXTBSY so the above loop
+	 * continues with -ETXTBSY looking for a potential deadlock instead.
+	 * If deadlock is not found go ahead and return -ETXTBSY.
+	 */
+	if (error == -ETXTBSY)
+		goto out;
+
 	if (list_empty(&ctx->flc_lease))
 		goto out;
 
@@ -2044,9 +2061,11 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
 	 * to revoke the lease in break_layout()  And this is done by using
 	 * F_WRLCK in the break code.
 	 */
-	if (arg == F_LAYOUT) {
+	if ((arg & F_LAYOUT) == F_LAYOUT) {
+		if ((arg & F_EXCLUSIVE) == F_EXCLUSIVE)
+			flags |= FL_EXCLUSIVE;
 		arg = F_RDLCK;
-		flags = FL_LAYOUT;
+		flags |= FL_LAYOUT;
 	}
 
 	fl = lease_alloc(filp, arg, flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd60d5be9886..2e41ce547913 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1005,6 +1005,7 @@ static inline struct file *get_file(struct file *f)
 #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
 #define FL_LAYOUT	2048	/* outstanding pNFS layout or user held pin */
+#define FL_EXCLUSIVE	4096	/* Layout lease is exclusive */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index baddd54f3031..88b175ceccbc 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -176,6 +176,8 @@ struct f_owner_ex {
 
 #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
 				   RDMA */
+#define F_EXCLUSIVE	32      /* layout lease is exclusive */
+				/* FIXME or should this be F_EXLCK??? */
 
 /* operations for bsd flock(), also used by the kernel implementation */
 #define LOCK_SH		1	/* shared lock */
-- 
2.20.1



* [RFC PATCH v2 03/19] mm/gup: Pass flags down to __gup_device_huge* calls
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 04/19] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

In order to support checking for a layout lease on a FS DAX inode these
calls need to know if FOLL_LONGTERM was specified.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 mm/gup.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b6a293bf1267..80423779a50a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,7 +1881,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+		unsigned long end, struct page **pages, int *nr,
+		unsigned int flags)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1907,30 +1908,33 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+		unsigned long end, struct page **pages, int *nr,
+		unsigned int flags)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
 		undo_dev_pagemap(nr, nr_start, pages);
 		return 0;
 	}
+
 	return 1;
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+		unsigned long end, struct page **pages, int *nr,
+		unsigned int flags)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1941,14 +1945,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+		unsigned long end, struct page **pages, int *nr,
+		unsigned int flags)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+		unsigned long end, struct page **pages, int *nr,
+		unsigned int flags)
 {
 	BUILD_BUG();
 	return 0;
@@ -2051,7 +2057,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr,
+					     flags);
 	}
 
 	refs = 0;
@@ -2092,7 +2099,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr,
+					     flags);
 	}
 
 	refs = 0;
-- 
2.20.1



* [RFC PATCH v2 04/19] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (2 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 03/19] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 05/19] fs/ext4: Teach ext4 to break layout leases ira.weiny
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

On FS DAX files, users must inform the file system that they intend to take
long term GUP pins on the file pages.  Failure to do so should result in
an error.

Ensure that a F_LAYOUT lease exists at the time the GUP call is made.
If not, return EPERM.
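
The check can be pictured with the stand-alone model below.  It is a sketch of
what mapping_inode_has_layout() looks for, not kernel code: struct pin_lease
and the model_* functions are hypothetical, and locking is left out.  Note
that only the follow_page paths in this patch return -EPERM; the GUP fast
paths simply decline to pin.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define M_FL_LAYOUT	2048	/* FL_LAYOUT */
#define M_FL_EXCLUSIVE	4096	/* FL_EXCLUSIVE from patch 02 */

struct pin_lease {
	int pid;		/* stands in for fl_pid */
	unsigned int flags;	/* stands in for fl_flags */
};

/* True if @tgid holds a lease that is both a layout lease and exclusive. */
static bool model_has_layout(const struct pin_lease *leases, size_t n,
			     int tgid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (leases[i].pid == tgid &&
		    (leases[i].flags & M_FL_LAYOUT) &&
		    (leases[i].flags & M_FL_EXCLUSIVE))
			return true;
	}
	return false;
}

/* Outcome of a long term pin attempt on an FS DAX page: 0 or -EPERM. */
static int model_longterm_pin(const struct pin_lease *leases, size_t n,
			      int tgid)
{
	return model_has_layout(leases, n, tgid) ? 0 : -EPERM;
}
```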

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from RFC v1:

    The old version had remnants of when GUP was going to take the lease
    for the user.  Remove this prototype code.
    Fix issue in gup_device_huge which was setting page reference prior
    to checking for Layout Lease
    Re-base to 5.3+
    Clean up htmldoc comments

 fs/locks.c         | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 mm/gup.c           | 23 +++++++++++++++++++++++
 mm/huge_memory.c   | 12 ++++++++++++
 4 files changed, 84 insertions(+)

diff --git a/fs/locks.c b/fs/locks.c
index 0c7359cdab92..14892c84844b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2971,3 +2971,50 @@ static int __init filelock_init(void)
 	return 0;
 }
 core_initcall(filelock_init);
+
+/**
+ * mapping_inode_has_layout - ensure a file mapped page has a layout lease
+ * taken
+ * @page: page we are trying to GUP
+ *
+ * This should only be called on DAX pages.  DAX pages which are mapped through
+ * FS DAX do not use the page cache.  As a result they require the user to take
+ * a LAYOUT lease on them prior to being able to pin them for longterm use.
+ * This allows the user to opt into the fact that truncation operations will
+ * fail for the duration of the pin.
+ *
+ * Return true if the page has a LAYOUT lease associated with its file.
+ */
+bool mapping_inode_has_layout(struct page *page)
+{
+	bool ret = false;
+	struct inode *inode;
+	struct file_lock *fl;
+
+	if (WARN_ON(!page) ||
+	    WARN_ON(PageAnon(page)) ||
+	    WARN_ON(!page->mapping) ||
+	    WARN_ON(!page->mapping->host))
+		return false;
+
+	inode = page->mapping->host;
+
+	smp_mb();
+	if (inode->i_flctx &&
+	    !list_empty_careful(&inode->i_flctx->flc_lease)) {
+		spin_lock(&inode->i_flctx->flc_lock);
+		ret = false;
+		list_for_each_entry(fl, &inode->i_flctx->flc_lease, fl_list) {
+			if (fl->fl_pid == current->tgid &&
+			    (fl->fl_flags & FL_LAYOUT) &&
+			    (fl->fl_flags & FL_EXCLUSIVE)) {
+				ret = true;
+				break;
+			}
+		}
+		spin_unlock(&inode->i_flctx->flc_lock);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mapping_inode_has_layout);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad6766a08f9b..04f22722b374 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1583,6 +1583,8 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
+bool mapping_inode_has_layout(struct page *page);
+
 /* Container for pinned pfns / pages */
 struct frame_vector {
 	unsigned int nr_allocated;	/* Number of frames we have space for */
diff --git a/mm/gup.c b/mm/gup.c
index 80423779a50a..0b05e22ac05f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -221,6 +221,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			page = pte_page(pte);
 		else
 			goto no_page;
+
+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
+		    !mapping_inode_has_layout(page)) {
+			page = ERR_PTR(-EPERM);
+			goto out;
+		}
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
@@ -1847,6 +1854,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		if (pte_devmap(pte) &&
+		    unlikely(flags & FOLL_LONGTERM) &&
+		    pgmap->type == MEMORY_DEVICE_FS_DAX &&
+		    !mapping_inode_has_layout(head)) {
+			put_user_page(head);
+			goto pte_unmap;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
@@ -1895,6 +1910,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			undo_dev_pagemap(nr, nr_start, pages);
 			return 0;
 		}
+
+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    pgmap->type == MEMORY_DEVICE_FS_DAX &&
+		    !mapping_inode_has_layout(page)) {
+			undo_dev_pagemap(nr, nr_start, pages);
+			return 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		get_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1334ede667a8..bc1a07a55be1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -953,6 +953,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
+
+	if (unlikely(flags & FOLL_LONGTERM) &&
+	    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
+	    !mapping_inode_has_layout(page))
+		return ERR_PTR(-EPERM);
+
 	get_page(page);
 
 	return page;
@@ -1093,6 +1099,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
+
+	if (unlikely(flags & FOLL_LONGTERM) &&
+	    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
+	    !mapping_inode_has_layout(page))
+		return ERR_PTR(-EPERM);
+
 	get_page(page);
 
 	return page;
-- 
2.20.1



* [RFC PATCH v2 05/19] fs/ext4: Teach ext4 to break layout leases
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (3 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 04/19] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

If a layout lease is held, ext4 must attempt to break it in order to know
whether the layout can be modified.

Split out the logic to determine if a mapping is DAX, export it, and then
break layout leases if a mapping is DAX.
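
The resulting control flow can be sketched as follows.  This is a stand-alone
model with hypothetical names: the struct fields stand in for
CONFIG_FS_DAX_LIMITED, dax_mapping() and mapping_mapped(), and
lease_break_error stands in for the return value of break_layout().

```c
#include <stdbool.h>

struct model_mapping {
	bool fs_dax_limited;	/* stands in for CONFIG_FS_DAX_LIMITED */
	bool is_dax;		/* stands in for dax_mapping() */
	bool is_mapped;		/* stands in for mapping_mapped() */
};

/* Mirrors the predicate split out as dax_mapping_is_dax(). */
static bool model_mapping_is_dax(const struct model_mapping *m)
{
	if (m->fs_dax_limited)
		return false;
	return m->is_dax && m->is_mapped;
}

/*
 * Ordering added to ext4_break_layouts(): try the lease break first and
 * return its error before any scan for busy pages happens.
 */
static int model_break_layouts(const struct model_mapping *m,
			       int lease_break_error)
{
	if (model_mapping_is_dax(m) && lease_break_error)
		return lease_break_error;

	return 0;	/* the real code goes on to wait for busy pages */
}
```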

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from RFC v1:

	Based on feedback from Dave Chinner, add support to fail all
	other layout breaks when a lease is held.

 fs/dax.c            | 23 ++++++++++++++++-------
 fs/ext4/inode.c     |  7 +++++++
 include/linux/dax.h |  6 ++++++
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b64964ef44f6..a14ec32255d8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -557,6 +557,21 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
+bool dax_mapping_is_dax(struct address_space *mapping)
+{
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return false;
+
+	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
+
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
@@ -579,13 +594,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	unsigned int scanned = 0;
 	struct page *page = NULL;
 
-	/*
-	 * In the 'limited' case get_user_pages() for dax is disabled.
-	 */
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return NULL;
-
-	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+	if (!dax_mapping_is_dax(mapping))
 		return NULL;
 
 	/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b2c8d09acf65..f08f48de52c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4271,6 +4271,13 @@ int ext4_break_layouts(struct inode *inode)
 	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
 		return -EINVAL;
 
+	/* Break layout leases if active */
+	if (dax_mapping_is_dax(inode->i_mapping)) {
+		error = break_layout(inode, true);
+		if (error)
+			return error;
+	}
+
 	do {
 		page = dax_layout_busy_page(inode->i_mapping);
 		if (!page)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9bd8528bd305..da0768b34b48 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -143,6 +143,7 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 
+bool dax_mapping_is_dax(struct address_space *mapping);
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
@@ -174,6 +175,11 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 	return NULL;
 }
 
+static inline bool dax_mapping_is_dax(struct address_space *mapping)
+{
+	return false;
+}
+
 static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 {
 	return NULL;
-- 
2.20.1



* [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (4 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 05/19] fs/ext4: Teach ext4 to break layout leases ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-23 15:18   ` Vivek Goyal
  2019-08-09 22:58 ` [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Callers of dax_layout_busy_page() rarely operate on the entire file of
concern.

Teach dax_layout_busy_page() to operate on a sub-range of the
address_space provided.  Specifying 0 - ULONG_MAX, however, continues
to operate on the "entire file".  XFS uses that here; its conversion to
sub-ranges is split out to a separate patch.

This could potentially speed up dax_layout_busy_page() as well.
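
For illustration only (not part of this patch), the index arithmetic
added to dax_layout_busy_page() can be tried out in user space with the
sketch below.  The 4 KiB page size, the sketch_* names and the use of
uint64_t in place of loff_t / unsigned long are assumptions; the
formula itself follows the diff.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12		/* assumed 4 KiB pages */
#define SKETCH_WHOLE_FILE UINT64_MAX	/* stands in for ULONG_MAX */

struct sketch_idx_range {
	uint64_t start;	/* first index handed to the iterator */
	uint64_t end;	/* last index visited, inclusive */
};

/* Mirror the start_idx/end_idx computation from the patch. */
static struct sketch_idx_range sketch_busy_range(uint64_t off, uint64_t len)
{
	struct sketch_idx_range r;

	/* The sketch does not model byte ranges that wrap around. */
	assert(len == SKETCH_WHOLE_FILE || off <= UINT64_MAX - len);

	r.start = off >> SKETCH_PAGE_SHIFT;
	if (len == SKETCH_WHOLE_FILE)
		r.end = UINT64_MAX;	/* "whole file" sentinel */
	else
		r.end = r.start + (len >> SKETCH_PAGE_SHIFT);
	return r;
}
```

With these assumptions an aligned one-page range (off = 8192,
len = 4096) yields indices 2..3, while an unaligned range such as
off = 4095, len = 2 yields 0..0 even though its last byte lies in
page 1.  Whether that rounding matters for the real callers is not
something the sketch can answer.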

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from RFC v1:
	Fix 0-day build errors

 fs/dax.c            | 15 +++++++++++----
 fs/ext4/ext4.h      |  2 +-
 fs/ext4/extents.c   |  6 +++---
 fs/ext4/inode.c     | 19 ++++++++++++-------
 fs/xfs/xfs_file.c   |  3 ++-
 include/linux/dax.h |  6 ++++--
 6 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a14ec32255d8..3ad19c384454 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -573,8 +573,11 @@ bool dax_mapping_is_dax(struct address_space *mapping)
 EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
 
 /**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page - find first pinned page in @mapping within
+ *                        the range @off - @off + @len
  * @mapping: address space to scan for a page with ref count > 1
+ * @off: offset to start at
+ * @len: length to scan through
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
@@ -587,9 +590,13 @@ EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page(struct address_space *mapping,
+				  loff_t off, loff_t len)
 {
-	XA_STATE(xas, &mapping->i_pages, 0);
+	unsigned long start_idx = off >> PAGE_SHIFT;
+	unsigned long end_idx = (len == ULONG_MAX) ? ULONG_MAX
+				: start_idx + (len >> PAGE_SHIFT);
+	XA_STATE(xas, &mapping->i_pages, start_idx);
 	void *entry;
 	unsigned int scanned = 0;
 	struct page *page = NULL;
@@ -612,7 +619,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 	unmap_mapping_range(mapping, 0, 0, 1);
 
 	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, ULONG_MAX) {
+	xas_for_each(&xas, entry, end_idx) {
 		if (WARN_ON_ONCE(!xa_is_value(entry)))
 			continue;
 		if (unlikely(dax_is_locked(entry)))
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9c7f4036021b..32738ccdac1d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2578,7 +2578,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
-extern int ext4_break_layouts(struct inode *);
+extern int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..ded4b1d92299 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4736,7 +4736,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		 */
 		down_write(&EXT4_I(inode)->i_mmap_sem);
 
-		ret = ext4_break_layouts(inode);
+		ret = ext4_break_layouts(inode, offset, len);
 		if (ret) {
 			up_write(&EXT4_I(inode)->i_mmap_sem);
 			goto out_mutex;
@@ -5419,7 +5419,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	 */
 	down_write(&EXT4_I(inode)->i_mmap_sem);
 
-	ret = ext4_break_layouts(inode);
+	ret = ext4_break_layouts(inode, offset, len);
 	if (ret)
 		goto out_mmap;
 
@@ -5572,7 +5572,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 	 */
 	down_write(&EXT4_I(inode)->i_mmap_sem);
 
-	ret = ext4_break_layouts(inode);
+	ret = ext4_break_layouts(inode, offset, len);
 	if (ret)
 		goto out_mmap;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f08f48de52c5..d3fc6035428c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4262,7 +4262,7 @@ static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 	down_write(&ei->i_mmap_sem);
 }
 
-int ext4_break_layouts(struct inode *inode)
+int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct page *page;
@@ -4279,7 +4279,7 @@ int ext4_break_layouts(struct inode *inode)
 	}
 
 	do {
-		page = dax_layout_busy_page(inode->i_mapping);
+		page = dax_layout_busy_page(inode->i_mapping, offset, len);
 		if (!page)
 			return 0;
 
@@ -4366,7 +4366,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	 */
 	down_write(&EXT4_I(inode)->i_mmap_sem);
 
-	ret = ext4_break_layouts(inode);
+	ret = ext4_break_layouts(inode, offset, length);
 	if (ret)
 		goto out_dio;
 
@@ -5657,10 +5657,15 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 		down_write(&EXT4_I(inode)->i_mmap_sem);
 
-		rc = ext4_break_layouts(inode);
-		if (rc) {
-			up_write(&EXT4_I(inode)->i_mmap_sem);
-			return rc;
+		if (shrink) {
+			loff_t off = attr->ia_size;
+			loff_t len = inode->i_size - attr->ia_size;
+
+			rc = ext4_break_layouts(inode, off, len);
+			if (rc) {
+				up_write(&EXT4_I(inode)->i_mmap_sem);
+				return rc;
+			}
 		}
 
 		if (attr->ia_size != inode->i_size) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 28101bbc0b78..8f8d478f9ec6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -740,7 +740,8 @@ xfs_break_dax_layouts(
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
 
-	page = dax_layout_busy_page(inode->i_mapping);
+	/* We default to the "whole file" */
+	page = dax_layout_busy_page(inode->i_mapping, 0, ULONG_MAX);
 	if (!page)
 		return 0;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index da0768b34b48..f34616979e45 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -144,7 +144,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
 
 bool dax_mapping_is_dax(struct address_space *mapping);
-struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page(struct address_space *mapping,
+				  loff_t off, loff_t len);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 #else
@@ -180,7 +181,8 @@ static inline bool dax_mapping_is_dax(struct address_space *mapping)
 	return false;
 }
 
-static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+static inline struct page *dax_layout_busy_page(struct address_space *mapping,
+						loff_t off, loff_t len)
 {
 	return NULL;
 }
-- 
2.20.1



* [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (5 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 23:30   ` Dave Chinner
  2019-08-09 22:58 ` [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken ira.weiny
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

dax_layout_busy_page() can now operate on a sub-range of the
address_space provided.

Have xfs specify the sub-range to dax_layout_busy_page().
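
For illustration only (not part of this patch), the two ways the diff
derives the range to break can be sketched in user space as follows:
a truncate only breaks layouts when the file shrinks, and the space
ioctl widens a zero length to the whole file.  The sketch_* names and
the uint64_t types are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_WHOLE_FILE UINT64_MAX	/* stands in for ULONG_MAX */

struct sketch_break_range {
	bool needed;	/* false: no layout break is issued at all */
	uint64_t off;
	uint64_t len;
};

/* Truncate: only a shrink removes blocks, i.e. [new_size, old_size). */
static struct sketch_break_range sketch_truncate_range(uint64_t old_size,
						       uint64_t new_size)
{
	struct sketch_break_range r = { false, 0, 0 };

	if (new_size < old_size) {
		r.needed = true;
		r.off = new_size;
		r.len = old_size - new_size;
	}
	return r;
}

/* Space ioctl: a length of 0 is treated as "the whole file". */
static struct sketch_break_range sketch_space_range(uint64_t start,
						    uint64_t len)
{
	struct sketch_break_range r = { true, start, len };

	if (len == 0)
		r.len = SKETCH_WHOLE_FILE;
	return r;
}
```

For example, shrinking a 1 MiB file to 64 KiB gives off = 65536 and
len = 983040 in this model, whereas extending a file issues no break.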

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_file.c  | 19 +++++++++++++------
 fs/xfs/xfs_inode.h |  5 +++--
 fs/xfs/xfs_ioctl.c | 15 ++++++++++++---
 fs/xfs/xfs_iops.c  | 14 ++++++++++----
 4 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8f8d478f9ec6..447571e3cb02 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -295,7 +295,11 @@ xfs_file_aio_write_checks(
 	if (error <= 0)
 		return error;
 
-	error = xfs_break_layouts(inode, iolock, BREAK_WRITE);
+	/*
+	 * BREAK_WRITE ignores the offset/len tuple; just specify the whole
+	 * file (0 - ULONG_MAX) to be safe.
+	 */
+	error = xfs_break_layouts(inode, iolock, 0, ULONG_MAX, BREAK_WRITE);
 	if (error)
 		return error;
 
@@ -734,14 +738,15 @@ xfs_wait_dax_page(
 static int
 xfs_break_dax_layouts(
 	struct inode		*inode,
-	bool			*retry)
+	bool			*retry,
+	loff_t                   off,
+	loff_t                   len)
 {
 	struct page		*page;
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
 
-	/* We default to the "whole file" */
-	page = dax_layout_busy_page(inode->i_mapping, 0, ULONG_MAX);
+	page = dax_layout_busy_page(inode->i_mapping, off, len);
 	if (!page)
 		return 0;
 
@@ -755,6 +760,8 @@ int
 xfs_break_layouts(
 	struct inode		*inode,
 	uint			*iolock,
+	loff_t                   off,
+	loff_t                   len,
 	enum layout_break_reason reason)
 {
 	bool			retry;
@@ -766,7 +773,7 @@ xfs_break_layouts(
 		retry = false;
 		switch (reason) {
 		case BREAK_UNMAP:
-			error = xfs_break_dax_layouts(inode, &retry);
+			error = xfs_break_dax_layouts(inode, &retry, off, len);
 			if (error || retry)
 				break;
 			/* fall through */
@@ -808,7 +815,7 @@ xfs_file_fallocate(
 		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
+	error = xfs_break_layouts(inode, &iolock, offset, len, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 558173f95a03..1b0948f5267c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -475,8 +475,9 @@ enum xfs_prealloc_flags {
 
 int	xfs_update_prealloc_flags(struct xfs_inode *ip,
 				  enum xfs_prealloc_flags flags);
-int	xfs_break_layouts(struct inode *inode, uint *iolock,
-		enum layout_break_reason reason);
+int xfs_break_layouts(struct inode *inode, uint *iolock,
+		      loff_t off, loff_t len,
+		      enum layout_break_reason reason);
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6f7848cd5527..3897b88080bd 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -597,6 +597,7 @@ xfs_ioc_space(
 	enum xfs_prealloc_flags	flags = 0;
 	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	int			error;
+	loff_t                  break_length;
 
 	if (inode->i_flags & (S_IMMUTABLE|S_APPEND))
 		return -EPERM;
@@ -617,9 +618,6 @@ xfs_ioc_space(
 		return error;
 
 	xfs_ilock(ip, iolock);
-	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
-	if (error)
-		goto out_unlock;
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
@@ -665,6 +663,17 @@ xfs_ioc_space(
 		goto out_unlock;
 	}
 
+	/* break layout for the whole file if len ends up 0 */
+	if (bf->l_len == 0)
+		break_length = ULONG_MAX;
+	else
+		break_length = bf->l_len;
+
+	error = xfs_break_layouts(inode, &iolock, bf->l_start, break_length,
+				  BREAK_UNMAP);
+	if (error)
+		goto out_unlock;
+
 	switch (cmd) {
 	case XFS_IOC_ZERO_RANGE:
 		flags |= XFS_PREALLOC_SET;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff3c1fae5357..f0de5486f6c1 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1042,10 +1042,16 @@ xfs_vn_setattr(
 		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
-		if (error) {
-			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
-			return error;
+		if (iattr->ia_size < inode->i_size) {
+			loff_t                  off = iattr->ia_size;
+			loff_t                  len = inode->i_size - iattr->ia_size;
+
+			error = xfs_break_layouts(inode, &iolock, off, len,
+						  BREAK_UNMAP);
+			if (error) {
+				xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+				return error;
+			}
 		}
 
 		error = xfs_vn_setattr_size(dentry, iattr);
-- 
2.20.1



* [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (6 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 23:22   ` Dave Chinner
  2019-08-09 22:58 ` [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure ira.weiny
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

If pages are under a lease, fail the truncate operation.  We change the
order of lease breaks so that the operation fails directly if the lease
exists.

Select EXPORTFS_BLOCK_OPS for FS_DAX to ensure that
xfs_break_leased_layouts() is defined for FS_DAX as well as pNFS.
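
For illustration only (not part of this patch), the reordered
BREAK_UNMAP case can be modelled in user space as below.  The sketch_*
names, the stubbed results and the use of -EAGAIN as the lease failure
are assumptions, and the retry loop around the switch is left out; the
ordering of the two breaks is what the diff changes.

```c
#include <errno.h>

enum sketch_reason { SKETCH_BREAK_WRITE, SKETCH_BREAK_UNMAP };

/* Hypothetical inode model with stubbed break results. */
struct sketch_inode {
	int lease_err;	/* result of the leased-layout break */
	int dax_err;	/* result of the DAX busy-page break */
	int dax_calls;	/* how often the DAX break was reached */
};

static int sketch_break_leased(struct sketch_inode *ip)
{
	return ip->lease_err;
}

static int sketch_break_dax(struct sketch_inode *ip)
{
	ip->dax_calls++;
	return ip->dax_err;
}

static int sketch_xfs_break_layouts(struct sketch_inode *ip,
				    enum sketch_reason reason)
{
	int error = 0;

	switch (reason) {
	case SKETCH_BREAK_UNMAP:
		/* Lease first: a held lease fails the operation outright. */
		error = sketch_break_leased(ip);
		if (error)
			break;
		error = sketch_break_dax(ip);
		break;
	case SKETCH_BREAK_WRITE:
		error = sketch_break_leased(ip);
		break;
	}
	return error;
}
```

In this model a failing lease break returns before the DAX pages are
ever examined, and BREAK_WRITE never touches the DAX path.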

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/Kconfig        | 1 +
 fs/xfs/xfs_file.c | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 14cd4abdc143..c10b91f92528 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
 	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
 	select FS_IOMAP
 	select DAX
+	select EXPORTFS_BLOCK_OPS
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
 	  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 447571e3cb02..850d0a0953a2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -773,10 +773,11 @@ xfs_break_layouts(
 		retry = false;
 		switch (reason) {
 		case BREAK_UNMAP:
-			error = xfs_break_dax_layouts(inode, &retry, off, len);
+			error = xfs_break_leased_layouts(inode, iolock, &retry);
 			if (error || retry)
 				break;
-			/* fall through */
+			error = xfs_break_dax_layouts(inode, &retry, off, len);
+			break;
 		case BREAK_WRITE:
 			error = xfs_break_leased_layouts(inode, iolock, &retry);
 			break;
-- 
2.20.1



* [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (7 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-10  0:06   ` John Hubbard
  2019-08-09 22:58 ` [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast ira.weiny
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Some subsystems need to pass owning file information to GUP calls so
that GUP can associate the "owning file" with any files being pinned
within the GUP call.

Introduce an object to specify this information and pass it down through
some of the GUP call stack.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/mm.h |  9 +++++++++
 mm/gup.c           | 36 ++++++++++++++++++++++--------------
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04f22722b374..befe150d17be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -971,6 +971,15 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
+/**
+ * @f_owner: The file which "owns this GUP"
+ * @mm: The mm which "owns this GUP"
+ */
+struct vaddr_pin {
+	struct file *f_owner;
+	struct mm_struct *mm;
+};
+
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 void __put_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
diff --git a/mm/gup.c b/mm/gup.c
index 0b05e22ac05f..7a449500f0a6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1005,7 +1005,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 						struct page **pages,
 						struct vm_area_struct **vmas,
 						int *locked,
-						unsigned int flags)
+						unsigned int flags,
+						struct vaddr_pin *vaddr_pin)
 {
 	long ret, pages_done;
 	bool lock_dropped;
@@ -1165,7 +1166,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
 				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE,
+				       NULL);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
@@ -1320,7 +1322,8 @@ static long __get_user_pages_locked(struct task_struct *tsk,
 		struct mm_struct *mm, unsigned long start,
 		unsigned long nr_pages, struct page **pages,
 		struct vm_area_struct **vmas, int *locked,
-		unsigned int foll_flags)
+		unsigned int foll_flags,
+		struct vaddr_pin *vaddr_pin)
 {
 	struct vm_area_struct *vma;
 	unsigned long vm_flags;
@@ -1504,7 +1507,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 		 */
 		nr_pages = __get_user_pages_locked(tsk, mm, start, nr_pages,
 						   pages, vmas, NULL,
-						   gup_flags);
+						   gup_flags, NULL);
 
 		if ((nr_pages > 0) && migrate_allow) {
 			drain_allow = true;
@@ -1537,7 +1540,8 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 				  unsigned long nr_pages,
 				  struct page **pages,
 				  struct vm_area_struct **vmas,
-				  unsigned int gup_flags)
+				  unsigned int gup_flags,
+				  struct vaddr_pin *vaddr_pin)
 {
 	struct vm_area_struct **vmas_tmp = vmas;
 	unsigned long flags = 0;
@@ -1558,7 +1562,7 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 	}
 
 	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
-				     vmas_tmp, NULL, gup_flags);
+				     vmas_tmp, NULL, gup_flags, vaddr_pin);
 
 	if (gup_flags & FOLL_LONGTERM) {
 		memalloc_nocma_restore(flags);
@@ -1588,10 +1592,11 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 						  unsigned long nr_pages,
 						  struct page **pages,
 						  struct vm_area_struct **vmas,
-						  unsigned int flags)
+						  unsigned int flags,
+						  struct vaddr_pin *vaddr_pin)
 {
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       NULL, flags);
+				       NULL, flags, vaddr_pin);
 }
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
@@ -1607,7 +1612,8 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 		struct vm_area_struct **vmas)
 {
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
-				     pages, vmas, gup_flags | FOLL_TOUCH);
+				     pages, vmas, gup_flags | FOLL_TOUCH,
+				     NULL);
 }
 EXPORT_SYMBOL(get_user_pages);
 
@@ -1647,7 +1653,7 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 
 	return __get_user_pages_locked(current, current->mm, start, nr_pages,
 				       pages, NULL, locked,
-				       gup_flags | FOLL_TOUCH);
+				       gup_flags | FOLL_TOUCH, NULL);
 }
 EXPORT_SYMBOL(get_user_pages_locked);
 
@@ -1684,7 +1690,7 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 
 	down_read(&mm->mmap_sem);
 	ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
-				      &locked, gup_flags | FOLL_TOUCH);
+				      &locked, gup_flags | FOLL_TOUCH, NULL);
 	if (locked)
 		up_read(&mm->mmap_sem);
 	return ret;
@@ -2377,7 +2383,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 
 static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
-				   unsigned int gup_flags, struct page **pages)
+				   unsigned int gup_flags, struct page **pages,
+				   struct vaddr_pin *vaddr_pin)
 {
 	int ret;
 
@@ -2389,7 +2396,8 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 		down_read(&current->mm->mmap_sem);
 		ret = __gup_longterm_locked(current, current->mm,
 					    start, nr_pages,
-					    pages, NULL, gup_flags);
+					    pages, NULL, gup_flags,
+					    vaddr_pin);
 		up_read(&current->mm->mmap_sem);
 	} else {
 		ret = get_user_pages_unlocked(start, nr_pages,
@@ -2448,7 +2456,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 		pages += nr;
 
 		ret = __gup_longterm_unlocked(start, nr_pages - nr,
-					      gup_flags, pages);
+					      gup_flags, pages, NULL);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
-- 
2.20.1



* [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (8 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-10  0:06   ` John Hubbard
  2019-08-09 22:58 ` [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack ira.weiny
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Internally GUP fast needs to know that fast users will not support file
pins.  Pass NULL for vaddr_pin through the fast call stack so that the
pin code can return an error if it encounters file backed memory within
the address range.
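
For illustration only (not part of this patch), the intent can be
modelled in user space as below: a pin request carrying no vaddr_pin
is refused once file backed memory is encountered.  The sketch_* names
are hypothetical, and how the real pin code reacts is not shown in
this patch; the -EPERM is an assumption borrowed from the F_LAYOUT
check in patch 04.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct mm_struct. */
struct sketch_file { int id; };
struct sketch_mm { int id; };

/* Same two members as the vaddr_pin structure from patch 09. */
struct sketch_vaddr_pin {
	struct sketch_file *f_owner;
	struct sketch_mm *mm;
};

/*
 * Decide whether one page may be pinned.  Anonymous memory needs no
 * owner; file backed memory needs a vaddr_pin to record the owner in.
 */
static int sketch_pin_page(bool file_backed,
			   const struct sketch_vaddr_pin *vaddr_pin)
{
	if (!file_backed)
		return 0;
	if (vaddr_pin == NULL)
		return -EPERM;	/* assumed errno, see the note above */
	return 0;
}
```

The fast path always passes NULL here, so in this model it can only
succeed on memory that is not file backed.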

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 mm/gup.c | 65 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7a449500f0a6..504af3e9a942 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1813,7 +1813,8 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+			 unsigned int flags, struct page **pages, int *nr,
+			 struct vaddr_pin *vaddr_pin)
 {
 	struct dev_pagemap *pgmap = NULL;
 	int nr_start = *nr, ret = 0;
@@ -1894,7 +1895,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+			 unsigned int flags, struct page **pages, int *nr,
+			 struct vaddr_pin *vaddr_pin)
 {
 	return 0;
 }
@@ -1903,7 +1905,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr,
-		unsigned int flags)
+		unsigned int flags, struct vaddr_pin *vaddr_pin)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1938,13 +1940,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr,
-		unsigned int flags)
+		unsigned int flags, struct vaddr_pin *vaddr_pin)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags,
+			       vaddr_pin))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -1957,13 +1960,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr,
-		unsigned int flags)
+		unsigned int flags, struct vaddr_pin *vaddr_pin)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
+	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags,
+			       vaddr_pin))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1975,7 +1979,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr,
-		unsigned int flags)
+		unsigned int flags, struct vaddr_pin *vaddr_pin)
 {
 	BUILD_BUG();
 	return 0;
@@ -1983,7 +1987,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 		unsigned long end, struct page **pages, int *nr,
-		unsigned int flags)
+		unsigned int flags, struct vaddr_pin *vaddr_pin)
 {
 	BUILD_BUG();
 	return 0;
@@ -2075,7 +2079,8 @@ static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
 #endif /* CONFIG_ARCH_HAS_HUGEPD */
 
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+		unsigned long end, unsigned int flags, struct page **pages,
+		int *nr, struct vaddr_pin *vaddr_pin)
 {
 	struct page *head, *page;
 	int refs;
@@ -2087,7 +2092,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr,
-					     flags);
+					     flags, vaddr_pin);
 	}
 
 	refs = 0;
@@ -2117,7 +2122,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+		unsigned long end, unsigned int flags, struct page **pages, int *nr,
+		struct vaddr_pin *vaddr_pin)
 {
 	struct page *head, *page;
 	int refs;
@@ -2129,7 +2135,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr,
-					     flags);
+					     flags, vaddr_pin);
 	}
 
 	refs = 0;
@@ -2196,7 +2202,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 }
 
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
-		unsigned int flags, struct page **pages, int *nr)
+		unsigned int flags, struct page **pages, int *nr,
+		struct vaddr_pin *vaddr_pin)
 {
 	unsigned long next;
 	pmd_t *pmdp;
@@ -2220,7 +2227,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 				return 0;
 
 			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
-				pages, nr))
+				pages, nr, vaddr_pin))
 				return 0;
 
 		} else if (unlikely(is_hugepd(__hugepd(pmd_val(pmd))))) {
@@ -2231,7 +2238,8 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
 					 PMD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
+		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr,
+					  vaddr_pin))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
@@ -2239,7 +2247,8 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 }
 
 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+			 unsigned int flags, struct page **pages, int *nr,
+			 struct vaddr_pin *vaddr_pin)
 {
 	unsigned long next;
 	pud_t *pudp;
@@ -2253,13 +2262,14 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			return 0;
 		if (unlikely(pud_huge(pud))) {
 			if (!gup_huge_pud(pud, pudp, addr, next, flags,
-					  pages, nr))
+					  pages, nr, vaddr_pin))
 				return 0;
 		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
 			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
 					 PUD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
+		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr,
+					  vaddr_pin))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
@@ -2267,7 +2277,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 }
 
 static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
-			 unsigned int flags, struct page **pages, int *nr)
+			 unsigned int flags, struct page **pages, int *nr,
+			 struct vaddr_pin *vaddr_pin)
 {
 	unsigned long next;
 	p4d_t *p4dp;
@@ -2284,7 +2295,8 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
 					 P4D_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
+		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr,
+					  vaddr_pin))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -2292,7 +2304,8 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
 }
 
 static void gup_pgd_range(unsigned long addr, unsigned long end,
-		unsigned int flags, struct page **pages, int *nr)
+		unsigned int flags, struct page **pages, int *nr,
+		struct vaddr_pin *vaddr_pin)
 {
 	unsigned long next;
 	pgd_t *pgdp;
@@ -2312,7 +2325,8 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, flags, pages, nr))
 				return;
-		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
+		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr,
+					  vaddr_pin))
 			return;
 	} while (pgdp++, addr = next, addr != end);
 }
@@ -2374,7 +2388,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr,
+			      NULL);
 		local_irq_restore(flags);
 	}
 
@@ -2445,7 +2460,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, gup_flags, pages, &nr);
+		gup_pgd_range(addr, end, gup_flags, pages, &nr, NULL);
 		local_irq_enable();
 		ret = nr;
 	}
-- 
2.20.1



* [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (9 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-10  0:18   ` John Hubbard
  2019-08-09 22:58 ` [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct ira.weiny
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

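In preparation for passing more information (vaddr_pin) into
follow_page_pte(), follow_devmap_pud(), and follow_devmap_pmd(), pass the
whole follow_page_context to these functions instead of only a pointer to
its pgmap member.  To make the structure visible to mm/huge_memory.c, move
its definition and the follow_devmap_*() declarations to mm/internal.h.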

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/huge_mm.h | 17 -----------------
 mm/gup.c                | 31 +++++++++++++++----------------
 mm/huge_memory.c        |  6 ++++--
 mm/internal.h           | 28 ++++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 35 deletions(-)
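
For readers skimming the diff, the refactoring pattern can be sketched in
plain userspace C.  This is only an illustration under stated assumptions:
every *_model name below is a hypothetical stand-in, not a kernel type, and
the function body is invented to show the shape of the change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types, reduced to one field each. */
struct dev_pagemap_model { int id; };
struct vaddr_pin_model { int owner_id; };

/* Mirrors the shape of struct follow_page_context after this patch. */
struct follow_ctx_model {
	struct dev_pagemap_model *pgmap;
	unsigned int page_mask;
	struct vaddr_pin_model *vaddr_pin;
};

/*
 * The callee takes the whole context.  A local alias lets a body that was
 * written against "pgmap" stay unchanged.
 */
int follow_leaf_model(struct follow_ctx_model *ctx,
		      struct dev_pagemap_model *found)
{
	struct dev_pagemap_model **pgmap;

	assert(ctx != NULL);
	pgmap = &ctx->pgmap;
	*pgmap = found;		/* old code path: cache the pagemap */

	/* New information is reachable without another signature change. */
	return ctx->vaddr_pin ? ctx->vaddr_pin->owner_id : -1;
}
```

The alias is the same trick the patch uses at the top of follow_page_pte()
and follow_devmap_{pmd,pud}(): the signature changes once, and later patches
can read ctx->vaddr_pin without touching the callers again.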

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 45ede62aa85b..b01a20ce0bb9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -233,11 +233,6 @@ static inline int hpage_nr_pages(struct page *page)
 	return 1;
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
-
 extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
@@ -375,18 +370,6 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm)
 	return;
 }
 
-static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
-	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline bool thp_migration_supported(void)
 {
 	return false;
diff --git a/mm/gup.c b/mm/gup.c
index 504af3e9a942..a7a9d2f5278c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -24,11 +24,6 @@
 
 #include "internal.h"
 
-struct follow_page_context {
-	struct dev_pagemap *pgmap;
-	unsigned int page_mask;
-};
-
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -172,8 +167,9 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd, unsigned int flags,
-		struct dev_pagemap **pgmap)
+		struct follow_page_context *ctx)
 {
+	struct dev_pagemap **pgmap = &ctx->pgmap;
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	spinlock_t *ptl;
@@ -363,13 +359,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	}
 	if (pmd_devmap(pmdval)) {
 		ptl = pmd_lock(mm, pmd);
-		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
+		page = follow_devmap_pmd(vma, address, pmd, flags, ctx);
 		spin_unlock(ptl);
 		if (page)
 			return page;
 	}
 	if (likely(!pmd_trans_huge(pmdval)))
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		return follow_page_pte(vma, address, pmd, flags, ctx);
 
 	if ((flags & FOLL_NUMA) && pmd_protnone(pmdval))
 		return no_page_table(vma, flags);
@@ -389,7 +385,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		return follow_page_pte(vma, address, pmd, flags, ctx);
 	}
 	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
 		int ret;
@@ -419,7 +415,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		}
 
 		return ret ? ERR_PTR(ret) :
-			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+			follow_page_pte(vma, address, pmd, flags, ctx);
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);
 	spin_unlock(ptl);
@@ -456,7 +452,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	}
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
-		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+		page = follow_devmap_pud(vma, address, pud, flags, ctx);
 		spin_unlock(ptl);
 		if (page)
 			return page;
@@ -786,7 +782,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *nonblocking)
+		struct vm_area_struct **vmas, int *nonblocking,
+		struct vaddr_pin *vaddr_pin)
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
@@ -797,6 +794,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 
+	ctx.vaddr_pin = vaddr_pin;
+
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
 	 * fault information is unrelated to the reference behaviour of a task
@@ -1025,7 +1024,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 	lock_dropped = false;
 	for (;;) {
 		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
-				       vmas, locked);
+				       vmas, locked, vaddr_pin);
 		if (!locked)
 			/* VM_FAULT_RETRY couldn't trigger, bypass */
 			return ret;
@@ -1068,7 +1067,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		lock_dropped = true;
 		down_read(&mm->mmap_sem);
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-				       pages, NULL, NULL);
+				       pages, NULL, NULL, vaddr_pin);
 		if (ret != 1) {
 			BUG_ON(ret > 1);
 			if (!pages_done)
@@ -1226,7 +1225,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	 * not result in a stack expansion that recurses back here.
 	 */
 	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
-				NULL, NULL, nonblocking);
+				NULL, NULL, nonblocking, NULL);
 }
 
 /*
@@ -1311,7 +1310,7 @@ struct page *get_dump_page(unsigned long addr)
 
 	if (__get_user_pages(current, current->mm, addr, 1,
 			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
-			     NULL) < 1)
+			     NULL, NULL) < 1)
 		return NULL;
 	flush_cache_page(vma, addr, page_to_pfn(page));
 	return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bc1a07a55be1..7e09f2f17ed8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -916,8 +916,9 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
+		pmd_t *pmd, int flags, struct follow_page_context *ctx)
 {
+	struct dev_pagemap **pgmap = &ctx->pgmap;
 	unsigned long pfn = pmd_pfn(*pmd);
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -1068,8 +1069,9 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 }
 
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
+		pud_t *pud, int flags, struct follow_page_context *ctx)
 {
+	struct dev_pagemap **pgmap = &ctx->pgmap;
 	unsigned long pfn = pud_pfn(*pud);
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..46ada5279856 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,34 @@
 #include <linux/pagemap.h>
 #include <linux/tracepoint-defs.h>
 
+struct follow_page_context {
+	struct dev_pagemap *pgmap;
+	unsigned int page_mask;
+	struct vaddr_pin *vaddr_pin;
+};
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
+		pmd_t *pmd, int flags, struct follow_page_context *ctx);
+struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
+		pud_t *pud, int flags, struct follow_page_context *ctx);
+#else
+static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
+	unsigned long addr, pmd_t *pmd, int flags,
+	struct follow_page_context *ctx)
+{
+	return NULL;
+}
+
+static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
+	unsigned long addr, pud_t *pud, int flags,
+	struct follow_page_context *ctx)
+{
+	return NULL;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+
 /*
  * The set of flags that only affect watermark checking and reclaim
  * behaviour. This is used by the MM to obey the caller constraints
-- 
2.20.1



* [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (10 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-10  0:30   ` John Hubbard
  2019-08-09 22:58 ` [RFC PATCH v2 13/19] {mm,file}: Add file_pins objects ira.weiny
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Once callers start to use vaddr_pin, the put_user_pages*() calls will need
access to it.  Prepare for this by moving the release logic into internal
helpers which take a vaddr_pin; the exported functions pass NULL for now.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/mm.h |  20 +-------
 mm/gup.c           | 122 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 88 insertions(+), 54 deletions(-)
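
The wrapper pattern used here can be sketched in userspace C as follows.
This is a toy model, not the kernel code: the type and function names are
hypothetical, and the "released" counter is only an assumption about what an
owner might do with the extra argument.  The patch itself does not use
vaddr_pin inside the helpers yet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical models of struct page and of a pin owner. */
struct page_model { int refcount; };
struct pin_owner_model { unsigned long released; };

/* Internal helper: takes the optional owner, like __put_user_page(). */
void release_one(struct pin_owner_model *owner, struct page_model *page)
{
	assert(page->refcount > 0);
	page->refcount--;
	if (owner)		/* illustrative accounting only */
		owner->released++;
}

/* Internal helper for arrays, like __put_user_pages(). */
void release_many(struct pin_owner_model *owner, struct page_model **pages,
		  unsigned long npages)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		release_one(owner, pages[index]);
}

/* Public entry points keep their old signatures and pass NULL. */
void put_one(struct page_model *page)
{
	release_one(NULL, page);
}

void put_many(struct page_model **pages, unsigned long npages)
{
	release_many(NULL, pages, npages);
}
```

Existing callers of the public functions are unaffected, while a later
caller which does have an owner can reach the internal helpers directly.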

diff --git a/include/linux/mm.h b/include/linux/mm.h
index befe150d17be..9d37cafbef9a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1064,25 +1064,7 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
-/**
- * put_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
- *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
- *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
- */
-static inline void put_user_page(struct page *page)
-{
-	put_page(page);
-}
+void put_user_page(struct page *page);
 
 void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 			       bool make_dirty);
diff --git a/mm/gup.c b/mm/gup.c
index a7a9d2f5278c..10cfd30ff668 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -24,30 +24,41 @@
 
 #include "internal.h"
 
-/**
- * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
- * @pages:  array of pages to be maybe marked dirty, and definitely released.
- * @npages: number of pages in the @pages array.
- * @make_dirty: whether to mark the pages dirty
- *
- * "gup-pinned page" refers to a page that has had one of the get_user_pages()
- * variants called on that page.
- *
- * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if @make_dirty is true, and if the page was previously
- * listed as clean. In any case, releases all pages using put_user_page(),
- * possibly via put_user_pages(), for the non-dirty case.
- *
- * Please see the put_user_page() documentation for details.
- *
- * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
- * required, then the caller should a) verify that this is really correct,
- * because _lock() is usually required, and b) hand code it:
- * set_page_dirty_lock(), put_user_page().
- *
- */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty)
+static void __put_user_page(struct vaddr_pin *vaddr_pin, struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reaches one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (put_devmap_managed_page(page))
+		return;
+
+	if (put_page_testzero(page))
+		__put_page(page);
+}
+
+static void __put_user_pages(struct vaddr_pin *vaddr_pin, struct page **pages,
+			     unsigned long npages)
+{
+	unsigned long index;
+
+	/*
+	 * TODO: this can be optimized for huge pages: if a series of pages is
+	 * physically contiguous and part of the same compound page, then a
+	 * single operation to the head page should suffice.
+	 */
+	for (index = 0; index < npages; index++)
+		__put_user_page(vaddr_pin, pages[index]);
+}
+
+static void __put_user_pages_dirty_lock(struct vaddr_pin *vaddr_pin,
+					struct page **pages,
+					unsigned long npages,
+					bool make_dirty)
 {
 	unsigned long index;
 
@@ -58,7 +69,7 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 	 */
 
 	if (!make_dirty) {
-		put_user_pages(pages, npages);
+		__put_user_pages(vaddr_pin, pages, npages);
 		return;
 	}
 
@@ -86,9 +97,58 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		 */
 		if (!PageDirty(page))
 			set_page_dirty_lock(page);
-		put_user_page(page);
+		__put_user_page(vaddr_pin, page);
 	}
 }
+
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ *
+ * put_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. put_user_page() calls must
+ * be perfectly matched up with get_user_page() calls.
+ */
+void put_user_page(struct page *page)
+{
+	__put_user_page(NULL, page);
+}
+EXPORT_SYMBOL(put_user_page);
+
+/**
+ * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * @pages:  array of pages to be maybe marked dirty, and definitely released.
+ * @npages: number of pages in the @pages array.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page" refers to a page that has had one of the get_user_pages()
+ * variants called on that page.
+ *
+ * For each page in the @pages array, make that page (or its head page, if a
+ * compound page) dirty, if @make_dirty is true, and if the page was previously
+ * listed as clean. In any case, releases all pages using put_user_page(),
+ * possibly via put_user_pages(), for the non-dirty case.
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), put_user_page().
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+			       bool make_dirty)
+{
+	__put_user_pages_dirty_lock(NULL, pages, npages, make_dirty);
+}
 EXPORT_SYMBOL(put_user_pages_dirty_lock);
 
 /**
@@ -102,15 +162,7 @@ EXPORT_SYMBOL(put_user_pages_dirty_lock);
  */
 void put_user_pages(struct page **pages, unsigned long npages)
 {
-	unsigned long index;
-
-	/*
-	 * TODO: this can be optimized for huge pages: if a series of pages is
-	 * physically contiguous and part of the same compound page, then a
-	 * single operation to the head page should suffice.
-	 */
-	for (index = 0; index < npages; index++)
-		put_user_page(pages[index]);
+	__put_user_pages(NULL, pages, npages);
 }
 EXPORT_SYMBOL(put_user_pages);
 
-- 
2.20.1



* [RFC PATCH v2 13/19] {mm,file}: Add file_pins objects
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (11 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 14/19] fs/locks: Associate file pins while performing GUP ira.weiny
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

User page pins (aka GUP) need to track file information of files being
pinned by those calls.  Depending on the needs of the caller this
information is stored in 1 of 2 ways.

1) Some subsystems like RDMA associate GUP pins with file descriptors
   which can be passed around to other processes.  In this case a file
   being pinned must be associated with an owning file object (which can
   then be resolved back to any of the processes which have a file
   descriptor 'pointing' to that file object).

2) Other subsystems do not have an owning file and can therefore
   associate the file pin directly to the mm of the process which
   created them.

This patch introduces the new file pin structures and ensures struct
file and struct mm_struct are prepared to store them.

In subsequent patches the required information will be passed into new
page pinning calls and procfs will be enhanced to show this information to
the user.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/file_table.c          |  4 ++++
 include/linux/file.h     | 49 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h       |  2 ++
 include/linux/mm_types.h |  2 ++
 kernel/fork.c            |  3 +++
 5 files changed, 60 insertions(+)
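
The lifecycle rule this patch sets up (initialize an empty pin list when the
owner is created, complain if it is not empty when the owner is freed) can
be sketched in userspace C.  The list helpers below are a hand-rolled
imitation of the kernel's list_head, all names are hypothetical, and locking
is left out on purpose; the real code takes fp_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, imitating struct list_head. */
struct list_node { struct list_node *next, *prev; };

void list_node_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

int list_node_empty(const struct list_node *head)
{
	return head->next == head;
}

void list_node_add(struct list_node *entry, struct list_node *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

void list_node_del(struct list_node *entry)
{
	assert(entry->next != NULL && entry->prev != NULL);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	list_node_init(entry);
}

/* Hypothetical owner (struct file or mm_struct) and pin object. */
struct owner_model { struct list_node file_pins; };
struct pin_model { struct list_node list; int file_id; };

void owner_init(struct owner_model *o)
{
	list_node_init(&o->file_pins);
}

/* Nonzero when teardown would be clean, i.e. no pins are left behind. */
int owner_teardown_ok(const struct owner_model *o)
{
	return list_node_empty(&o->file_pins);
}
```

owner_teardown_ok() plays the role of the WARN_ON(!list_empty(...)) checks
added to file_free_rcu() and __mmdrop().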

diff --git a/fs/file_table.c b/fs/file_table.c
index b07b53f24ff5..38947b9a4769 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -46,6 +46,7 @@ static void file_free_rcu(struct rcu_head *head)
 {
 	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
 
+	WARN_ON(!list_empty(&f->file_pins));
 	put_cred(f->f_cred);
 	kmem_cache_free(filp_cachep, f);
 }
@@ -118,6 +119,9 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
 	f->f_mode = OPEN_FMODE(flags);
 	/* f->f_version: 0 */
 
+	INIT_LIST_HEAD(&f->file_pins);
+	spin_lock_init(&f->fp_lock);
+
 	return f;
 }
 
diff --git a/include/linux/file.h b/include/linux/file.h
index 3fcddff56bc4..cd79adad5b23 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -9,6 +9,7 @@
 #include <linux/compiler.h>
 #include <linux/types.h>
 #include <linux/posix_types.h>
+#include <linux/kref.h>
 
 struct file;
 
@@ -91,4 +92,52 @@ extern void fd_install(unsigned int fd, struct file *file);
 extern void flush_delayed_fput(void);
 extern void __fput_sync(struct file *);
 
+/**
+ * struct file_file_pin
+ *
+ * Associate a pinned file with another file owner.
+ *
+ * Subsystems such as RDMA have the ability to pin memory which is associated
+ * with a file descriptor which can be passed to other processes without
+ * necessarily having that memory accessed in the remote process's address
+ * space.
+ *
+ * @file file backing memory which was pinned by a GUP caller
+ * @f_owner the file representing the GUP owner
+ * @list of all file pins this owner has
+ *       (struct file *)->file_pins
+ * @ref number of times this pin was taken (roughly the number of pages pinned
+ *      in the file)
+ */
+struct file_file_pin {
+	struct file *file;
+	struct file *f_owner;
+	struct list_head list;
+	struct kref ref;
+};
+
+/*
+ * struct mm_file_pin
+ *
+ * Some GUP callers do not have an "owning" file.  Those pins are accounted for
+ * in the mm of the process that called GUP.
+ *
+ * The tuple {file, inode} is used to track this as a unique file pin and to
+ * track when this pin has been removed.
+ *
+ * @file file backing memory which was pinned by a GUP caller
+ * @mm back pointer to owning mm
+ * @inode backing the file
+ * @list of all file pins this owner has
+ *       (struct mm_struct *)->file_pins
+ * @ref number of times this pin was taken
+ */
+struct mm_file_pin {
+	struct file *file;
+	struct mm_struct *mm;
+	struct inode *inode;
+	struct list_head list;
+	struct kref ref;
+};
+
 #endif /* __LINUX_FILE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e41ce547913..d2e08feb9737 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -963,6 +963,8 @@ struct file {
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
 	errseq_t		f_wb_err;
+	struct list_head        file_pins;
+	spinlock_t              fp_lock;
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..4f6ea4acddbd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -516,6 +516,8 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+		struct list_head file_pins;
+		spinlock_t fp_lock; /* lock file_pins */
 	} __randomize_layout;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 0e2f9a2c132c..093f2f2fce1a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -675,6 +675,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 	WARN_ON_ONCE(mm == current->active_mm);
+	WARN_ON(!list_empty(&mm->file_pins));
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
@@ -1013,6 +1014,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->pmd_huge_pte = NULL;
 #endif
 	mm_init_uprobes_state(mm);
+	INIT_LIST_HEAD(&mm->file_pins);
+	spin_lock_init(&mm->fp_lock);
 
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
-- 
2.20.1



* [RFC PATCH v2 14/19] fs/locks: Associate file pins while performing GUP
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (12 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 13/19] {mm,file}: Add file_pins objects ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

When a file-backed area is being pinned, add the appropriate file pin
information to the appropriate file or mm owner.  This information can
then be used by admins to determine who is causing a failure to change
the layout of a file.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/locks.c         | 195 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/mm.h |  35 +++++++-
 mm/gup.c           |   8 +-
 mm/huge_memory.c   |   4 +-
 4 files changed, 230 insertions(+), 12 deletions(-)
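
The bookkeeping added in fs/locks.c follows a find-or-create pattern with a
reference count per pinned file.  A minimal userspace sketch, with several
assumptions: an int stands in for struct file *, a plain counter stands in
for struct kref, there is no locking, and release is keyed by file, which is
a simplification of this sketch rather than a description of the patch.

```c
#include <assert.h>
#include <stdlib.h>

struct pin_entry {
	struct pin_entry *next;
	int file_id;		/* stands in for struct file * */
	unsigned int refs;	/* stands in for struct kref */
};

struct pin_owner { struct pin_entry *pins; };

/* Take one pin reference on file_id; 0 on success, -1 if out of memory. */
int pin_owner_get(struct pin_owner *owner, int file_id)
{
	struct pin_entry *e;

	for (e = owner->pins; e; e = e->next) {
		if (e->file_id == file_id) {
			e->refs++;	/* already tracked: just count it */
			return 0;
		}
	}

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;
	e->file_id = file_id;
	e->refs = 1;		/* kref_init() also starts at one */
	e->next = owner->pins;
	owner->pins = e;
	return 0;
}

/* Drop one reference; unlink and free the entry when the last one goes. */
void pin_owner_put(struct pin_owner *owner, int file_id)
{
	struct pin_entry **link;

	for (link = &owner->pins; *link; link = &(*link)->next) {
		struct pin_entry *e = *link;

		if (e->file_id != file_id)
			continue;
		assert(e->refs > 0);
		if (--e->refs == 0) {
			*link = e->next;
			free(e);
		}
		return;
	}
}

/* Number of distinct files currently pinned by this owner. */
unsigned int pin_owner_count(const struct pin_owner *owner)
{
	const struct pin_entry *e;
	unsigned int n = 0;

	for (e = owner->pins; e; e = e->next)
		n++;
	return n;
}
```

Pinning many pages of the same file therefore costs one list entry, and the
entry disappears only when the last page is released.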

diff --git a/fs/locks.c b/fs/locks.c
index 14892c84844b..02c525446d25 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -168,6 +168,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/hashtable.h>
 #include <linux/percpu.h>
+#include <linux/sched/mm.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/filelock.h>
@@ -2972,9 +2973,194 @@ static int __init filelock_init(void)
 }
 core_initcall(filelock_init);
 
+static struct file_file_pin *alloc_file_file_pin(struct inode *inode,
+						 struct file *file)
+{
+	struct file_file_pin *fp = kzalloc(sizeof(*fp), GFP_ATOMIC);
+
+	if (!fp)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&fp->list);
+	kref_init(&fp->ref);
+	return fp;
+}
+
+static int add_file_pin_to_f_owner(struct vaddr_pin *vaddr_pin,
+				   struct inode *inode,
+				   struct file *file)
+{
+	struct file_file_pin *fp;
+
+	list_for_each_entry(fp, &vaddr_pin->f_owner->file_pins, list) {
+		if (fp->file == file) {
+			kref_get(&fp->ref);
+			return 0;
+		}
+	}
+
+	fp = alloc_file_file_pin(inode, file);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	fp->file = get_file(file);
+	/* NOTE no reference needed here.
+	 * It is expected that the caller holds a reference to the owner file
+	 * for the duration of this pin.
+	 */
+	fp->f_owner = vaddr_pin->f_owner;
+
+	spin_lock(&fp->f_owner->fp_lock);
+	list_add(&fp->list, &fp->f_owner->file_pins);
+	spin_unlock(&fp->f_owner->fp_lock);
+
+	return 0;
+}
+
+static void release_file_file_pin(struct kref *ref)
+{
+	struct file_file_pin *fp = container_of(ref, struct file_file_pin, ref);
+
+	spin_lock(&fp->f_owner->fp_lock);
+	list_del(&fp->list);
+	spin_unlock(&fp->f_owner->fp_lock);
+	fput(fp->file);
+	kfree(fp);
+}
+
+static struct mm_file_pin *alloc_mm_file_pin(struct inode *inode,
+					     struct file *file)
+{
+	struct mm_file_pin *fp = kzalloc(sizeof(*fp), GFP_ATOMIC);
+
+	if (!fp)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&fp->list);
+	kref_init(&fp->ref);
+	return fp;
+}
+
+/**
+ * This object bridges files and the mm struct for the purpose of tracking
+ * which files have GUP pins on them.
+ */
+static int add_file_pin_to_mm(struct vaddr_pin *vaddr_pin, struct inode *inode,
+			      struct file *file)
+{
+	struct mm_file_pin *fp;
+
+	list_for_each_entry(fp, &vaddr_pin->mm->file_pins, list) {
+		if (fp->inode == inode) {
+			kref_get(&fp->ref);
+			return 0;
+		}
+	}
+
+	fp = alloc_mm_file_pin(inode, file);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	fp->inode = igrab(inode);
+	if (!fp->inode) {
+		kfree(fp);
+		return -EFAULT;
+	}
+
+	fp->file = get_file(file);
+	fp->mm = vaddr_pin->mm;
+	mmgrab(fp->mm);
+
+	spin_lock(&fp->mm->fp_lock);
+	list_add(&fp->list, &fp->mm->file_pins);
+	spin_unlock(&fp->mm->fp_lock);
+
+	return 0;
+}
+
+static void release_mm_file_pin(struct kref *ref)
+{
+	struct mm_file_pin *fp = container_of(ref, struct mm_file_pin, ref);
+
+	spin_lock(&fp->mm->fp_lock);
+	list_del(&fp->list);
+	spin_unlock(&fp->mm->fp_lock);
+
+	mmdrop(fp->mm);
+	fput(fp->file);
+	iput(fp->inode);
+	kfree(fp);
+}
+
+static void remove_file_file_pin(struct vaddr_pin *vaddr_pin)
+{
+	struct file_file_pin *fp;
+	struct file_file_pin *tmp;
+
+	list_for_each_entry_safe(fp, tmp, &vaddr_pin->f_owner->file_pins,
+				 list) {
+		kref_put(&fp->ref, release_file_file_pin);
+	}
+}
+
+static void remove_mm_file_pin(struct vaddr_pin *vaddr_pin,
+			       struct inode *inode)
+{
+	struct mm_file_pin *fp;
+	struct mm_file_pin *tmp;
+
+	list_for_each_entry_safe(fp, tmp, &vaddr_pin->mm->file_pins, list) {
+		if (fp->inode == inode)
+			kref_put(&fp->ref, release_mm_file_pin);
+	}
+}
+
+static bool add_file_pin(struct vaddr_pin *vaddr_pin, struct inode *inode,
+			 struct file *file)
+{
+	bool ret = true;
+
+	if (!vaddr_pin || (!vaddr_pin->f_owner && !vaddr_pin->mm))
+		return false;
+
+	if (vaddr_pin->f_owner) {
+		if (add_file_pin_to_f_owner(vaddr_pin, inode, file))
+			ret = false;
+	} else {
+		if (add_file_pin_to_mm(vaddr_pin, inode, file))
+			ret = false;
+	}
+
+	return ret;
+}
+
+void mapping_release_file(struct vaddr_pin *vaddr_pin, struct page *page)
+{
+	struct inode *inode;
+
+	if (WARN_ON(!page) || WARN_ON(!vaddr_pin) ||
+	    WARN_ON(!vaddr_pin->mm && !vaddr_pin->f_owner))
+		return;
+
+	if (PageAnon(page) ||
+	    !page->mapping ||
+	    !page->mapping->host)
+		return;
+
+	inode = page->mapping->host;
+
+	if (vaddr_pin->f_owner)
+		remove_file_file_pin(vaddr_pin);
+	else
+		remove_mm_file_pin(vaddr_pin, inode);
+}
+EXPORT_SYMBOL_GPL(mapping_release_file);
+
 /**
  * mapping_inode_has_layout - ensure a file mapped page has a layout lease
  * taken
+ * @vaddr_pin: pin owner information to store with this pin if a proper layout
+ * lease is found.
  * @page: page we are trying to GUP
  *
  * This should only be called on DAX pages.  DAX pages which are mapped through
@@ -2983,9 +3169,12 @@ core_initcall(filelock_init);
  * This allows the user to opt-into the fact that truncation operations will
  * fail for the duration of the pin.
  *
+ * Also, if the proper layout lease is found, store pinning information in
+ * the owner passed in via the vaddr_pin structure.
+ *
  * Return true if the page has a LAYOUT lease associated with it's file.
  */
-bool mapping_inode_has_layout(struct page *page)
+bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page)
 {
 	bool ret = false;
 	struct inode *inode;
@@ -3003,12 +3192,12 @@ bool mapping_inode_has_layout(struct page *page)
 	if (inode->i_flctx &&
 	    !list_empty_careful(&inode->i_flctx->flc_lease)) {
 		spin_lock(&inode->i_flctx->flc_lock);
-		ret = false;
 		list_for_each_entry(fl, &inode->i_flctx->flc_lease, fl_list) {
 			if (fl->fl_pid == current->tgid &&
 			    (fl->fl_flags & FL_LAYOUT) &&
 			    (fl->fl_flags & FL_EXCLUSIVE)) {
-				ret = true;
+				ret = add_file_pin(vaddr_pin, inode,
+						   fl->fl_file);
 				break;
 			}
 		}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d37cafbef9a..657c947bda49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -981,9 +981,11 @@ struct vaddr_pin {
 };
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
+void mapping_release_file(struct vaddr_pin *vaddr_pin, struct page *page);
 void __put_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -992,7 +994,6 @@ static inline bool put_devmap_managed_page(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -1000,11 +1001,39 @@ static inline bool put_devmap_managed_page(struct page *page)
 	return false;
 }
 
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+	if (is_devmap)
+		__put_devmap_managed_page(page);
+	return is_devmap;
+}
+
+static inline bool put_devmap_managed_user_page(struct vaddr_pin *vaddr_pin,
+						struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		if (page->pgmap->type == MEMORY_DEVICE_FS_DAX)
+			mapping_release_file(vaddr_pin, page);
+
+		__put_devmap_managed_page(page);
+	}
+
+	return is_devmap;
+}
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool put_devmap_managed_page(struct page *page)
 {
 	return false;
 }
+static inline bool put_devmap_managed_user_page(struct vaddr_pin *vaddr_pin,
+						struct page *page)
+{
+	return false;
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1574,7 +1603,7 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
-bool mapping_inode_has_layout(struct page *page);
+bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
 
 /* Container for pinned pfns / pages */
 struct frame_vector {
diff --git a/mm/gup.c b/mm/gup.c
index 10cfd30ff668..eeaa0ddd08a6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -34,7 +34,7 @@ static void __put_user_page(struct vaddr_pin *vaddr_pin, struct page *page)
 	 * page is free and we need to inform the device driver through
 	 * callback. See include/linux/memremap.h and HMM for details.
 	 */
-	if (put_devmap_managed_page(page))
+	if (put_devmap_managed_user_page(vaddr_pin, page))
 		return;
 
 	if (put_page_testzero(page))
@@ -272,7 +272,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 
 		if (unlikely(flags & FOLL_LONGTERM) &&
 		    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
-		    !mapping_inode_has_layout(page)) {
+		    !mapping_inode_has_layout(ctx->vaddr_pin, page)) {
 			page = ERR_PTR(-EPERM);
 			goto out;
 		}
@@ -1915,7 +1915,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (pte_devmap(pte) &&
 		    unlikely(flags & FOLL_LONGTERM) &&
 		    pgmap->type == MEMORY_DEVICE_FS_DAX &&
-		    !mapping_inode_has_layout(head)) {
+		    !mapping_inode_has_layout(vaddr_pin, head)) {
 			put_user_page(head);
 			goto pte_unmap;
 		}
@@ -1972,7 +1972,7 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		if (unlikely(flags & FOLL_LONGTERM) &&
 		    pgmap->type == MEMORY_DEVICE_FS_DAX &&
-		    !mapping_inode_has_layout(page)) {
+		    !mapping_inode_has_layout(vaddr_pin, page)) {
 			undo_dev_pagemap(nr, nr_start, pages);
 			return 0;
 		}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7e09f2f17ed8..2d700e21d4af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -957,7 +957,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	if (unlikely(flags & FOLL_LONGTERM) &&
 	    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
-	    !mapping_inode_has_layout(page))
+	    !mapping_inode_has_layout(ctx->vaddr_pin, page))
 		return ERR_PTR(-EPERM);
 
 	get_page(page);
@@ -1104,7 +1104,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 
 	if (unlikely(flags & FOLL_LONGTERM) &&
 	    (*pgmap)->type == MEMORY_DEVICE_FS_DAX &&
-	    !mapping_inode_has_layout(page))
+	    !mapping_inode_has_layout(ctx->vaddr_pin, page))
 		return ERR_PTR(-EPERM);
 
 	get_page(page);
-- 
2.20.1



* [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (13 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 14/19] fs/locks: Associate file pins while performing GUP ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-10  0:09   ` John Hubbard
                     ` (2 more replies)
  2019-08-09 22:58 ` [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object ira.weiny
                   ` (4 subsequent siblings)
  19 siblings, 3 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

The addition of FOLL_LONGTERM has taken on additional meaning for CMA
pages.

In addition subsystems such as RDMA require new information to be passed
to the GUP interface to track file owning information.  As such a simple
FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.

Introduce a new GUP-like call which takes the newly introduced vaddr_pin
information.  Failure to pass the vaddr_pin object back to a vaddr_unpin*
call will result in a failure if pins were created on files during the
pin operation.
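
As a rough illustration of why the same vaddr_pin has to come back on the
unpin side, here is a userspace model of the per-file reference tracking.
It is only a sketch: the model_* names, the fixed-size table and the
integer inode ids are hypothetical and are not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FILES 8

/* Hypothetical stand-in for one tracked file: an id plus a pin count. */
struct model_file_pin {
	int inode_id;      /* 0 means the slot is unused */
	unsigned int refs; /* outstanding page pins on this file */
};

/* Hypothetical stand-in for struct vaddr_pin and its owner's pin list. */
struct model_vaddr_pin {
	int has_owner; /* models "mm or f_owner is set" */
	struct model_file_pin files[MODEL_MAX_FILES];
};

/* Record one page pin on inode_id: reuse an entry or create a new one. */
static int model_pin(struct model_vaddr_pin *vp, int inode_id)
{
	struct model_file_pin *free_slot = NULL;
	size_t i;

	assert(inode_id != 0);
	if (!vp || !vp->has_owner)
		return -EINVAL;

	for (i = 0; i < MODEL_MAX_FILES; i++) {
		if (vp->files[i].inode_id == inode_id) {
			vp->files[i].refs++;
			return 0;
		}
		if (!free_slot && vp->files[i].inode_id == 0)
			free_slot = &vp->files[i];
	}
	if (!free_slot)
		return -ENOMEM;

	free_slot->inode_id = inode_id;
	free_slot->refs = 1;
	return 0;
}

/* Drop one page pin; the entry goes away with the last reference. */
static int model_unpin(struct model_vaddr_pin *vp, int inode_id)
{
	size_t i;

	assert(inode_id != 0);
	if (!vp || !vp->has_owner)
		return -EINVAL;

	for (i = 0; i < MODEL_MAX_FILES; i++) {
		if (vp->files[i].inode_id != inode_id)
			continue;
		if (--vp->files[i].refs == 0)
			vp->files[i].inode_id = 0;
		return 0;
	}
	return -ENOENT; /* nothing was pinned through this owner */
}

/* Number of distinct files that still have pins recorded. */
static unsigned int model_files_pinned(const struct model_vaddr_pin *vp)
{
	unsigned int n = 0;
	size_t i;

	for (i = 0; i < MODEL_MAX_FILES; i++)
		if (vp->files[i].inode_id != 0)
			n++;
	return n;
}
```

In this model, unpinning through a different (or empty) owner cannot find
the entry, so the file would stay recorded as pinned.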

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from list:
	Change to vaddr_put_pages_dirty_lock
	Change to vaddr_unpin_pages_dirty_lock

 include/linux/mm.h |  5 ++++
 mm/gup.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 657c947bda49..90c5802866df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1603,6 +1603,11 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
+long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
+		     unsigned int gup_flags, struct page **pages,
+		     struct vaddr_pin *vaddr_pin);
+void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
+				  struct vaddr_pin *vaddr_pin, bool make_dirty);
 bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
 
 /* Container for pinned pfns / pages */
diff --git a/mm/gup.c b/mm/gup.c
index eeaa0ddd08a6..6d23f70d7847 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2536,3 +2536,62 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	return ret;
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * vaddr_pin_pages() - pin pages by virtual address and return the pages to the
+ * user.
+ *
+ * @addr: start address
+ * @nr_pages: number of pages to pin
+ * @gup_flags: flags to use for the pin
+ * @pages: array of pages returned
+ * @vaddr_pin: initialized meta information this pin is to be associated
+ * with.
+ *
+ * NOTE regarding vaddr_pin:
+ *
+ * Some callers can share pins via file descriptors to other processes.
+ * Callers such as this should use the f_owner field of vaddr_pin to indicate
+ * the file the fd points to.  All other callers should use the mm this pin is
+ * being made against.  Usually "current->mm".
+ *
+ * Expects mmap_sem to be read locked.
+ */
+long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
+		     unsigned int gup_flags, struct page **pages,
+		     struct vaddr_pin *vaddr_pin)
+{
+	long ret;
+
+	gup_flags |= FOLL_LONGTERM;
+
+	if (!vaddr_pin || (!vaddr_pin->mm && !vaddr_pin->f_owner))
+		return -EINVAL;
+
+	ret = __gup_longterm_locked(current,
+				    vaddr_pin->mm,
+				    addr, nr_pages,
+				    pages, NULL, gup_flags,
+				    vaddr_pin);
+	return ret;
+}
+EXPORT_SYMBOL(vaddr_pin_pages);
+
+/**
+ * vaddr_unpin_pages_dirty_lock - counterpart to vaddr_pin_pages
+ *
+ * @pages: array of pages returned by vaddr_pin_pages
+ * @nr_pages: number of pages in pages
+ * @vaddr_pin: same information passed to vaddr_pin_pages
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * The semantics are similar to put_user_pages_dirty_lock but a vaddr_pin used
+ * in vaddr_pin_pages should be passed back into this call for proper
+ * tracking.
+ */
+void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
+				  struct vaddr_pin *vaddr_pin, bool make_dirty)
+{
+	__put_user_pages_dirty_lock(vaddr_pin, pages, nr_pages, make_dirty);
+}
+EXPORT_SYMBOL(vaddr_unpin_pages_dirty_lock);
-- 
2.20.1



* [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (14 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-12 13:00   ` Jason Gunthorpe
  2019-08-09 22:58 ` [RFC PATCH v2 17/19] RDMA/umem: Convert to vaddr_[pin|unpin]* operations ira.weiny
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

In order for MRs to be tracked against the open verbs context the ufile
needs to have a pointer to hand to the GUP code.

No references need to be taken as this should be valid for the lifetime
of the context.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/infiniband/core/uverbs.h      | 1 +
 drivers/infiniband/core/uverbs_main.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 1e5aeb39f774..e802ba8c67d6 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -163,6 +163,7 @@ struct ib_uverbs_file {
 	struct page *disassociate_page;
 
 	struct xarray		idr;
+	struct file             *sys_file; /* backpointer to system file object */
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 11c13c1381cf..002c24e0d4db 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1092,6 +1092,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	INIT_LIST_HEAD(&file->umaps);
 
 	filp->private_data = file;
+	file->sys_file = filp;
 	list_add_tail(&file->list, &dev->uverbs_file_list);
 	mutex_unlock(&dev->lists_mutex);
 	srcu_read_unlock(&dev->disassociate_srcu, srcu_key);
-- 
2.20.1



* [RFC PATCH v2 17/19] RDMA/umem: Convert to vaddr_[pin|unpin]* operations.
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (15 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 18/19] {mm,procfs}: Add display file_pins proc ira.weiny
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

In order to properly track the pinning information we need to keep a
vaddr_pin object around.  Store that within the umem object directly.

The vaddr_pin object allows the GUP code to associate any files it pins
with the RDMA file descriptor associated with this GUP.

Furthermore, use the vaddr_pin object to store the owning mm while we
are at it.

No references need to be taken on the owning file as the lifetime of that
object is tied to all the umems being destroyed first.
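
Since ib_umem_get() now fills in both the mm and the f_owner, it may help
to spell out which one the file pin ends up attached to.  The userspace
sketch below models the selection done by add_file_pin() earlier in this
series (f_owner wins when set).  The model_* names and types are
hypothetical and only illustrate the rule.

```c
#include <assert.h>
#include <stddef.h>

/* Where a file pin ends up being tracked in this model. */
enum model_pin_owner {
	MODEL_OWNER_NONE, /* no owner given: the pin is refused */
	MODEL_OWNER_FILE, /* tracked against the owning file (f_owner) */
	MODEL_OWNER_MM,   /* tracked against the mm */
};

/* Hypothetical stand-in for the two owner fields of struct vaddr_pin. */
struct model_vaddr_pin {
	const void *mm;
	const void *f_owner;
};

/* f_owner takes precedence over mm; with neither set there is no owner. */
static enum model_pin_owner model_pick_owner(const struct model_vaddr_pin *vp)
{
	if (!vp || (!vp->f_owner && !vp->mm))
		return MODEL_OWNER_NONE;
	return vp->f_owner ? MODEL_OWNER_FILE : MODEL_OWNER_MM;
}

/* Models the umem case, where both fields are filled in by the caller. */
static enum model_pin_owner model_umem_owner(const void *mm,
					     const void *sys_file)
{
	struct model_vaddr_pin vp = { .mm = mm, .f_owner = sys_file };

	assert(mm != NULL); /* the umem always records an owning mm */
	return model_pick_owner(&vp);
}
```

With both fields set, as in the umem case, the model reports the file as
the owner; the mm is only used when no f_owner is supplied.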

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/infiniband/core/umem.c     | 26 +++++++++++++++++---------
 drivers/infiniband/core/umem_odp.c | 16 ++++++++--------
 include/rdma/ib_umem.h             |  2 +-
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 965cf9dea71a..a9ce3e3816ef 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+		vaddr_unpin_pages_dirty_lock(&page, 1, &umem->vaddr_pin,
+					     umem->writable && dirty);
 	}
 
 	sg_free_table(&umem->sg_head);
@@ -243,8 +244,15 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	umem->length     = size;
 	umem->address    = addr;
 	umem->writable   = ib_access_writable(access);
-	umem->owning_mm = mm = current->mm;
-	mmgrab(mm);
+	umem->vaddr_pin.mm = mm = current->mm;
+	mmgrab(umem->vaddr_pin.mm);
+
+	/* No need to get a reference to the core file object here.  The key is
+	 * that the sys_file reference is held by the ufile.  Any duplication of
+	 * sys_file by the core will keep references active until all those
+	 * contexts are closed out, no matter which process holds them open.
+	 */
+	umem->vaddr_pin.f_owner = context->ufile->sys_file;
 
 	if (access & IB_ACCESS_ON_DEMAND) {
 		if (WARN_ON_ONCE(!context->invalidate_range)) {
@@ -292,11 +300,11 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 
 	while (npages) {
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages(cur_base,
+		ret = vaddr_pin_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags | FOLL_LONGTERM,
-				     page_list, NULL);
+				     gup_flags,
+				     page_list, &umem->vaddr_pin);
 		if (ret < 0) {
 			up_read(&mm->mmap_sem);
 			goto umem_release;
@@ -336,7 +344,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	free_page((unsigned long) page_list);
 umem_kfree:
 	if (ret) {
-		mmdrop(umem->owning_mm);
+		mmdrop(umem->vaddr_pin.mm);
 		kfree(umem);
 	}
 	return ret ? ERR_PTR(ret) : umem;
@@ -345,7 +353,7 @@ EXPORT_SYMBOL(ib_umem_get);
 
 static void __ib_umem_release_tail(struct ib_umem *umem)
 {
-	mmdrop(umem->owning_mm);
+	mmdrop(umem->vaddr_pin.mm);
 	if (umem->is_odp)
 		kfree(to_ib_umem_odp(umem));
 	else
@@ -369,7 +377,7 @@ void ib_umem_release(struct ib_umem *umem)
 
 	__ib_umem_release(umem->context->device, umem, 1);
 
-	atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
+	atomic64_sub(ib_umem_num_pages(umem), &umem->vaddr_pin.mm->pinned_vm);
 	__ib_umem_release_tail(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 2a75c6f8d827..53085896d718 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -278,11 +278,11 @@ static int get_per_mm(struct ib_umem_odp *umem_odp)
 	 */
 	mutex_lock(&ctx->per_mm_list_lock);
 	list_for_each_entry(per_mm, &ctx->per_mm_list, ucontext_list) {
-		if (per_mm->mm == umem_odp->umem.owning_mm)
+		if (per_mm->mm == umem_odp->umem.vaddr_pin.mm)
 			goto found;
 	}
 
-	per_mm = alloc_per_mm(ctx, umem_odp->umem.owning_mm);
+	per_mm = alloc_per_mm(ctx, umem_odp->umem.vaddr_pin.mm);
 	if (IS_ERR(per_mm)) {
 		mutex_unlock(&ctx->per_mm_list_lock);
 		return PTR_ERR(per_mm);
@@ -355,8 +355,8 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 	umem->writable   = root->umem.writable;
 	umem->is_odp = 1;
 	odp_data->per_mm = per_mm;
-	umem->owning_mm  = per_mm->mm;
-	mmgrab(umem->owning_mm);
+	umem->vaddr_pin.mm  = per_mm->mm;
+	mmgrab(umem->vaddr_pin.mm);
 
 	mutex_init(&odp_data->umem_mutex);
 	init_completion(&odp_data->notifier_completion);
@@ -389,7 +389,7 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct ib_umem_odp *root,
 out_page_list:
 	vfree(odp_data->page_list);
 out_odp_data:
-	mmdrop(umem->owning_mm);
+	mmdrop(umem->vaddr_pin.mm);
 	kfree(odp_data);
 	return ERR_PTR(ret);
 }
@@ -399,10 +399,10 @@ int ib_umem_odp_get(struct ib_umem_odp *umem_odp, int access)
 {
 	struct ib_umem *umem = &umem_odp->umem;
 	/*
-	 * NOTE: This must called in a process context where umem->owning_mm
+	 * NOTE: This must called in a process context where umem->vaddr_pin.mm
 	 * == current->mm
 	 */
-	struct mm_struct *mm = umem->owning_mm;
+	struct mm_struct *mm = umem->vaddr_pin.mm;
 	int ret_val;
 
 	umem_odp->page_shift = PAGE_SHIFT;
@@ -581,7 +581,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 			      unsigned long current_seq)
 {
 	struct task_struct *owning_process  = NULL;
-	struct mm_struct *owning_mm = umem_odp->umem.owning_mm;
+	struct mm_struct *owning_mm = umem_odp->umem.vaddr_pin.mm;
 	struct page       **local_page_list = NULL;
 	u64 page_mask, off;
 	int j, k, ret = 0, start_idx, npages = 0;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1052d0d62be7..ab677c799e29 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -43,7 +43,6 @@ struct ib_umem_odp;
 
 struct ib_umem {
 	struct ib_ucontext     *context;
-	struct mm_struct       *owning_mm;
 	size_t			length;
 	unsigned long		address;
 	u32 writable : 1;
@@ -52,6 +51,7 @@ struct ib_umem {
 	struct sg_table sg_head;
 	int             nmap;
 	unsigned int    sg_nents;
+	struct vaddr_pin vaddr_pin;
 };
 
 /* Returns the offset of the umem start relative to the first page. */
-- 
2.20.1



* [RFC PATCH v2 18/19] {mm,procfs}: Add display file_pins proc
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (16 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 17/19] RDMA/umem: Convert to vaddr_[pin|unpin]* operations ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-09 22:58 ` [RFC PATCH v2 19/19] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
  2019-08-14 10:17 ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Jan Kara
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Now that the file pin information is stored, add a new procfs entry
to display it to the user.

NOTE: the output depends on what the file pin is tied to.  Some
processes may have the pin associated with a file descriptor, in which
case that file is reported as well.

Others are associated directly with the process mm and are reported as
such.

For example, with one file pinned to an RDMA open context (fd 4) and
another file pinned to the mm of that process:

4: /dev/infiniband/uverbs0
   /mnt/pmem/foo
/mnt/pmem/bar
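
For anyone wanting to write tooling against this, the layout of the
sample above can be reproduced with a small userspace formatter.  This is
only a sketch of the text format as shown in this message: the model_*
names are hypothetical, and show_fp() in this patch additionally prefixes
mm-owned entries with "mm: ", which the sample (and this sketch) omits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one fd that owns file pins. */
struct model_fd_pins {
	unsigned long fd;          /* descriptor number of the owner */
	const char *owner_path;    /* path of the owning file */
	const char *const *pinned; /* NULL-terminated list of pinned paths */
};

/* Append "<prefix><path>\n" to buf; returns -1 if it does not fit. */
static int model_append(char *buf, size_t size, size_t *len,
			const char *prefix, const char *path)
{
	int n;

	assert(*len <= size);
	n = snprintf(buf + *len, size - *len, "%s%s\n", prefix, path);
	if (n < 0 || (size_t)n >= size - *len)
		return -1;
	*len += (size_t)n;
	return 0;
}

/*
 * Render fd-owned pins as "<fd>: <owner>" followed by the pinned files
 * indented by three spaces, then mm-owned pins one per line.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int model_format_file_pins(char *buf, size_t size,
				  const struct model_fd_pins *fd_pins,
				  const char *const *mm_pins)
{
	char head[32];
	size_t len = 0;
	size_t i;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	if (fd_pins) {
		snprintf(head, sizeof(head), "%lu: ", fd_pins->fd);
		if (model_append(buf, size, &len, head, fd_pins->owner_path))
			return -1;
		for (i = 0; fd_pins->pinned[i]; i++)
			if (model_append(buf, size, &len, "   ",
					 fd_pins->pinned[i]))
				return -1;
	}

	for (i = 0; mm_pins && mm_pins[i]; i++)
		if (model_append(buf, size, &len, "", mm_pins[i]))
			return -1;

	return (int)len;
}
```

Feeding it fd 4 with /dev/infiniband/uverbs0 owning /mnt/pmem/foo, plus an
mm-owned /mnt/pmem/bar, yields the three lines of the sample.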

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/proc/base.c | 214 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 214 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..f4d219172235 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2995,6 +2995,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
  */
 static const struct file_operations proc_task_operations;
 static const struct inode_operations proc_task_inode_operations;
+static const struct file_operations proc_pid_file_pins_operations;
 
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
@@ -3024,6 +3025,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
+	REG("file_pins",  S_IRUGO, proc_pid_file_pins_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
@@ -3422,6 +3424,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	ONE("stat",      S_IRUGO, proc_tid_stat),
 	ONE("statm",     S_IRUGO, proc_pid_statm),
 	REG("maps",      S_IRUGO, proc_pid_maps_operations),
+	REG("file_pins", S_IRUGO, proc_pid_file_pins_operations),
 #ifdef CONFIG_PROC_CHILDREN
 	REG("children",  S_IRUGO, proc_tid_children_operations),
 #endif
@@ -3718,3 +3721,214 @@ void __init set_proc_pid_nlink(void)
 	nlink_tid = pid_entry_nlink(tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
 	nlink_tgid = pid_entry_nlink(tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
 }
+
+/**
+ * file_pin information below.
+ */
+
+struct proc_file_pins_private {
+	struct inode *inode;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct files_struct *files;
+	unsigned int nr_pins;
+	struct xarray fps;
+} __randomize_layout;
+
+static void release_fp(struct proc_file_pins_private *priv)
+{
+	up_read(&priv->mm->mmap_sem);
+	mmput(priv->mm);
+}
+
+static void print_fd_file_pin(struct seq_file *m, struct file *file,
+			    unsigned long i)
+{
+	struct file_file_pin *fp;
+	struct file_file_pin *tmp;
+
+	if (list_empty_careful(&file->file_pins))
+		return;
+
+	seq_printf(m, "%lu: ", i);
+	seq_file_path(m, file, "\n");
+	seq_putc(m, '\n');
+
+	list_for_each_entry_safe(fp, tmp, &file->file_pins, list) {
+		seq_puts(m, "   ");
+		seq_file_path(m, fp->file, "\n");
+		seq_putc(m, '\n');
+	}
+}
+
+/* Store the indexes within the FD table for later retrieval */
+static int store_fd(const void *priv, struct file *file, unsigned i)
+{
+	struct proc_file_pins_private *fp_priv;
+
+	/* cast away const... */
+	fp_priv = (struct proc_file_pins_private *)priv;
+
+	if (list_empty_careful(&file->file_pins))
+		return 0;
+
+	/* can't sleep in the iterate of the fd table */
+	xa_store(&fp_priv->fps, fp_priv->nr_pins, xa_mk_value(i), GFP_ATOMIC);
+	fp_priv->nr_pins++;
+
+	return 0;
+}
+
+static void store_mm_pins(struct proc_file_pins_private *priv)
+{
+	struct mm_file_pin *fp;
+	struct mm_file_pin *tmp;
+
+	list_for_each_entry_safe(fp, tmp, &priv->mm->file_pins, list) {
+		xa_store(&priv->fps, priv->nr_pins, fp, GFP_KERNEL);
+		priv->nr_pins++;
+	}
+}
+
+
+static void *fp_start(struct seq_file *m, loff_t *ppos)
+{
+	struct proc_file_pins_private *priv = m->private;
+	unsigned int pos = *ppos;
+
+	priv->task = get_proc_task(priv->inode);
+	if (!priv->task)
+		return ERR_PTR(-ESRCH);
+
+	if (!priv->mm || !mmget_not_zero(priv->mm))
+		return NULL;
+
+	priv->files = get_files_struct(priv->task);
+	down_read(&priv->mm->mmap_sem);
+
+	xa_destroy(&priv->fps);
+	priv->nr_pins = 0;
+
+	/* grab fds of "files" which have pins and store as xa values */
+	if (priv->files)
+		iterate_fd(priv->files, 0, store_fd, priv);
+
+	/* store mm_file_pins as xa entries */
+	store_mm_pins(priv);
+
+	if (pos >= priv->nr_pins) {
+		release_fp(priv);
+		return NULL;
+	}
+
+	return xa_load(&priv->fps, pos);
+}
+
+static void *fp_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct proc_file_pins_private *priv = m->private;
+
+	(*pos)++;
+	if ((*pos) >= priv->nr_pins) {
+		release_fp(priv);
+		return NULL;
+	}
+
+	return xa_load(&priv->fps, *pos);
+}
+
+static void fp_stop(struct seq_file *m, void *v)
+{
+	struct proc_file_pins_private *priv = m->private;
+
+	if (v)
+		release_fp(priv);
+
+	if (priv->task) {
+		put_task_struct(priv->task);
+		priv->task = NULL;
+	}
+
+	if (priv->files) {
+		put_files_struct(priv->files);
+		priv->files = NULL;
+	}
+}
+
+static int show_fp(struct seq_file *m, void *v)
+{
+	struct proc_file_pins_private *priv = m->private;
+
+	if (xa_is_value(v)) {
+		struct file *file;
+		unsigned long fd = xa_to_value(v);
+
+		rcu_read_lock();
+		file = fcheck_files(priv->files, fd);
+		if (file)
+			print_fd_file_pin(m, file, fd);
+		rcu_read_unlock();
+	} else {
+		struct mm_file_pin *fp = v;
+
+		seq_puts(m, "mm: ");
+		seq_file_path(m, fp->file, "\n");
+	}
+
+	return 0;
+}
+
+static const struct seq_operations proc_pid_file_pins_op = {
+	.start	= fp_start,
+	.next	= fp_next,
+	.stop	= fp_stop,
+	.show	= show_fp
+};
+
+static int proc_file_pins_open(struct inode *inode, struct file *file)
+{
+	struct proc_file_pins_private *priv = __seq_open_private(file,
+						&proc_pid_file_pins_op,
+						sizeof(*priv));
+
+	if (!priv)
+		return -ENOMEM;
+
+	xa_init(&priv->fps);
+	priv->inode = inode;
+	priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	priv->task = NULL;
+	if (IS_ERR(priv->mm)) {
+		int err = PTR_ERR(priv->mm);
+
+		seq_release_private(inode, file);
+		return err;
+	}
+
+	return 0;
+}
+
+static int proc_file_pins_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct proc_file_pins_private *priv = seq->private;
+
+	/* This is for "protection" not sure when these may end up not being
+	 * NULL here... */
+	WARN_ON(priv->files);
+	WARN_ON(priv->task);
+
+	if (priv->mm)
+		mmdrop(priv->mm);
+
+	xa_destroy(&priv->fps);
+
+	return seq_release_private(inode, file);
+}
+
+static const struct file_operations proc_pid_file_pins_operations = {
+	.open		= proc_file_pins_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_file_pins_release,
+};
-- 
2.20.1



* [RFC PATCH v2 19/19] mm/gup: Remove FOLL_LONGTERM DAX exclusion
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (17 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 18/19] {mm,procfs}: Add display file_pins proc ira.weiny
@ 2019-08-09 22:58 ` ira.weiny
  2019-08-14 10:17 ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Jan Kara
  19 siblings, 0 replies; 110+ messages in thread
From: ira.weiny @ 2019-08-09 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Now that there is a mechanism for users to safely take LONGTERM pins on
FS DAX pages, remove the FS DAX exclusion from the GUP implementation.

Special processing remains in effect for CONFIG_CMA.

NOTE: Some callers still fail because the vaddr_pin information has not
been passed into the new interface.  As new users appear they can start
to use the new interface to support FS DAX.
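
The control flow left in __gup_longterm_locked() for CONFIG_CMA can be
modeled in userspace roughly as below: the no-CMA allocation state is
saved before the inner GUP call and restored afterwards, with both steps
guarded by the caller's gup_flags.  This is a sketch under assumptions:
the model_* names, the flag values and the callback are hypothetical, and
the save/restore helpers only imitate what I understand
memalloc_nocma_save()/memalloc_nocma_restore() to do.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FOLL_LONGTERM 0x10000u /* illustrative GUP flag bit */
#define MODEL_PF_NOCMA      0x1u     /* illustrative per-task flag bit */

/* Hypothetical stand-in for the task's allocation flags. */
struct model_task {
	unsigned int flags;
};

/* Set the no-CMA bit and return its previous state. */
static unsigned int model_nocma_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_NOCMA;

	t->flags |= MODEL_PF_NOCMA;
	return old;
}

/* Put the no-CMA bit back to the state returned by the save. */
static void model_nocma_restore(struct model_task *t, unsigned int old)
{
	t->flags = (t->flags & ~MODEL_PF_NOCMA) | old;
}

/* Stand-in for the inner GUP call. */
typedef long (*model_gup_fn)(struct model_task *t, long nr_pages);

/* Probe callback: "pins" pages only while the no-CMA bit is set. */
static long model_gup_probe(struct model_task *t, long nr_pages)
{
	return (t->flags & MODEL_PF_NOCMA) ? nr_pages : 0;
}

/* Save and restore are both keyed off the caller-supplied gup_flags. */
static long model_gup_longterm(struct model_task *t, unsigned int gup_flags,
			       long nr_pages, model_gup_fn gup)
{
	unsigned int saved = 0;
	long rc;

	assert(gup != NULL);

	if (gup_flags & MODEL_FOLL_LONGTERM)
		saved = model_nocma_save(t);

	rc = gup(t, nr_pages);

	if (gup_flags & MODEL_FOLL_LONGTERM)
		model_nocma_restore(t, saved);

	return rc;
}
```

The point of the model is that the save and the restore test the same
condition, so the task flags come back unchanged whether or not the bit
was already set on entry.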

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 mm/gup.c | 78 ++++++--------------------------------------------------
 1 file changed, 8 insertions(+), 70 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6d23f70d7847..58f008a3c153 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1415,26 +1415,6 @@ static long __get_user_pages_locked(struct task_struct *tsk,
 }
 #endif /* !CONFIG_MMU */
 
-#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
-static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
-{
-	long i;
-	struct vm_area_struct *vma_prev = NULL;
-
-	for (i = 0; i < nr_pages; i++) {
-		struct vm_area_struct *vma = vmas[i];
-
-		if (vma == vma_prev)
-			continue;
-
-		vma_prev = vma;
-
-		if (vma_is_fsdax(vma))
-			return true;
-	}
-	return false;
-}
-
 #ifdef CONFIG_CMA
 static struct page *new_non_cma_page(struct page *page, unsigned long private)
 {
@@ -1568,18 +1548,6 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 
 	return nr_pages;
 }
-#else
-static long check_and_migrate_cma_pages(struct task_struct *tsk,
-					struct mm_struct *mm,
-					unsigned long start,
-					unsigned long nr_pages,
-					struct page **pages,
-					struct vm_area_struct **vmas,
-					unsigned int gup_flags)
-{
-	return nr_pages;
-}
-#endif /* CONFIG_CMA */
 
 /*
  * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1594,49 +1562,28 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 				  unsigned int gup_flags,
 				  struct vaddr_pin *vaddr_pin)
 {
-	struct vm_area_struct **vmas_tmp = vmas;
 	unsigned long flags = 0;
-	long rc, i;
+	long rc;
 
-	if (gup_flags & FOLL_LONGTERM) {
-		if (!pages)
-			return -EINVAL;
-
-		if (!vmas_tmp) {
-			vmas_tmp = kcalloc(nr_pages,
-					   sizeof(struct vm_area_struct *),
-					   GFP_KERNEL);
-			if (!vmas_tmp)
-				return -ENOMEM;
-		}
+	if (gup_flags & FOLL_LONGTERM)
 		flags = memalloc_nocma_save();
-	}
 
 	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
-				     vmas_tmp, NULL, gup_flags, vaddr_pin);
+				     vmas, NULL, gup_flags, vaddr_pin);
 
 	if (gup_flags & FOLL_LONGTERM) {
 		memalloc_nocma_restore(flags);
 		if (rc < 0)
 			goto out;
 
-		if (check_dax_vmas(vmas_tmp, rc)) {
-			for (i = 0; i < rc; i++)
-				put_page(pages[i]);
-			rc = -EOPNOTSUPP;
-			goto out;
-		}
-
 		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
-						 vmas_tmp, gup_flags);
+						 vmas, gup_flags);
 	}
 
 out:
-	if (vmas_tmp != vmas)
-		kfree(vmas_tmp);
 	return rc;
 }
-#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
+#else /* !CONFIG_CMA */
 static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 						  struct mm_struct *mm,
 						  unsigned long start,
@@ -1649,7 +1596,7 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
 				       NULL, flags, vaddr_pin);
 }
-#endif /* CONFIG_FS_DAX || CONFIG_CMA */
+#endif /* CONFIG_CMA */
 
 /*
  * This is the same as get_user_pages_remote(), just with a
@@ -1887,9 +1834,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;
 
 		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
 				undo_dev_pagemap(nr, nr_start, pages);
@@ -2139,12 +2083,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pmd_devmap(orig))
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr,
 					     flags, vaddr_pin);
-	}
 
 	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@ -2182,12 +2123,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pud_devmap(orig))
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr,
 					     flags, vaddr_pin);
-	}
 
 	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-- 
2.20.1



* Re: [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken
  2019-08-09 22:58 ` [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken ira.weiny
@ 2019-08-09 23:22   ` Dave Chinner
  2019-08-12 18:08     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-09 23:22 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:22PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> If pages are under a lease fail the truncate operation.  We change the order of
> lease breaks to directly fail the operation if the lease exists.
> 
> Select EXPORT_BLOCK_OPS for FS_DAX to ensure that xfs_break_lease_layouts() is
> defined for FS_DAX as well as pNFS.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/Kconfig        | 1 +
>  fs/xfs/xfs_file.c | 5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 14cd4abdc143..c10b91f92528 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -48,6 +48,7 @@ config FS_DAX
>  	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>  	select FS_IOMAP
>  	select DAX
> +	select EXPORTFS_BLOCK_OPS
>  	help
>  	  Direct Access (DAX) can be used on memory-backed block devices.
>  	  If the block device supports DAX and the filesystem supports DAX,

That looks wrong. If you require xfs_break_lease_layouts() outside
of pnfs context, then move the function in the XFS code base to a
file that is built in. Its only external dependency is on the
break_layout() function, and XFS already has other unconditional
direct calls to break_layout()...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()
  2019-08-09 22:58 ` [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
@ 2019-08-09 23:30   ` Dave Chinner
  2019-08-12 18:05     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-09 23:30 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> dax_layout_busy_page() can now operate on a sub-range of the
> address_space provided.
> 
> Have xfs specify the sub range to dax_layout_busy_page()

Hmmm. I've got patches that change all these XFS interfaces to
support range locks. I'm not sure the way the ranges are passed here
is the best way to do it, and I suspect they aren't correct in some
cases, either....

> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ff3c1fae5357..f0de5486f6c1 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
>  		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  
> -		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> -		if (error) {
> -			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> -			return error;
> +		if (iattr->ia_size < inode->i_size) {
> +			loff_t                  off = iattr->ia_size;
> +			loff_t                  len = inode->i_size - iattr->ia_size;
> +
> +			error = xfs_break_layouts(inode, &iolock, off, len,
> +						  BREAK_UNMAP);
> +			if (error) {
> +				xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +				return error;
> +			}

This isn't right - truncate up still needs to break the layout on
the last filesystem block of the file, and truncate down needs to
extend to "maximum file offset" because we remove all extents beyond
EOF on a truncate down.

i.e. when we use preallocation, the extent map extends beyond EOF,
and layout leases need to be able to extend beyond the current EOF
to allow the lease owner to do extending writes, extending truncate,
preallocation beyond EOF, etc safely without having to get a new
lease to cover the new region in the extended file...
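
One reading of the two rules above, as a userspace sketch rather than XFS
code. The helper truncate_break_range(), struct break_range and
MAX_FILE_OFFSET are hypothetical names made up for illustration, and
treating "the last filesystem block" as the block holding the last byte
before the old EOF is an assumption.

```c
#include <stdint.h>

/* Stand-in for "maximum file offset"; a real filesystem has its own limit. */
#define MAX_FILE_OFFSET INT64_MAX

/* Hypothetical byte range over which layouts would be broken. */
struct break_range {
	int64_t off;
	int64_t len;
};

/*
 * Truncate down: everything from the new EOF out to the maximum file
 * offset, because extents beyond EOF are removed as well.
 * Truncate up (or no size change): only the last filesystem block of
 * the existing file, assumed here to be the block holding the last byte.
 */
static struct break_range truncate_break_range(int64_t old_size,
					       int64_t new_size,
					       int64_t blocksize)
{
	struct break_range r;

	if (new_size < old_size) {
		r.off = new_size;
		r.len = MAX_FILE_OFFSET - new_size;
	} else {
		int64_t last_byte = old_size > 0 ? old_size - 1 : 0;

		r.off = last_byte - (last_byte % blocksize);
		r.len = blocksize;
	}
	return r;
}
```

With 4096 byte blocks, shrinking a 10000 byte file to 4096 bytes gives a
range starting at 4096 and running to the maximum offset, while growing a
5000 byte file gives the single block starting at 4096.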

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-09 22:58 ` [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space ira.weiny
@ 2019-08-09 23:52   ` Dave Chinner
  2019-08-12 17:36     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-09 23:52 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> In order to support an opt-in policy for users to allow long term pins
> of FS DAX pages we need to export the LAYOUT lease to user space.
> 
> This is the first of 2 new lease flags which must be used to allow a
> long term pin to be made on a file.
> 
> After the complete series:
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>    layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>    encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> 4) The user has the option of holding the lease or releasing it.  If they
>    release it no other pin calls will work on the file.
> 
> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>    depending on the internal subsystem needs.  With RDMA there is an owning
>    file which is related to the pined file.
> 
> 8) Only RDMA is currently supported
> 
> 9) Truncation of pages which are not actively pinned nor covered by a lease
>    will succeed.

This has nothing to do with layout leases or what they provide
access arbitration over. Layout leases have _nothing_ to do with
page pinning or RDMA - they arbitrate the behaviour of the file
offset -> physical block device mapping within the filesystem and
the behaviour that will occur when a specific lease is held.

The commit description needs to describe what F_LAYOUT leases
actually protect, when they'll get broken, etc, not how RDMA is
going to use them.

> @@ -2022,8 +2030,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
>  	struct file_lock *fl;
>  	struct fasync_struct *new;
>  	int error;
> +	unsigned int flags = 0;
> +
> +	/*
> +	 * NOTE on F_LAYOUT lease
> +	 *
> +	 * LAYOUT lease types are taken on files which the user knows that
> +	 * they will be pinning in memory for some indeterminate amount of
> +	 * time.

Indeed, layout leases have nothing to do with pinning of memory.
That's something an application that uses layout leases might do,
but it is largely irrelevant to the functionality layout leases
provide. What needs to be done here is explain what the layout lease
API actually guarantees w.r.t. the physical file layout, not what
some application is going to do with a lease. e.g.

	The layout lease F_RDLCK guarantees that the holder will be
	notified that the physical file layout is about to be
	changed, and that it needs to release any resources it has
	over the range of this lease, drop the lease and then
	request it again to wait for the kernel to finish whatever
	it is doing on that range.

	The layout lease F_RDLCK also allows the holder to modify
	the physical layout of the file. If an operation from the
	lease holder occurs that would modify the layout, that lease
	holder does not get notification that a change will occur,
	but it will block until all other F_RDLCK leases have been
	released by their holders before going ahead.

	If there is a F_WRLCK lease held on the file, then a F_RDLCK
	holder will fail any operation that may modify the physical
	layout of the file. F_WRLCK provides exclusive physical
	modification access to the holder, guaranteeing nothing else
	will change the layout of the file while it holds the lease.

	The F_WRLCK holder can change the physical layout of the
	file if it so desires; this will block while F_RDLCK holders
	are notified and release their leases before the
	modification will take place.

We need to define the semantics we expose to userspace first.....
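
As a toy model only, the rules sketched above for a lease holder that
wants to modify the layout could be written down like this. Nothing here
is kernel code: enum lease, enum outcome and layout_change_outcome() are
hypothetical names, and the model covers lease holders only, since the
text above does not spell out what happens to a caller with no lease.

```c
#include <stdbool.h>

/* Lease held by the caller; illustrative values, not the fcntl.h ones. */
enum lease {
	LEASE_RD,	/* caller holds an F_RDLCK layout lease */
	LEASE_WR,	/* caller holds the exclusive F_WRLCK layout lease */
};

enum outcome {
	OUT_PROCEED,		/* nobody else to wait for */
	OUT_WAIT_FOR_READERS,	/* block until other F_RDLCK holders release */
	OUT_FAIL,		/* someone else holds F_WRLCK */
};

/*
 * What happens when a lease holder issues an operation that would
 * change the physical layout of the file.
 */
static enum outcome layout_change_outcome(enum lease mine,
					   int other_readers,
					   bool other_writer)
{
	/* An F_RDLCK holder fails while another holder has F_WRLCK. */
	if (mine == LEASE_RD && other_writer)
		return OUT_FAIL;

	/* Either kind of holder waits for the other F_RDLCK leases. */
	if (other_readers > 0)
		return OUT_WAIT_FOR_READERS;

	return OUT_PROCEED;
}
```

The point of writing it this way is that the table is small enough to be
documented in full before any of it is exposed to userspace.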

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure
  2019-08-09 22:58 ` [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure ira.weiny
@ 2019-08-10  0:06   ` John Hubbard
  0 siblings, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-08-10  0:06 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Some subsystems need to pass owning file information to GUP calls to
> allow for GUP to associate the "owning file" to any files being pinned
> within the GUP call.
> 
> Introduce an object to specify this information and pass it down through
> some of the GUP call stack.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  include/linux/mm.h |  9 +++++++++
>  mm/gup.c           | 36 ++++++++++++++++++++++--------------
>  2 files changed, 31 insertions(+), 14 deletions(-)
> 

Looks good, although you may want to combine it with the next patch. 
Otherwise it feels like a "to be continued" when you're reading them.

Either way, though:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 04f22722b374..befe150d17be 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -971,6 +971,15 @@ static inline bool is_zone_device_page(const struct page *page)
>  }
>  #endif
>  
> +/**
> + * @f_owner The file who "owns this GUP"
> + * @mm The mm who "owns this GUP"
> + */
> +struct vaddr_pin {
> +	struct file *f_owner;
> +	struct mm_struct *mm;
> +};
> +
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
>  void __put_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> diff --git a/mm/gup.c b/mm/gup.c
> index 0b05e22ac05f..7a449500f0a6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1005,7 +1005,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  						struct page **pages,
>  						struct vm_area_struct **vmas,
>  						int *locked,
> -						unsigned int flags)
> +						unsigned int flags,
> +						struct vaddr_pin *vaddr_pin)
>  {
>  	long ret, pages_done;
>  	bool lock_dropped;
> @@ -1165,7 +1166,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  
>  	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>  				       locked,
> -				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
> +				       gup_flags | FOLL_TOUCH | FOLL_REMOTE,
> +				       NULL);
>  }
>  EXPORT_SYMBOL(get_user_pages_remote);
>  
> @@ -1320,7 +1322,8 @@ static long __get_user_pages_locked(struct task_struct *tsk,
>  		struct mm_struct *mm, unsigned long start,
>  		unsigned long nr_pages, struct page **pages,
>  		struct vm_area_struct **vmas, int *locked,
> -		unsigned int foll_flags)
> +		unsigned int foll_flags,
> +		struct vaddr_pin *vaddr_pin)
>  {
>  	struct vm_area_struct *vma;
>  	unsigned long vm_flags;
> @@ -1504,7 +1507,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
>  		 */
>  		nr_pages = __get_user_pages_locked(tsk, mm, start, nr_pages,
>  						   pages, vmas, NULL,
> -						   gup_flags);
> +						   gup_flags, NULL);
>  
>  		if ((nr_pages > 0) && migrate_allow) {
>  			drain_allow = true;
> @@ -1537,7 +1540,8 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  				  unsigned long nr_pages,
>  				  struct page **pages,
>  				  struct vm_area_struct **vmas,
> -				  unsigned int gup_flags)
> +				  unsigned int gup_flags,
> +				  struct vaddr_pin *vaddr_pin)
>  {
>  	struct vm_area_struct **vmas_tmp = vmas;
>  	unsigned long flags = 0;
> @@ -1558,7 +1562,7 @@ static long __gup_longterm_locked(struct task_struct *tsk,
>  	}
>  
>  	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
> -				     vmas_tmp, NULL, gup_flags);
> +				     vmas_tmp, NULL, gup_flags, vaddr_pin);
>  
>  	if (gup_flags & FOLL_LONGTERM) {
>  		memalloc_nocma_restore(flags);
> @@ -1588,10 +1592,11 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
>  						  unsigned long nr_pages,
>  						  struct page **pages,
>  						  struct vm_area_struct **vmas,
> -						  unsigned int flags)
> +						  unsigned int flags,
> +						  struct vaddr_pin *vaddr_pin)
>  {
>  	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> -				       NULL, flags);
> +				       NULL, flags, vaddr_pin);
>  }
>  #endif /* CONFIG_FS_DAX || CONFIG_CMA */
>  
> @@ -1607,7 +1612,8 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  		struct vm_area_struct **vmas)
>  {
>  	return __gup_longterm_locked(current, current->mm, start, nr_pages,
> -				     pages, vmas, gup_flags | FOLL_TOUCH);
> +				     pages, vmas, gup_flags | FOLL_TOUCH,
> +				     NULL);
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> @@ -1647,7 +1653,7 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
>  
>  	return __get_user_pages_locked(current, current->mm, start, nr_pages,
>  				       pages, NULL, locked,
> -				       gup_flags | FOLL_TOUCH);
> +				       gup_flags | FOLL_TOUCH, NULL);
>  }
>  EXPORT_SYMBOL(get_user_pages_locked);
>  
> @@ -1684,7 +1690,7 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  
>  	down_read(&mm->mmap_sem);
>  	ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
> -				      &locked, gup_flags | FOLL_TOUCH);
> +				      &locked, gup_flags | FOLL_TOUCH, NULL);
>  	if (locked)
>  		up_read(&mm->mmap_sem);
>  	return ret;
> @@ -2377,7 +2383,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  EXPORT_SYMBOL_GPL(__get_user_pages_fast);
>  
>  static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> -				   unsigned int gup_flags, struct page **pages)
> +				   unsigned int gup_flags, struct page **pages,
> +				   struct vaddr_pin *vaddr_pin)
>  {
>  	int ret;
>  
> @@ -2389,7 +2396,8 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  		down_read(&current->mm->mmap_sem);
>  		ret = __gup_longterm_locked(current, current->mm,
>  					    start, nr_pages,
> -					    pages, NULL, gup_flags);
> +					    pages, NULL, gup_flags,
> +					    vaddr_pin);
>  		up_read(&current->mm->mmap_sem);
>  	} else {
>  		ret = get_user_pages_unlocked(start, nr_pages,
> @@ -2448,7 +2456,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  		pages += nr;
>  
>  		ret = __gup_longterm_unlocked(start, nr_pages - nr,
> -					      gup_flags, pages);
> +					      gup_flags, pages, NULL);
>  
>  		/* Have to be a bit careful with return values */
>  		if (nr > 0) {
> 


* Re: [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast
  2019-08-09 22:58 ` [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast ira.weiny
@ 2019-08-10  0:06   ` John Hubbard
  0 siblings, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-08-10  0:06 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Internally GUP fast needs to know that fast users will not support file
> pins.  Pass NULL for vaddr_pin through the fast call stack so that the
> pin code can return an error if it encounters file backed memory within
> the address range.
> 

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  mm/gup.c | 65 ++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 40 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 7a449500f0a6..504af3e9a942 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1813,7 +1813,8 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>  
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +			 unsigned int flags, struct page **pages, int *nr,
> +			 struct vaddr_pin *vaddr_pin)
>  {
>  	struct dev_pagemap *pgmap = NULL;
>  	int nr_start = *nr, ret = 0;
> @@ -1894,7 +1895,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>   */
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +			 unsigned int flags, struct page **pages, int *nr,
> +			 struct vaddr_pin *vaddr_pin)
>  {
>  	return 0;
>  }
> @@ -1903,7 +1905,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr,
> -		unsigned int flags)
> +		unsigned int flags, struct vaddr_pin *vaddr_pin)
>  {
>  	int nr_start = *nr;
>  	struct dev_pagemap *pgmap = NULL;
> @@ -1938,13 +1940,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr,
> -		unsigned int flags)
> +		unsigned int flags, struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long fault_pfn;
>  	int nr_start = *nr;
>  
>  	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
> +	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags,
> +			       vaddr_pin))
>  		return 0;
>  
>  	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1957,13 +1960,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr,
> -		unsigned int flags)
> +		unsigned int flags, struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long fault_pfn;
>  	int nr_start = *nr;
>  
>  	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags))
> +	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr, flags,
> +			       vaddr_pin))
>  		return 0;
>  
>  	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1975,7 +1979,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr,
> -		unsigned int flags)
> +		unsigned int flags, struct vaddr_pin *vaddr_pin)
>  {
>  	BUILD_BUG();
>  	return 0;
> @@ -1983,7 +1987,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  		unsigned long end, struct page **pages, int *nr,
> -		unsigned int flags)
> +		unsigned int flags, struct vaddr_pin *vaddr_pin)
>  {
>  	BUILD_BUG();
>  	return 0;
> @@ -2075,7 +2079,8 @@ static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
>  #endif /* CONFIG_ARCH_HAS_HUGEPD */
>  
>  static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> -		unsigned long end, unsigned int flags, struct page **pages, int *nr)
> +		unsigned long end, unsigned int flags, struct page **pages,
> +		int *nr, struct vaddr_pin *vaddr_pin)
>  {
>  	struct page *head, *page;
>  	int refs;
> @@ -2087,7 +2092,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		if (unlikely(flags & FOLL_LONGTERM))
>  			return 0;
>  		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr,
> -					     flags);
> +					     flags, vaddr_pin);
>  	}
>  
>  	refs = 0;
> @@ -2117,7 +2122,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> -		unsigned long end, unsigned int flags, struct page **pages, int *nr)
> +		unsigned long end, unsigned int flags, struct page **pages, int *nr,
> +		struct vaddr_pin *vaddr_pin)
>  {
>  	struct page *head, *page;
>  	int refs;
> @@ -2129,7 +2135,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		if (unlikely(flags & FOLL_LONGTERM))
>  			return 0;
>  		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr,
> -					     flags);
> +					     flags, vaddr_pin);
>  	}
>  
>  	refs = 0;
> @@ -2196,7 +2202,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  }
>  
>  static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> -		unsigned int flags, struct page **pages, int *nr)
> +		unsigned int flags, struct page **pages, int *nr,
> +		struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long next;
>  	pmd_t *pmdp;
> @@ -2220,7 +2227,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  				return 0;
>  
>  			if (!gup_huge_pmd(pmd, pmdp, addr, next, flags,
> -				pages, nr))
> +				pages, nr, vaddr_pin))
>  				return 0;
>  
>  		} else if (unlikely(is_hugepd(__hugepd(pmd_val(pmd))))) {
> @@ -2231,7 +2238,8 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
>  					 PMD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> +		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr,
> +					  vaddr_pin))
>  			return 0;
>  	} while (pmdp++, addr = next, addr != end);
>  
> @@ -2239,7 +2247,8 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  }
>  
>  static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +			 unsigned int flags, struct page **pages, int *nr,
> +			 struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long next;
>  	pud_t *pudp;
> @@ -2253,13 +2262,14 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  			return 0;
>  		if (unlikely(pud_huge(pud))) {
>  			if (!gup_huge_pud(pud, pudp, addr, next, flags,
> -					  pages, nr))
> +					  pages, nr, vaddr_pin))
>  				return 0;
>  		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
>  			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>  					 PUD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr,
> +					  vaddr_pin))
>  			return 0;
>  	} while (pudp++, addr = next, addr != end);
>  
> @@ -2267,7 +2277,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  }
>  
>  static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> -			 unsigned int flags, struct page **pages, int *nr)
> +			 unsigned int flags, struct page **pages, int *nr,
> +			 struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long next;
>  	p4d_t *p4dp;
> @@ -2284,7 +2295,8 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>  					 P4D_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr,
> +					  vaddr_pin))
>  			return 0;
>  	} while (p4dp++, addr = next, addr != end);
>  
> @@ -2292,7 +2304,8 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  }
>  
>  static void gup_pgd_range(unsigned long addr, unsigned long end,
> -		unsigned int flags, struct page **pages, int *nr)
> +		unsigned int flags, struct page **pages, int *nr,
> +		struct vaddr_pin *vaddr_pin)
>  {
>  	unsigned long next;
>  	pgd_t *pgdp;
> @@ -2312,7 +2325,8 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>  					 PGDIR_SHIFT, next, flags, pages, nr))
>  				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr,
> +					  vaddr_pin))
>  			return;
>  	} while (pgdp++, addr = next, addr != end);
>  }
> @@ -2374,7 +2388,8 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
>  		local_irq_save(flags);
> -		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
> +		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr,
> +			      NULL);
>  		local_irq_restore(flags);
>  	}
>  
> @@ -2445,7 +2460,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
>  		local_irq_disable();
> -		gup_pgd_range(addr, end, gup_flags, pages, &nr);
> +		gup_pgd_range(addr, end, gup_flags, pages, &nr, NULL);
>  		local_irq_enable();
>  		ret = nr;
>  	}
> 


* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
@ 2019-08-10  0:09   ` John Hubbard
  2019-08-12 21:00     ` Ira Weiny
  2019-08-11 23:07   ` John Hubbard
  2019-08-12 12:28   ` Jason Gunthorpe
  2 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-10  0:09 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> pages.
> 
> In addition subsystems such as RDMA require new information to be passed
> to the GUP interface to track file owning information.  As such a simple
> FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> 
> Introduce a new GUP like call which takes the newly introduced vaddr_pin
> information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> call will result in a failure if pins were created on files during the
> pin operation.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from list:
> 	Change to vaddr_put_pages_dirty_lock
> 	Change to vaddr_unpin_pages_dirty_lock
> 
>  include/linux/mm.h |  5 ++++
>  mm/gup.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 657c947bda49..90c5802866df 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1603,6 +1603,11 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
>  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>  			struct task_struct *task, bool bypass_rlim);
>  
> +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> +		     unsigned int gup_flags, struct page **pages,
> +		     struct vaddr_pin *vaddr_pin);
> +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> +				  struct vaddr_pin *vaddr_pin, bool make_dirty);

Hi Ira,

OK, the API seems fine to me, anyway. :)

A bit more below...

>  bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
>  
>  /* Container for pinned pfns / pages */
> diff --git a/mm/gup.c b/mm/gup.c
> index eeaa0ddd08a6..6d23f70d7847 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2536,3 +2536,62 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> +
> +/**
> + * vaddr_pin_pages pin pages by virtual address and return the pages to the
> + * user.
> + *
> + * @addr, start address

What's with the commas? I thought kernel-doc wants colons, like this, right?

@addr: start address
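
For reference, a minimal sketch of the colon form on a complete comment,
built outside the kernel tree. mock_vaddr_pin_check() is a hypothetical
helper that only mirrors the argument check from the patch; the struct
layout is copied from the earlier vaddr_pin patch, and struct file and
struct mm_struct are left as opaque stand-ins.

```c
#include <errno.h>
#include <stddef.h>

/* Opaque stand-ins so the sketch compiles outside the kernel tree. */
struct file;
struct mm_struct;

struct vaddr_pin {
	struct file *f_owner;
	struct mm_struct *mm;
};

/**
 * mock_vaddr_pin_check() - validate a vaddr_pin before pinning
 * @vaddr_pin: initialized meta information the pin is to be associated with
 *
 * Hypothetical helper mirroring the check in vaddr_pin_pages(): the
 * object must exist and name either an owning file or an owning mm.
 *
 * Return: 0 if the vaddr_pin is usable, -EINVAL otherwise.
 */
static int mock_vaddr_pin_check(const struct vaddr_pin *vaddr_pin)
{
	if (!vaddr_pin || (!vaddr_pin->mm && !vaddr_pin->f_owner))
		return -EINVAL;
	return 0;
}
```

The name line takes "function() - description", each parameter takes
"@name: description", and the return value goes under "Return:".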


> + * @nr_pages, number of pages to pin
> + * @gup_flags, flags to use for the pin
> + * @pages, array of pages returned
> + * @vaddr_pin, initalized meta information this pin is to be associated
> + * with.
> + *
> + * NOTE regarding vaddr_pin:
> + *
> + * Some callers can share pins via file descriptors to other processes.
> + * Callers such as this should use the f_owner field of vaddr_pin to indicate
> + * the file the fd points to.  All other callers should use the mm this pin is
> + * being made against.  Usually "current->mm".
> + *
> + * Expects mmap_sem to be read locked.
> + */
> +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> +		     unsigned int gup_flags, struct page **pages,
> +		     struct vaddr_pin *vaddr_pin)
> +{
> +	long ret;
> +
> +	gup_flags |= FOLL_LONGTERM;


Is now the right time to introduce and use FOLL_PIN? If not, then I can always
add it on top of this later, as part of gup-tracking patches. But you did point
out that FOLL_LONGTERM is taking on additional meaning, and so maybe it's better
to split that meaning up right from the start.


> +
> +	if (!vaddr_pin || (!vaddr_pin->mm && !vaddr_pin->f_owner))
> +		return -EINVAL;
> +
> +	ret = __gup_longterm_locked(current,
> +				    vaddr_pin->mm,
> +				    addr, nr_pages,
> +				    pages, NULL, gup_flags,
> +				    vaddr_pin);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vaddr_pin_pages);
> +
> +/**
> + * vaddr_unpin_pages_dirty_lock - counterpart to vaddr_pin_pages
> + *
> + * @pages, array of pages returned
> + * @nr_pages, number of pages in pages
> + * @vaddr_pin, same information passed to vaddr_pin_pages
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * The semantics are similar to put_user_pages_dirty_lock but a vaddr_pin used
> + * in vaddr_pin_pages should be passed back into this call for propper

Typo:
                                                                  proper

> + * tracking.
> + */
> +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> +				  struct vaddr_pin *vaddr_pin, bool make_dirty)
> +{
> +	__put_user_pages_dirty_lock(vaddr_pin, pages, nr_pages, make_dirty);
> +}
> +EXPORT_SYMBOL(vaddr_unpin_pages_dirty_lock);
> 

OK, whew, I'm glad to see the updated _dirty_lock() API used here. :)

thanks,
-- 
John Hubbard
NVIDIA


* Re: [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack
  2019-08-09 22:58 ` [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack ira.weiny
@ 2019-08-10  0:18   ` John Hubbard
  2019-08-12 19:01     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-10  0:18 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> In preparation for passing more information (vaddr_pin) into
> follow_page_pte(), follow_devmap_pud(), and follow_devmap_pmd().
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  include/linux/huge_mm.h | 17 -----------------
>  mm/gup.c                | 31 +++++++++++++++----------------
>  mm/huge_memory.c        |  6 ++++--
>  mm/internal.h           | 28 ++++++++++++++++++++++++++++
>  4 files changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 45ede62aa85b..b01a20ce0bb9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -233,11 +233,6 @@ static inline int hpage_nr_pages(struct page *page)
>  	return 1;
>  }
>  
> -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> -		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
> -struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> -		pud_t *pud, int flags, struct dev_pagemap **pgmap);
> -
>  extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>  
>  extern struct page *huge_zero_page;
> @@ -375,18 +370,6 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm)
>  	return;
>  }
>  
> -static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
> -	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
> -{
> -	return NULL;
> -}
> -
> -static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
> -	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
> -{
> -	return NULL;
> -}
> -
>  static inline bool thp_migration_supported(void)
>  {
>  	return false;
> diff --git a/mm/gup.c b/mm/gup.c
> index 504af3e9a942..a7a9d2f5278c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -24,11 +24,6 @@
>  
>  #include "internal.h"
>  
> -struct follow_page_context {
> -	struct dev_pagemap *pgmap;
> -	unsigned int page_mask;
> -};
> -
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -172,8 +167,9 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		unsigned long address, pmd_t *pmd, unsigned int flags,
> -		struct dev_pagemap **pgmap)
> +		struct follow_page_context *ctx)
>  {
> +	struct dev_pagemap **pgmap = &ctx->pgmap;
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page;
>  	spinlock_t *ptl;
> @@ -363,13 +359,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  	}
>  	if (pmd_devmap(pmdval)) {
>  		ptl = pmd_lock(mm, pmd);
> -		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
> +		page = follow_devmap_pmd(vma, address, pmd, flags, ctx);
>  		spin_unlock(ptl);
>  		if (page)
>  			return page;
>  	}
>  	if (likely(!pmd_trans_huge(pmdval)))
> -		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> +		return follow_page_pte(vma, address, pmd, flags, ctx);
>  
>  	if ((flags & FOLL_NUMA) && pmd_protnone(pmdval))
>  		return no_page_table(vma, flags);
> @@ -389,7 +385,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  	}
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> +		return follow_page_pte(vma, address, pmd, flags, ctx);
>  	}
>  	if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
>  		int ret;
> @@ -419,7 +415,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
>  		}
>  
>  		return ret ? ERR_PTR(ret) :
> -			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> +			follow_page_pte(vma, address, pmd, flags, ctx);
>  	}
>  	page = follow_trans_huge_pmd(vma, address, pmd, flags);
>  	spin_unlock(ptl);
> @@ -456,7 +452,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
>  	}
>  	if (pud_devmap(*pud)) {
>  		ptl = pud_lock(mm, pud);
> -		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
> +		page = follow_devmap_pud(vma, address, pud, flags, ctx);
>  		spin_unlock(ptl);
>  		if (page)
>  			return page;
> @@ -786,7 +782,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned long start, unsigned long nr_pages,
>  		unsigned int gup_flags, struct page **pages,
> -		struct vm_area_struct **vmas, int *nonblocking)
> +		struct vm_area_struct **vmas, int *nonblocking,
> +		struct vaddr_pin *vaddr_pin)

I didn't expect to see more vaddr_pin arg passing, based on the commit
description. Did you want this as part of patch 9 or 10 instead? If not,
then let's mention it in the commit description.

>  {
>  	long ret = 0, i = 0;
>  	struct vm_area_struct *vma = NULL;
> @@ -797,6 +794,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  
>  	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
>  
> +	ctx.vaddr_pin = vaddr_pin;
> +
>  	/*
>  	 * If FOLL_FORCE is set then do not force a full fault as the hinting
>  	 * fault information is unrelated to the reference behaviour of a task
> @@ -1025,7 +1024,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  	lock_dropped = false;
>  	for (;;) {
>  		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> -				       vmas, locked);
> +				       vmas, locked, vaddr_pin);
>  		if (!locked)
>  			/* VM_FAULT_RETRY couldn't trigger, bypass */
>  			return ret;
> @@ -1068,7 +1067,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		lock_dropped = true;
>  		down_read(&mm->mmap_sem);
>  		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> -				       pages, NULL, NULL);
> +				       pages, NULL, NULL, vaddr_pin);
>  		if (ret != 1) {
>  			BUG_ON(ret > 1);
>  			if (!pages_done)
> @@ -1226,7 +1225,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
>  	 * not result in a stack expansion that recurses back here.
>  	 */
>  	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
> -				NULL, NULL, nonblocking);
> +				NULL, NULL, nonblocking, NULL);
>  }
>  
>  /*
> @@ -1311,7 +1310,7 @@ struct page *get_dump_page(unsigned long addr)
>  
>  	if (__get_user_pages(current, current->mm, addr, 1,
>  			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
> -			     NULL) < 1)
> +			     NULL, NULL) < 1)
>  		return NULL;
>  	flush_cache_page(vma, addr, page_to_pfn(page));
>  	return page;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bc1a07a55be1..7e09f2f17ed8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -916,8 +916,9 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>  }
>  
>  struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> -		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
> +		pmd_t *pmd, int flags, struct follow_page_context *ctx)
>  {
> +	struct dev_pagemap **pgmap = &ctx->pgmap;
>  	unsigned long pfn = pmd_pfn(*pmd);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page;
> @@ -1068,8 +1069,9 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
>  }
>  
>  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> -		pud_t *pud, int flags, struct dev_pagemap **pgmap)
> +		pud_t *pud, int flags, struct follow_page_context *ctx)
>  {
> +	struct dev_pagemap **pgmap = &ctx->pgmap;
>  	unsigned long pfn = pud_pfn(*pud);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page;
> diff --git a/mm/internal.h b/mm/internal.h
> index 0d5f720c75ab..46ada5279856 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -12,6 +12,34 @@
>  #include <linux/pagemap.h>
>  #include <linux/tracepoint-defs.h>
>  
> +struct follow_page_context {
> +	struct dev_pagemap *pgmap;
> +	unsigned int page_mask;
> +	struct vaddr_pin *vaddr_pin;
> +};
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> +		pmd_t *pmd, int flags, struct follow_page_context *ctx);
> +struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> +		pud_t *pud, int flags, struct follow_page_context *ctx);
> +#else
> +static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
> +	unsigned long addr, pmd_t *pmd, int flags,
> +	struct follow_page_context *ctx)
> +{
> +	return NULL;
> +}
> +
> +static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
> +	unsigned long addr, pud_t *pud, int flags,
> +	struct follow_page_context *ctx)
> +{
> +	return NULL;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +
>  /*
>   * The set of flags that only affect watermark checking and reclaim
>   * behaviour. This is used by the MM to obey the caller constraints
> 




thanks,
-- 
John Hubbard
NVIDIA


* Re: [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct
  2019-08-09 22:58 ` [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct ira.weiny
@ 2019-08-10  0:30   ` John Hubbard
  2019-08-12 20:46     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-10  0:30 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Once callers start to use vaddr_pin the put_user_pages calls will need
> to have access to this data coming in.  Prep put_user_pages() for this
> data.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  include/linux/mm.h |  20 +-------
>  mm/gup.c           | 122 ++++++++++++++++++++++++++++++++-------------
>  2 files changed, 88 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index befe150d17be..9d37cafbef9a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1064,25 +1064,7 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> - *
> - * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> - *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with get_user_page() calls.
> - */
> -static inline void put_user_page(struct page *page)
> -{
> -	put_page(page);
> -}
> +void put_user_page(struct page *page);
>  
>  void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  			       bool make_dirty);
> diff --git a/mm/gup.c b/mm/gup.c
> index a7a9d2f5278c..10cfd30ff668 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -24,30 +24,41 @@
>  
>  #include "internal.h"
>  
> -/**
> - * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> - * @pages:  array of pages to be maybe marked dirty, and definitely released.

A couple comments from our circular review chain: some fellow with the same
last name as you, recommended wording it like this:

      @pages:  array of pages to be put

> - * @npages: number of pages in the @pages array.
> - * @make_dirty: whether to mark the pages dirty
> - *
> - * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> - * variants called on that page.
> - *
> - * For each page in the @pages array, make that page (or its head page, if a
> - * compound page) dirty, if @make_dirty is true, and if the page was previously
> - * listed as clean. In any case, releases all pages using put_user_page(),
> - * possibly via put_user_pages(), for the non-dirty case.
> - *
> - * Please see the put_user_page() documentation for details.
> - *
> - * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> - * required, then the caller should a) verify that this is really correct,
> - * because _lock() is usually required, and b) hand code it:
> - * set_page_dirty_lock(), put_user_page().
> - *
> - */
> -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> -			       bool make_dirty)
> +static void __put_user_page(struct vaddr_pin *vaddr_pin, struct page *page)
> +{
> +	page = compound_head(page);
> +
> +	/*
> +	 * For devmap managed pages we need to catch refcount transition from
> +	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> +	 * page is free and we need to inform the device driver through
> +	 * callback. See include/linux/memremap.h and HMM for details.
> +	 */
> +	if (put_devmap_managed_page(page))
> +		return;
> +
> +	if (put_page_testzero(page))
> +		__put_page(page);
> +}
> +
> +static void __put_user_pages(struct vaddr_pin *vaddr_pin, struct page **pages,
> +			     unsigned long npages)
> +{
> +	unsigned long index;
> +
> +	/*
> +	 * TODO: this can be optimized for huge pages: if a series of pages is
> +	 * physically contiguous and part of the same compound page, then a
> +	 * single operation to the head page should suffice.
> +	 */

As discussed in the other review thread (""), let's just delete that comment,
as long as you're moving things around.


> +	for (index = 0; index < npages; index++)
> +		__put_user_page(vaddr_pin, pages[index]);
> +}
> +
> +static void __put_user_pages_dirty_lock(struct vaddr_pin *vaddr_pin,
> +					struct page **pages,
> +					unsigned long npages,
> +					bool make_dirty)

Elsewhere in this series, we pass vaddr_pin at the end of the arg list.
Here we pass it at the beginning, and it caused a minor jar when reading it.
Obviously just bike shedding at this point, though. Either way. :)
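
To make the ordering point concrete, here is a small user-space sketch of the wrapper pattern, with the context argument placed last as it is elsewhere in the series. Everything in it (pin_ctx, do_model_put_pages(), model_put_pages(), model_unpin_pages()) is a hypothetical model, not the kernel code: legacy callers pass no context and the internal helper tolerates NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct vaddr_pin. */
struct pin_ctx {
	unsigned long released; /* pages released against this context */
};

/* Internal helper: context goes last and may be NULL for legacy callers. */
static unsigned long do_model_put_pages(int *pages, unsigned long npages,
					struct pin_ctx *ctx)
{
	unsigned long index;

	for (index = 0; index < npages; index++)
		pages[index] = 0; /* models dropping one page reference */
	if (ctx)
		ctx->released += npages;
	return npages;
}

/* Legacy entry point: no context to track against. */
unsigned long model_put_pages(int *pages, unsigned long npages)
{
	return do_model_put_pages(pages, npages, NULL);
}

/* Context-aware entry point: same argument order, context at the end. */
unsigned long model_unpin_pages(int *pages, unsigned long npages,
				struct pin_ctx *ctx)
{
	return do_model_put_pages(pages, npages, ctx);
}
```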

>  {
>  	unsigned long index;
>  
> @@ -58,7 +69,7 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  	 */
>  
>  	if (!make_dirty) {
> -		put_user_pages(pages, npages);
> +		__put_user_pages(vaddr_pin, pages, npages);
>  		return;
>  	}
>  
> @@ -86,9 +97,58 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  		 */
>  		if (!PageDirty(page))
>  			set_page_dirty_lock(page);
> -		put_user_page(page);
> +		__put_user_page(vaddr_pin, page);
>  	}
>  }
> +
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely handled. In
> + * particular, interactions with RDMA and filesystems need special
> + * handling.
> + *
> + * put_user_page() and put_page() are not interchangeable, despite this early
> + * implementation that makes them look the same. put_user_page() calls must
> + * be perfectly matched up with get_user_page() calls.
> + */
> +void put_user_page(struct page *page)
> +{
> +	__put_user_page(NULL, page);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
> +/**
> + * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> + * @pages:  array of pages to be maybe marked dirty, and definitely released.

Same here:

      @pages:  array of pages to be put

> + * @npages: number of pages in the @pages array.
> + * @make_dirty: whether to mark the pages dirty
> + *
> + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> + * variants called on that page.
> + *
> + * For each page in the @pages array, make that page (or its head page, if a
> + * compound page) dirty, if @make_dirty is true, and if the page was previously
> + * listed as clean. In any case, releases all pages using put_user_page(),
> + * possibly via put_user_pages(), for the non-dirty case.
> + *
> + * Please see the put_user_page() documentation for details.
> + *
> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> + * required, then the caller should a) verify that this is really correct,
> + * because _lock() is usually required, and b) hand code it:
> + * set_page_dirty_lock(), put_user_page().
> + *
> + */
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> +			       bool make_dirty)
> +{
> +	__put_user_pages_dirty_lock(NULL, pages, npages, make_dirty);
> +}
>  EXPORT_SYMBOL(put_user_pages_dirty_lock);
>  
>  /**
> @@ -102,15 +162,7 @@ EXPORT_SYMBOL(put_user_pages_dirty_lock);
>   */
>  void put_user_pages(struct page **pages, unsigned long npages)
>  {
> -	unsigned long index;
> -
> -	/*
> -	 * TODO: this can be optimized for huge pages: if a series of pages is
> -	 * physically contiguous and part of the same compound page, then a
> -	 * single operation to the head page should suffice.
> -	 */
> -	for (index = 0; index < npages; index++)
> -		put_user_page(pages[index]);
> +	__put_user_pages(NULL, pages, npages);
>  }
>  EXPORT_SYMBOL(put_user_pages);
>  
> 

This all looks pretty good, so regardless of the outcome of the minor
points above,
   
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
  2019-08-10  0:09   ` John Hubbard
@ 2019-08-11 23:07   ` John Hubbard
  2019-08-12 21:01     ` Ira Weiny
  2019-08-12 12:28   ` Jason Gunthorpe
  2 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-11 23:07 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> pages.
> 
> In addition subsystems such as RDMA require new information to be passed
> to the GUP interface to track file owning information.  As such a simple
> FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> 
> Introduce a new GUP like call which takes the newly introduced vaddr_pin
> information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> call will result in a failure if pins were created on files during the
> pin operation.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 

I'm creating a new call site conversion series, to replace the 
"put_user_pages(): miscellaneous call sites" series. This uses
vaddr_pin_pages*() where appropriate. So it's based on your series here.

btw, while doing that, I noticed one more typo while re-reading some of the comments. 
Thought you probably want to collect them all for the next spin. Below...

> ---
> Changes from list:
> 	Change to vaddr_put_pages_dirty_lock
> 	Change to vaddr_unpin_pages_dirty_lock
> 
>  include/linux/mm.h |  5 ++++
>  mm/gup.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 657c947bda49..90c5802866df 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1603,6 +1603,11 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
>  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>  			struct task_struct *task, bool bypass_rlim);
>  
> +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> +		     unsigned int gup_flags, struct page **pages,
> +		     struct vaddr_pin *vaddr_pin);
> +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> +				  struct vaddr_pin *vaddr_pin, bool make_dirty);
>  bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
>  
>  /* Container for pinned pfns / pages */
> diff --git a/mm/gup.c b/mm/gup.c
> index eeaa0ddd08a6..6d23f70d7847 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2536,3 +2536,62 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> +
> +/**
> + * vaddr_pin_pages pin pages by virtual address and return the pages to the
> + * user.
> + *
> + * @addr, start address
> + * @nr_pages, number of pages to pin
> + * @gup_flags, flags to use for the pin
> + * @pages, array of pages returned
> + * @vaddr_pin, initalized meta information this pin is to be associated

Typo:
                  initialized


thanks,
-- 
John Hubbard
NVIDIA


* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
  2019-08-10  0:09   ` John Hubbard
  2019-08-11 23:07   ` John Hubbard
@ 2019-08-12 12:28   ` Jason Gunthorpe
  2019-08-12 21:48     ` Ira Weiny
  2 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-12 12:28 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:29PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> pages.
> 
> In addition subsystems such as RDMA require new information to be passed
> to the GUP interface to track file owning information.  As such a simple
> FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> 
> Introduce a new GUP like call which takes the newly introduced vaddr_pin
> information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> call will result in a failure if pins were created on files during the
> pin operation.

Is this a 'vaddr' in the traditional sense, ie does it work with
something returned by valloc?

Maybe another name would be better?

I also wish GUP like functions took in a 'void __user *' instead of
the unsigned long to make this clear :\

Jason


* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-09 22:58 ` [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object ira.weiny
@ 2019-08-12 13:00   ` Jason Gunthorpe
  2019-08-12 17:28     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-12 13:00 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> In order for MRs to be tracked against the open verbs context the ufile
> needs to have a pointer to hand to the GUP code.
> 
> No references need to be taken as this should be valid for the lifetime
> of the context.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>  drivers/infiniband/core/uverbs.h      | 1 +
>  drivers/infiniband/core/uverbs_main.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> index 1e5aeb39f774..e802ba8c67d6 100644
> +++ b/drivers/infiniband/core/uverbs.h
> @@ -163,6 +163,7 @@ struct ib_uverbs_file {
>  	struct page *disassociate_page;
>  
>  	struct xarray		idr;
> +	struct file             *sys_file; /* backpointer to system file object */
>  };

The 'struct file' has a lifetime strictly shorter than the
ib_uverbs_file, which is kref'd on its own lifetime. Having a back
pointer like this is confounding as it will be invalid for some of the
lifetime of the struct.

Jason


* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-12 13:00   ` Jason Gunthorpe
@ 2019-08-12 17:28     ` Ira Weiny
  2019-08-12 17:56       ` Jason Gunthorpe
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In order for MRs to be tracked against the open verbs context the ufile
> > needs to have a pointer to hand to the GUP code.
> > 
> > No references need to be taken as this should be valid for the lifetime
> > of the context.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >  drivers/infiniband/core/uverbs.h      | 1 +
> >  drivers/infiniband/core/uverbs_main.c | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > index 1e5aeb39f774..e802ba8c67d6 100644
> > +++ b/drivers/infiniband/core/uverbs.h
> > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> >  	struct page *disassociate_page;
> >  
> >  	struct xarray		idr;
> > +	struct file             *sys_file; /* backpointer to system file object */
> >  };
> 
> The 'struct file' has a lifetime strictly shorter than the
> ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> pointer like this is confounding as it will be invalid for some of the
> lifetime of the struct.

Ah...  ok.  I really thought it was the other way around.

__fput() should not call ib_uverbs_close() until the last reference on struct
file is released...  What holds references to struct ib_uverbs_file past that?

Perhaps I need to add this (untested)?

diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f628f9e4c09f..654e774d9cf2 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1125,6 +1125,8 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
        list_del_init(&file->list);
        mutex_unlock(&file->device->lists_mutex);
 
+       file->sys_file = NULL;
+
        kref_put(&file->ref, ib_uverbs_release_file);
 
        return 0;
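
As a rough user-space model of the lifetime question (not kernel code, and every name in it is a hypothetical stand-in): the refcounted object can outlive the file it points back to, so the close path clears the back pointer before dropping its reference, and anything that runs later has to tolerate NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct file and struct ib_uverbs_file. */
struct sys_file_model {
	int fd;
};

struct ufile_model {
	int refcount;                    /* models the kref */
	struct sys_file_model *sys_file; /* back pointer, holds no reference */
};

struct ufile_model *ufile_open(struct sys_file_model *filp)
{
	struct ufile_model *uf = calloc(1, sizeof(*uf));

	if (!uf)
		return NULL;
	uf->refcount = 1;
	uf->sys_file = filp;
	return uf;
}

void ufile_get(struct ufile_model *uf)
{
	uf->refcount++;
}

/* Returns 1 if this put freed the object, 0 otherwise. */
int ufile_put(struct ufile_model *uf)
{
	assert(uf->refcount > 0);
	if (--uf->refcount == 0) {
		free(uf);
		return 1;
	}
	return 0;
}

/* Models the close path: clear the back pointer, then drop the ref. */
int ufile_close(struct ufile_model *uf)
{
	uf->sys_file = NULL;
	return ufile_put(uf);
}

/* Later users must check before dereferencing the back pointer. */
int ufile_has_sys_file(const struct ufile_model *uf)
{
	return uf->sys_file != NULL;
}
```

In this model a second holder of the reference still sees a valid object after close, but ufile_has_sys_file() returns 0 from that point on.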



* Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-09 23:52   ` Dave Chinner
@ 2019-08-12 17:36     ` Ira Weiny
  2019-08-14  8:05       ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 17:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In order to support an opt-in policy for users to allow long term pins
> > of FS DAX pages we need to export the LAYOUT lease to user space.
> > 
> > This is the first of 2 new lease flags which must be used to allow a
> > long term pin to be made on a file.
> > 
> > After the complete series:
> > 
> > 0) Registrations to Device DAX char devs are not affected
> > 
> > 1) The user has to opt in to allowing page pins on a file with an exclusive
> >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > 
> > 2) page pins will fail if the lease is not active when the file back page is
> >    encountered.
> > 
> > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > 
> > 4) The user has the option of holding the lease or releasing it.  If they
> >    release it no other pin calls will work on the file.
> > 
> > 5) Closing the file is ok.
> > 
> > 6) Unmapping the file is ok
> > 
> > 7) Pins against the files are tracked back to an owning file or an owning mm
> >    depending on the internal subsystem needs.  With RDMA there is an owning
> >    file which is related to the pinned file.
> > 
> > 8) Only RDMA is currently supported
> > 
> > 9) Truncation of pages which are not actively pinned nor covered by a lease
> >    will succeed.
> 
> This has nothing to do with layout leases or what they provide
> access arbitration over. Layout leases have _nothing_ to do with
> page pinning or RDMA - they arbitrate behaviour of the file offset ->
> physical block device mapping within the filesystem and the
> behaviour that will occur when a specific lease is held.
> 
> The commit description needs to describe what F_LAYOUT actually
> protects, when they'll get broken, etc, not how RDMA is going to use
> it.

Ok yes I've been lax in mixing the cover letter for the series and this first
commit message.  My apologies.

> 
> > @@ -2022,8 +2030,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> >  	struct file_lock *fl;
> >  	struct fasync_struct *new;
> >  	int error;
> > +	unsigned int flags = 0;
> > +
> > +	/*
> > +	 * NOTE on F_LAYOUT lease
> > +	 *
> > +	 * LAYOUT lease types are taken on files which the user knows that
> > +	 * they will be pinning in memory for some indeterminate amount of
> > +	 * time.
> 
> Indeed, layout leases have nothing to do with pinning of memory.

Yep, Fair enough.  I'll rework the comment.

> That's something an application that uses layout leases might do,
> but it is largely irrelevant to the functionality layout leases
> provide. What needs to be done here is explain what the layout lease
> API actually guarantees w.r.t. the physical file layout, not what
> some application is going to do with a lease. e.g.
> 
> 	The layout lease F_RDLCK guarantees that the holder will be
> 	notified that the physical file layout is about to be
> 	changed, and that it needs to release any resources it has
> 	over the range of this lease, drop the lease and then
> 	request it again to wait for the kernel to finish whatever
> 	it is doing on that range.
> 
> 	The layout lease F_RDLCK also allows the holder to modify
> 	the physical layout of the file. If an operation from the
> 	lease holder occurs that would modify the layout, that lease
> 	holder does not get notification that a change will occur,
> 	but it will block until all other F_RDLCK leases have been
> 	released by their holders before going ahead.
> 
> 	If there is a F_WRLCK lease held on the file, then a F_RDLCK
> 	holder will fail any operation that may modify the physical
> 	layout of the file. F_WRLCK provides exclusive physical
> 	modification access to the holder, guaranteeing nothing else
> 	will change the layout of the file while it holds the lease.
> 
> 	The F_WRLCK holder can change the physical layout of the
> 	file if it so desires, this will block while F_RDLCK holders
> 	are notified and release their leases before the
> 	modification will take place.
> 
> We need to define the semantics we expose to userspace first.....

Agreed.  I believe I have implemented the semantics you describe above.  Do I
have your permission to use your verbiage as part of reworking the comment and
commit message?

Thanks,
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
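
The four paragraphs Dave proposes above can be read as a small decision table
for "what happens when a lease holder tries to change the layout".  Below is a
minimal sketch of that reading, not the patch's implementation.  The names
lease_type, layout_outcome and layout_change_outcome() are hypothetical, and
the assumption that F_WRLCK admits no second writer is mine; a caller that
holds no lease at all is not covered by the proposed text and is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum lease_type { LEASE_RDLCK, LEASE_WRLCK };

enum layout_outcome {
	LAYOUT_PROCEED,          /* nothing to wait for */
	LAYOUT_WAIT_FOR_READERS, /* block until other F_RDLCK leases are released */
	LAYOUT_FAIL              /* operation is refused */
};

/*
 * Outcome of a layout-modifying operation issued by a lease holder.
 * @mine:        lease type held by the caller
 * @other_rdlck: number of F_RDLCK leases held by others
 * @other_wrlck: true if someone else holds F_WRLCK
 */
static enum layout_outcome layout_change_outcome(enum lease_type mine,
						 unsigned int other_rdlck,
						 bool other_wrlck)
{
	if (mine == LEASE_WRLCK) {
		/* Assumption: F_WRLCK is exclusive, so no second writer exists. */
		assert(!other_wrlck);
	} else if (other_wrlck) {
		/* An F_RDLCK holder fails while an F_WRLCK lease is held. */
		return LAYOUT_FAIL;
	}

	/* Other readers are notified; the caller blocks until they release. */
	return other_rdlck > 0 ? LAYOUT_WAIT_FOR_READERS : LAYOUT_PROCEED;
}
```

In this model the caller itself is never notified about its own change; only
the other F_RDLCK holders are, which matches the second and fourth paragraphs
of the proposed wording.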


* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-12 17:28     ` Ira Weiny
@ 2019-08-12 17:56       ` Jason Gunthorpe
  2019-08-12 21:15         ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-12 17:56 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > In order for MRs to be tracked against the open verbs context the ufile
> > > needs to have a pointer to hand to the GUP code.
> > > 
> > > No references need to be taken as this should be valid for the lifetime
> > > of the context.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > >  drivers/infiniband/core/uverbs.h      | 1 +
> > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > >  2 files changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > +++ b/drivers/infiniband/core/uverbs.h
> > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > >  	struct page *disassociate_page;
> > >  
> > >  	struct xarray		idr;
> > > +	struct file             *sys_file; /* backpointer to system file object */
> > >  };
> > 
> > The 'struct file' has a lifetime strictly shorter than the
> > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > pointer like this is confounding as it will be invalid for some of the
> > lifetime of the struct.
> 
> Ah...  ok.  I really thought it was the other way around.
> 
> __fput() should not call ib_uverbs_close() until the last reference on struct
> file is released...  What holds references to struct ib_uverbs_file past that?

Child fds hold onto the internal ib_uverbs_file until they are closed.

> Perhaps I need to add this (untested)?
> 
> diff --git a/drivers/infiniband/core/uverbs_main.c
> b/drivers/infiniband/core/uverbs_main.c
> index f628f9e4c09f..654e774d9cf2 100644
> +++ b/drivers/infiniband/core/uverbs_main.c
> @@ -1125,6 +1125,8 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
>         list_del_init(&file->list);
>         mutex_unlock(&file->device->lists_mutex);
>  
> +       file->sys_file = NULL;

Now this has unlocked updates to that data... you'd need some lock and
a get-not-zero pattern.

Jason
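
A minimal userspace sketch of the "lock plus get-not-zero" pattern mentioned
above, assuming C11 atomics and a pthread mutex as stand-ins for the kernel
primitives.  The names sys_file_model, ufile_model, get_not_zero(),
ufile_get_sys_file() and ufile_close() are hypothetical; they are not part of
the patch or of the uverbs code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct sys_file_model {
	atomic_long count;
};

/* Hypothetical stand-in for struct ib_uverbs_file. */
struct ufile_model {
	pthread_mutex_t lock;            /* protects sys_file */
	struct sys_file_model *sys_file; /* NULL once the fd has been closed */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool get_not_zero(struct sys_file_model *f)
{
	long old = atomic_load(&f->count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&f->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Reader side: returns a referenced file, or NULL if it is gone or dying. */
static struct sys_file_model *ufile_get_sys_file(struct ufile_model *u)
{
	struct sys_file_model *f;

	assert(u != NULL);
	pthread_mutex_lock(&u->lock);
	f = u->sys_file;
	if (f && !get_not_zero(f))
		f = NULL;
	pthread_mutex_unlock(&u->lock);
	return f;
}

/* Close side: clear the back pointer under the same lock. */
static void ufile_close(struct ufile_model *u)
{
	assert(u != NULL);
	pthread_mutex_lock(&u->lock);
	u->sys_file = NULL;
	pthread_mutex_unlock(&u->lock);
}
```

The lock keeps the pointer stable while a reader looks at it, and the
get-not-zero step refuses a file whose last reference has already been dropped
but whose close handler has not yet cleared the pointer.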


* Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()
  2019-08-09 23:30   ` Dave Chinner
@ 2019-08-12 18:05     ` Ira Weiny
  2019-08-14  8:04       ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 18:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sat, Aug 10, 2019 at 09:30:37AM +1000, Dave Chinner wrote:
> On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > dax_layout_busy_page() can now operate on a sub-range of the
> > address_space provided.
> > 
> > Have xfs specify the sub range to dax_layout_busy_page()
> 
> Hmmm. I've got patches that change all these XFS interfaces to
> support range locks. I'm not sure the way the ranges are passed here
> is the best way to do it, and I suspect they aren't correct in some
> cases, either....
> 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index ff3c1fae5357..f0de5486f6c1 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
> >  		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> >  		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> >  
> > -		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> > -		if (error) {
> > -			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > -			return error;
> > +		if (iattr->ia_size < inode->i_size) {
> > +			loff_t                  off = iattr->ia_size;
> > +			loff_t                  len = inode->i_size - iattr->ia_size;
> > +
> > +			error = xfs_break_layouts(inode, &iolock, off, len,
> > +						  BREAK_UNMAP);
> > +			if (error) {
> > +				xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > +				return error;
> > +			}
> 
> This isn't right - truncate up still needs to break the layout on
> the last filesystem block of the file,

I'm not following this?  From a user perspective they can't have done anything
with the data beyond the EOF.  So isn't it safe to allow EOF to grow without
changing the layout of that last block?

> and truncate down needs to
> extend to "maximum file offset" because we remove all extents beyond
> EOF on a truncate down.

Ok, I was trying to allow a user to extend the file without conflicts if they
were to have a pin on the 'beginning' of the original file.  This sounds like
you are saying that a layout lease must be dropped to do that?  In some ways I
think I understand what you are driving at and I think I see how I may have
been playing "fast and loose" with the strictness of the layout lease.  But
from a user perspective if there is a part of the file which "does not exist"
(beyond EOF) does it matter that the layout there may change?

> 
> i.e. when we use preallocation, the extent map extends beyond EOF,
> and layout leases need to be able to extend beyond the current EOF
> to allow the lease owner to do extending writes, extending truncate,
> preallocation beyond EOF, etc safely without having to get a new
> lease to cover the new region in the extended file...

I'm not following this.  What determines when preallocation is done?

Forgive my ignorance of file systems, but how can every file have a layout
that extends to "maximum file offset", even if the file is only 1 page long?

Thanks,
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
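
To make the disagreement above concrete, here is a sketch of one possible
reading of Dave's two comments, next to which the patch's range (off =
ia_size, len = i_size - ia_size, and only on truncate down) can be compared.
This is an interpretation, not code from XFS or from the series: the names
break_range, MODEL_MAX_OFFSET and truncate_break_range() are hypothetical, and
limiting the truncate-up case to exactly one block is my assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for "maximum file offset"; the real limit is filesystem specific. */
#define MODEL_MAX_OFFSET INT64_MAX

/* Byte range over which layouts would be broken (hypothetical type). */
struct break_range {
	int64_t off;
	int64_t len;
};

static struct break_range truncate_break_range(int64_t old_size,
						int64_t new_size,
						int64_t blksize)
{
	struct break_range r;

	assert(old_size >= 0 && new_size >= 0 && blksize > 0);

	if (new_size < old_size) {
		/*
		 * Truncate down: all extents beyond the new EOF are removed,
		 * including preallocation past the old EOF, so the range runs
		 * from the new size to the maximum offset.
		 */
		r.off = new_size;
		r.len = MODEL_MAX_OFFSET - new_size;
	} else {
		/*
		 * Not shrinking: cover the filesystem block that holds the
		 * last byte of the existing file (assumed: one block only).
		 */
		int64_t last_byte = old_size > 0 ? old_size - 1 : 0;

		r.off = last_byte - (last_byte % blksize);
		r.len = blksize;
	}
	return r;
}
```

With a 4096-byte block size, shrinking an 8192-byte file to 4096 bytes gives a
range starting at 4096 and running to the maximum offset, while growing a
5000-byte file gives the single block starting at 4096.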


* Re: [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken
  2019-08-09 23:22   ` Dave Chinner
@ 2019-08-12 18:08     ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 18:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sat, Aug 10, 2019 at 09:22:09AM +1000, Dave Chinner wrote:
> On Fri, Aug 09, 2019 at 03:58:22PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > > If pages are under a lease, fail the truncate operation.  We change the order of
> > lease breaks to directly fail the operation if the lease exists.
> > 
> > > Select EXPORTFS_BLOCK_OPS for FS_DAX to ensure that xfs_break_lease_layouts() is
> > defined for FS_DAX as well as pNFS.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/Kconfig        | 1 +
> >  fs/xfs/xfs_file.c | 5 +++--
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 14cd4abdc143..c10b91f92528 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -48,6 +48,7 @@ config FS_DAX
> >  	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
> >  	select FS_IOMAP
> >  	select DAX
> > +	select EXPORTFS_BLOCK_OPS
> >  	help
> >  	  Direct Access (DAX) can be used on memory-backed block devices.
> >  	  If the block device supports DAX and the filesystem supports DAX,
> 
> That looks wrong.

It may be...

>
> If you require xfs_break_lease_layouts() outside
> of pnfs context, then move the function in the XFS code base to a
> file that is built in. Its only external dependency is on the
> break_layout() function, and XFS already has other unconditional
> direct calls to break_layout()...

I'll check.  This patch was part of the original series and I must admit I
don't remember why I did it this way...

Thanks,
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack
  2019-08-10  0:18   ` John Hubbard
@ 2019-08-12 19:01     ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 19:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 05:18:31PM -0700, John Hubbard wrote:
> On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In preparation for passing more information (vaddr_pin) into
> > follow_page_pte(), follow_devmap_pud(), and follow_devmap_pmd().
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>

[snip]

> > @@ -786,7 +782,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  		unsigned long start, unsigned long nr_pages,
> >  		unsigned int gup_flags, struct page **pages,
> > -		struct vm_area_struct **vmas, int *nonblocking)
> > +		struct vm_area_struct **vmas, int *nonblocking,
> > +		struct vaddr_pin *vaddr_pin)
> 
> I didn't expect to see more vaddr_pin arg passing, based on the commit
> description. Did you want this as part of patch 9 or 10 instead? If not,
> then let's mention it in the commit description.

Yea that does seem out of place now that I look at it.  I'll add to the commit
message because this is really getting vaddr_pin into the context _and_ passing
it down the stack.  With all the rebasing I may have squashed something I did
not mean to.  But I think this patch is ok because it is not too complicated to
see what is going on.

Thanks,
Ira

> 
> >  {
> >  	long ret = 0, i = 0;
> >  	struct vm_area_struct *vma = NULL;
> > @@ -797,6 +794,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  
> >  	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> >  
> > +	ctx.vaddr_pin = vaddr_pin;
> > +
> >  	/*
> >  	 * If FOLL_FORCE is set then do not force a full fault as the hinting
> >  	 * fault information is unrelated to the reference behaviour of a task
> > @@ -1025,7 +1024,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >  	lock_dropped = false;
> >  	for (;;) {
> >  		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> > -				       vmas, locked);
> > +				       vmas, locked, vaddr_pin);
> >  		if (!locked)
> >  			/* VM_FAULT_RETRY couldn't trigger, bypass */
> >  			return ret;
> > @@ -1068,7 +1067,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >  		lock_dropped = true;
> >  		down_read(&mm->mmap_sem);
> >  		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
> > -				       pages, NULL, NULL);
> > +				       pages, NULL, NULL, vaddr_pin);
> >  		if (ret != 1) {
> >  			BUG_ON(ret > 1);
> >  			if (!pages_done)
> > @@ -1226,7 +1225,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> >  	 * not result in a stack expansion that recurses back here.
> >  	 */
> >  	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
> > -				NULL, NULL, nonblocking);
> > +				NULL, NULL, nonblocking, NULL);
> >  }
> >  
> >  /*
> > @@ -1311,7 +1310,7 @@ struct page *get_dump_page(unsigned long addr)
> >  
> >  	if (__get_user_pages(current, current->mm, addr, 1,
> >  			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
> > -			     NULL) < 1)
> > +			     NULL, NULL) < 1)
> >  		return NULL;
> >  	flush_cache_page(vma, addr, page_to_pfn(page));
> >  	return page;
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bc1a07a55be1..7e09f2f17ed8 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -916,8 +916,9 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  
> >  struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> > -		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
> > +		pmd_t *pmd, int flags, struct follow_page_context *ctx)
> >  {
> > +	struct dev_pagemap **pgmap = &ctx->pgmap;
> >  	unsigned long pfn = pmd_pfn(*pmd);
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct page *page;
> > @@ -1068,8 +1069,9 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  
> >  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> > -		pud_t *pud, int flags, struct dev_pagemap **pgmap)
> > +		pud_t *pud, int flags, struct follow_page_context *ctx)
> >  {
> > +	struct dev_pagemap **pgmap = &ctx->pgmap;
> >  	unsigned long pfn = pud_pfn(*pud);
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct page *page;
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 0d5f720c75ab..46ada5279856 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -12,6 +12,34 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/tracepoint-defs.h>
> >  
> > +struct follow_page_context {
> > +	struct dev_pagemap *pgmap;
> > +	unsigned int page_mask;
> > +	struct vaddr_pin *vaddr_pin;
> > +};
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> > +		pmd_t *pmd, int flags, struct follow_page_context *ctx);
> > +struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> > +		pud_t *pud, int flags, struct follow_page_context *ctx);
> > +#else
> > +static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
> > +	unsigned long addr, pmd_t *pmd, int flags,
> > +	struct follow_page_context *ctx)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
> > +	unsigned long addr, pud_t *pud, int flags,
> > +	struct follow_page_context *ctx)
> > +{
> > +	return NULL;
> > +}
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> > +
> >  /*
> >   * The set of flags that only affect watermark checking and reclaim
> >   * behaviour. This is used by the MM to obey the caller constraints
> > 
> 
> 
> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


* Re: [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct
  2019-08-10  0:30   ` John Hubbard
@ 2019-08-12 20:46     ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 20:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 05:30:00PM -0700, John Hubbard wrote:
> On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Once callers start to use vaddr_pin the put_user_pages calls will need
> > to have access to this data coming in.  Prep put_user_pages() for this
> > data.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>

[snip]

> > diff --git a/mm/gup.c b/mm/gup.c
> > index a7a9d2f5278c..10cfd30ff668 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -24,30 +24,41 @@
> >  
> >  #include "internal.h"
> >  
> > -/**
> > - * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> > - * @pages:  array of pages to be maybe marked dirty, and definitely released.
> 
> A couple comments from our circular review chain: some fellow with the same
> last name as you, recommended wording it like this:
> 
>       @pages:  array of pages to be put

Sure, see below...

> 
> > - * @npages: number of pages in the @pages array.
> > - * @make_dirty: whether to mark the pages dirty
> > - *
> > - * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> > - * variants called on that page.
> > - *
> > - * For each page in the @pages array, make that page (or its head page, if a
> > - * compound page) dirty, if @make_dirty is true, and if the page was previously
> > - * listed as clean. In any case, releases all pages using put_user_page(),
> > - * possibly via put_user_pages(), for the non-dirty case.
> > - *
> > - * Please see the put_user_page() documentation for details.
> > - *
> > - * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> > - * required, then the caller should a) verify that this is really correct,
> > - * because _lock() is usually required, and b) hand code it:
> > - * set_page_dirty_lock(), put_user_page().
> > - *
> > - */
> > -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> > -			       bool make_dirty)
> > +static void __put_user_page(struct vaddr_pin *vaddr_pin, struct page *page)
> > +{
> > +	page = compound_head(page);
> > +
> > +	/*
> > +	 * For devmap managed pages we need to catch refcount transition from
> > +	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> > +	 * page is free and we need to inform the device driver through
> > +	 * callback. See include/linux/memremap.h and HMM for details.
> > +	 */
> > +	if (put_devmap_managed_page(page))
> > +		return;
> > +
> > +	if (put_page_testzero(page))
> > +		__put_page(page);
> > +}
> > +
> > +static void __put_user_pages(struct vaddr_pin *vaddr_pin, struct page **pages,
> > +			     unsigned long npages)
> > +{
> > +	unsigned long index;
> > +
> > +	/*
> > +	 * TODO: this can be optimized for huge pages: if a series of pages is
> > +	 * physically contiguous and part of the same compound page, then a
> > +	 * single operation to the head page should suffice.
> > +	 */
> 
> As discussed in the other review thread (""), let's just delete that comment,
> as long as you're moving things around.

Done.

> 
> 
> > +	for (index = 0; index < npages; index++)
> > +		__put_user_page(vaddr_pin, pages[index]);
> > +}
> > +
> > +static void __put_user_pages_dirty_lock(struct vaddr_pin *vaddr_pin,
> > +					struct page **pages,
> > +					unsigned long npages,
> > +					bool make_dirty)
> 
> Elsewhere in this series, we pass vaddr_pin at the end of the arg list.
> Here we pass it at the beginning, and it caused a minor jar when reading it.
> Obviously just bike shedding at this point, though. Either way. :)

Yea I guess that is odd...  I changed it.  Not a big deal.

> 
> >  {
> >  	unsigned long index;
> >  
> > @@ -58,7 +69,7 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> >  	 */
> >  
> >  	if (!make_dirty) {
> > -		put_user_pages(pages, npages);
> > +		__put_user_pages(vaddr_pin, pages, npages);
> >  		return;
> >  	}
> >  
> > @@ -86,9 +97,58 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> >  		 */
> >  		if (!PageDirty(page))
> >  			set_page_dirty_lock(page);
> > -		put_user_page(page);
> > +		__put_user_page(vaddr_pin, page);
> >  	}
> >  }
> > +
> > +/**
> > + * put_user_page() - release a gup-pinned page
> > + * @page:            pointer to page to be released
> > + *
> > + * Pages that were pinned via get_user_pages*() must be released via
> > + * either put_user_page(), or one of the put_user_pages*() routines
> > + * below. This is so that eventually, pages that are pinned via
> > + * get_user_pages*() can be separately tracked and uniquely handled. In
> > + * particular, interactions with RDMA and filesystems need special
> > + * handling.
> > + *
> > + * put_user_page() and put_page() are not interchangeable, despite this early
> > + * implementation that makes them look the same. put_user_page() calls must
> > + * be perfectly matched up with get_user_page() calls.
> > + */
> > +void put_user_page(struct page *page)
> > +{
> > +	__put_user_page(NULL, page);
> > +}
> > +EXPORT_SYMBOL(put_user_page);
> > +
> > +/**
> > + * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> > + * @pages:  array of pages to be maybe marked dirty, and definitely released.
> 
> Same here:
> 
>       @pages:  array of pages to be put

Actually here is the only place.  Above was removing the text to be put here...

Done -- I'll make a lead-in patch because this was just copied text.

> 
> > + * @npages: number of pages in the @pages array.
> > + * @make_dirty: whether to mark the pages dirty
> > + *
> > + * "gup-pinned page" refers to a page that has had one of the get_user_pages()
> > + * variants called on that page.
> > + *
> > + * For each page in the @pages array, make that page (or its head page, if a
> > + * compound page) dirty, if @make_dirty is true, and if the page was previously
> > + * listed as clean. In any case, releases all pages using put_user_page(),
> > + * possibly via put_user_pages(), for the non-dirty case.
> > + *
> > + * Please see the put_user_page() documentation for details.
> > + *
> > + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> > + * required, then the caller should a) verify that this is really correct,
> > + * because _lock() is usually required, and b) hand code it:
> > + * set_page_dirty_lock(), put_user_page().
> > + *
> > + */
> > +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> > +			       bool make_dirty)
> > +{
> > +	__put_user_pages_dirty_lock(NULL, pages, npages, make_dirty);
> > +}
> >  EXPORT_SYMBOL(put_user_pages_dirty_lock);
> >  
> >  /**
> > @@ -102,15 +162,7 @@ EXPORT_SYMBOL(put_user_pages_dirty_lock);
> >   */
> >  void put_user_pages(struct page **pages, unsigned long npages)
> >  {
> > -	unsigned long index;
> > -
> > -	/*
> > -	 * TODO: this can be optimized for huge pages: if a series of pages is
> > -	 * physically contiguous and part of the same compound page, then a
> > -	 * single operation to the head page should suffice.
> > -	 */
> > -	for (index = 0; index < npages; index++)
> > -		put_user_page(pages[index]);
> > +	__put_user_pages(NULL, pages, npages);
> >  }
> >  EXPORT_SYMBOL(put_user_pages);
> >  
> > 
> 
> This all looks pretty good, so regardless of the outcome of the minor
> points above,
>    
>     Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thanks,
Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-10  0:09   ` John Hubbard
@ 2019-08-12 21:00     ` Ira Weiny
  2019-08-12 21:20       ` John Hubbard
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 21:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 05:09:54PM -0700, John Hubbard wrote:
> On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> > pages.
> > 
> > In addition subsystems such as RDMA require new information to be passed
> > to the GUP interface to track file owning information.  As such a simple
> > FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> > 
> > Introduce a new GUP like call which takes the newly introduced vaddr_pin
> > information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> > call will result in a failure if pins were created on files during the
> > pin operation.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from list:
> > 	Change to vaddr_put_pages_dirty_lock
> > 	Change to vaddr_unpin_pages_dirty_lock
> > 
> >  include/linux/mm.h |  5 ++++
> >  mm/gup.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 64 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 657c947bda49..90c5802866df 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1603,6 +1603,11 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
> >  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
> >  			struct task_struct *task, bool bypass_rlim);
> >  
> > +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> > +		     unsigned int gup_flags, struct page **pages,
> > +		     struct vaddr_pin *vaddr_pin);
> > +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> > +				  struct vaddr_pin *vaddr_pin, bool make_dirty);
> 
> Hi Ira,
> 
> OK, the API seems fine to me, anyway. :)
> 
> A bit more below...
> 
> >  bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
> >  
> >  /* Container for pinned pfns / pages */
> > diff --git a/mm/gup.c b/mm/gup.c
> > index eeaa0ddd08a6..6d23f70d7847 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2536,3 +2536,62 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> > +
> > +/**
> > + * vaddr_pin_pages pin pages by virtual address and return the pages to the
> > + * user.
> > + *
> > + * @addr, start address
> 
> What's with the commas? I thought kernel-doc wants colons, like this, right?
> 
> @addr: start address

:-/  I don't know.

Fixed.

> 
> 
> > + * @nr_pages, number of pages to pin
> > + * @gup_flags, flags to use for the pin
> > + * @pages, array of pages returned
> > + * @vaddr_pin, initalized meta information this pin is to be associated
> > + * with.
> > + *
> > + * NOTE regarding vaddr_pin:
> > + *
> > + * Some callers can share pins via file descriptors to other processes.
> > + * Callers such as this should use the f_owner field of vaddr_pin to indicate
> > + * the file the fd points to.  All other callers should use the mm this pin is
> > + * being made against.  Usually "current->mm".
> > + *
> > + * Expects mmap_sem to be read locked.
> > + */
> > +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> > +		     unsigned int gup_flags, struct page **pages,
> > +		     struct vaddr_pin *vaddr_pin)
> > +{
> > +	long ret;
> > +
> > +	gup_flags |= FOLL_LONGTERM;
> 
> 
> Is now the right time to introduce and use FOLL_PIN? If not, then I can always
> add it on top of this later, as part of gup-tracking patches. But you did point
> out that FOLL_LONGTERM is taking on additional meaning, and so maybe it's better
> to split that meaning up right from the start.
> 

At one point I wanted a new flag (and had one in my tree), but I moved away from
it.  Prior to the discussion on mlock last week I did not think we needed it.
But I'm ok to add it back in.

I was not ignoring the idea for this RFC; I just wanted to get this out there
for people to see.  I see that you threw out a couple of patches which add this
flag in.

FWIW, I think it would be good to differentiate between an indefinitely pinned
page vs a referenced "gotten" page.

What you and I have been working on is the former.  So it would be easy to
change your refcounting patches to simply key off of FOLL_PIN.

Would you like me to add in your FOLL_PIN patches to this series?

> 
> > +
> > +	if (!vaddr_pin || (!vaddr_pin->mm && !vaddr_pin->f_owner))
> > +		return -EINVAL;
> > +
> > +	ret = __gup_longterm_locked(current,
> > +				    vaddr_pin->mm,
> > +				    addr, nr_pages,
> > +				    pages, NULL, gup_flags,
> > +				    vaddr_pin);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(vaddr_pin_pages);
> > +
> > +/**
> > + * vaddr_unpin_pages_dirty_lock - counterpart to vaddr_pin_pages
> > + *
> > + * @pages, array of pages returned
> > + * @nr_pages, number of pages in pages
> > + * @vaddr_pin, same information passed to vaddr_pin_pages
> > + * @make_dirty: whether to mark the pages dirty
> > + *
> > + * The semantics are similar to put_user_pages_dirty_lock but a vaddr_pin used
> > + * in vaddr_pin_pages should be passed back into this call for propper
> 
> Typo:
                                                                   proper
Fixed.

> 
> > + * tracking.
> > + */
> > +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> > +				  struct vaddr_pin *vaddr_pin, bool make_dirty)
> > +{
> > +	__put_user_pages_dirty_lock(vaddr_pin, pages, nr_pages, make_dirty);
> > +}
> > +EXPORT_SYMBOL(vaddr_unpin_pages_dirty_lock);
> > 
> 
> OK, whew, I'm glad to see the updated _dirty_lock() API used here. :)

Yea this was pretty easy to change during the rebase.  Again I'm kind of
floating these quickly at this point.  So sorry about the nits...

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-11 23:07   ` John Hubbard
@ 2019-08-12 21:01     ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 21:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sun, Aug 11, 2019 at 04:07:23PM -0700, John Hubbard wrote:
> On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> > pages.
> > 
> > In addition subsystems such as RDMA require new information to be passed
> > to the GUP interface to track file owning information.  As such a simple
> > FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> > 
> > Introduce a new GUP like call which takes the newly introduced vaddr_pin
> > information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> > call will result in a failure if pins were created on files during the
> > pin operation.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> 
> I'm creating a new call site conversion series, to replace the 
> "put_user_pages(): miscellaneous call sites" series. This uses
> vaddr_pin_pages*() where appropriate. So it's based on your series here.
> 
> btw, while doing that, I noticed one more typo while re-reading some of the comments. 
> Thought you probably want to collect them all for the next spin. Below...
> 
> > ---
> > Changes from list:
> > 	Change to vaddr_put_pages_dirty_lock
> > 	Change to vaddr_unpin_pages_dirty_lock
> > 
> >  include/linux/mm.h |  5 ++++
> >  mm/gup.c           | 59 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 64 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 657c947bda49..90c5802866df 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1603,6 +1603,11 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
> >  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
> >  			struct task_struct *task, bool bypass_rlim);
> >  
> > +long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
> > +		     unsigned int gup_flags, struct page **pages,
> > +		     struct vaddr_pin *vaddr_pin);
> > +void vaddr_unpin_pages_dirty_lock(struct page **pages, unsigned long nr_pages,
> > +				  struct vaddr_pin *vaddr_pin, bool make_dirty);
> >  bool mapping_inode_has_layout(struct vaddr_pin *vaddr_pin, struct page *page);
> >  
> >  /* Container for pinned pfns / pages */
> > diff --git a/mm/gup.c b/mm/gup.c
> > index eeaa0ddd08a6..6d23f70d7847 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2536,3 +2536,62 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> > +
> > +/**
> > + * vaddr_pin_pages pin pages by virtual address and return the pages to the
> > + * user.
> > + *
> > + * @addr, start address
> > + * @nr_pages, number of pages to pin
> > + * @gup_flags, flags to use for the pin
> > + * @pages, array of pages returned
> > + * @vaddr_pin, initalized meta information this pin is to be associated
> 
> Typo:
>                   initialized

Thanks fixed.
Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-12 17:56       ` Jason Gunthorpe
@ 2019-08-12 21:15         ` Ira Weiny
  2019-08-13 11:48           ` Jason Gunthorpe
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 21:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 02:56:15PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> > On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > In order for MRs to be tracked against the open verbs context the ufile
> > > > needs to have a pointer to hand to the GUP code.
> > > > 
> > > > No references need to be taken as this should be valid for the lifetime
> > > > of the context.
> > > > 
> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > >  drivers/infiniband/core/uverbs.h      | 1 +
> > > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > > >  2 files changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > > +++ b/drivers/infiniband/core/uverbs.h
> > > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > > >  	struct page *disassociate_page;
> > > >  
> > > >  	struct xarray		idr;
> > > > +	struct file             *sys_file; /* backpointer to system file object */
> > > >  };
> > > 
> > > The 'struct file' has a lifetime strictly shorter than the
> > > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > > pointer like this is confounding as it will be invalid for some of the
> > > lifetime of the struct.
> > 
> > Ah...  ok.  I really thought it was the other way around.
> > 
> > __fput() should not call ib_uverbs_close() until the last reference on struct
> > file is released...  What holds references to struct ib_uverbs_file past that?
> 
> Child fds hold onto the internal ib_uverbs_file until they are closed

The FDs hold the struct file, don't they?

> 
> > Perhaps I need to add this (untested)?
> > 
> > diff --git a/drivers/infiniband/core/uverbs_main.c
> > b/drivers/infiniband/core/uverbs_main.c
> > index f628f9e4c09f..654e774d9cf2 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -1125,6 +1125,8 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
> >         list_del_init(&file->list);
> >         mutex_unlock(&file->device->lists_mutex);
> >  
> > +       file->sys_file = NULL;
> 
> Now this has unlocked updates to that data.. you'd need some lock and
> get not zero pattern

You can't call "get" here because I'm 99% sure we only get here when struct
file has no references left...  I could be wrong.  It took me a while to work
through the reference counting so I could have missed something.

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-12 21:00     ` Ira Weiny
@ 2019-08-12 21:20       ` John Hubbard
  0 siblings, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-08-12 21:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/12/19 2:00 PM, Ira Weiny wrote:
> On Fri, Aug 09, 2019 at 05:09:54PM -0700, John Hubbard wrote:
>> On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
>>> From: Ira Weiny <ira.weiny@intel.com>
...
> 
> At one point I wanted to (and had in my tree) a new flag but I went away from
> it.  Prior to the discussion on mlock last week I did not think we needed it.
> But I'm ok to add it back in.
> 
> I was not ignoring the idea for this RFC I just wanted to get this out there
> for people to see.  I see that you threw out a couple of patches which add this
> flag in.
> 
> FWIW, I think it would be good to differentiate between an indefinite pinned
> page vs a referenced "gotten" page.
> 
> What you and I have been working on is the former.  So it would be easy to
> change your refcounting patches to simply key off of FOLL_PIN.
> 
> Would you like me to add in your FOLL_PIN patches to this series?

Sure, that would be perfect. They don't make any sense on their own, and
it's all part of the same design idea.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-12 12:28   ` Jason Gunthorpe
@ 2019-08-12 21:48     ` Ira Weiny
  2019-08-13 11:47       ` Jason Gunthorpe
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-12 21:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 09:28:14AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2019 at 03:58:29PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> > pages.
> > 
> > In addition subsystems such as RDMA require new information to be passed
> > to the GUP interface to track file owning information.  As such a simple
> > FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> > 
> > Introduce a new GUP like call which takes the newly introduced vaddr_pin
> > information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> > call will result in a failure if pins were created on files during the
> > pin operation.
> 
> Is this a 'vaddr' in the traditional sense, ie does it work with
> something returned by valloc?

...or malloc in user space, yes.  I think the idea is that it is a user virtual
address.

> 
> Maybe another name would be better?

Maybe, the name I had was way worse...  So I'm not even going to admit to it...

;-)

So I'm open to suggestions.  Jan gave me this one, so I figured it was safer to
suggest it...

:-D

> 
> I also wish GUP like functions took in a 'void __user *' instead of
> the unsigned long to make this clear :\

Not a bad idea.  But I only see a couple of call sites who actually use a 'void
__user *' to pass into GUP...  :-/

For RDMA the address is _never_ a 'void __user *' AFAICS.

For the new API, it may be tractable to force users to cast to 'void __user *'
but it is not going to provide any type safety.

But it is easy to change in this series.

What do others think?

Ira

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-12 21:48     ` Ira Weiny
@ 2019-08-13 11:47       ` Jason Gunthorpe
  2019-08-13 17:46         ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-13 11:47 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 02:48:55PM -0700, Ira Weiny wrote:
> On Mon, Aug 12, 2019 at 09:28:14AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2019 at 03:58:29PM -0700, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> > > pages.
> > > 
> > > In addition subsystems such as RDMA require new information to be passed
> > > to the GUP interface to track file owning information.  As such a simple
> > > FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> > > 
> > > Introduce a new GUP like call which takes the newly introduced vaddr_pin
> > > information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> > > call will result in a failure if pins were created on files during the
> > > pin operation.
> > 
> > Is this a 'vaddr' in the traditional sense, ie does it work with
> > something returned by valloc?
> 
> ...or malloc in user space, yes.  I think the idea is that it is a user virtual
> address.

valloc is a kernel call

> So I'm open to suggestions.  Jan gave me this one, so I figured it was safer to
> suggest it...

Should have the word user in it, imho

> > I also wish GUP like functions took in a 'void __user *' instead of
> > the unsigned long to make this clear :\
> 
> Not a bad idea.  But I only see a couple of call sites who actually use a 'void
> __user *' to pass into GUP...  :-/
> 
> For RDMA the address is _never_ a 'void __user *' AFAICS.

That is actually a bug, converting from u64 to a 'user VA' needs to go
through u64_to_user_ptr().
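
For illustration, a minimal userspace sketch of that conversion.  The
kernel's u64_to_user_ptr() is the real helper; the function name
sketch_u64_to_user_ptr() and the empty __user definition below are
stand-ins assumed here so the example builds outside the kernel:

```c
#include <assert.h>
#include <stdint.h>

/*
 * In the kernel, __user is an annotation checked by sparse.  It is
 * defined away here (an assumption of this sketch) so the code builds
 * in userspace.
 */
#define __user

/*
 * Hypothetical stand-in for the kernel's u64_to_user_ptr(): the value
 * goes through uintptr_t so the integer-to-pointer cast is well defined
 * on both 32-bit and 64-bit builds, and the result carries the __user
 * annotation instead of travelling as a bare integer.
 */
static inline void __user *sketch_u64_to_user_ptr(uint64_t addr)
{
	return (void __user *)(uintptr_t)addr;
}
```

The point is only that the u64 coming from the ABI becomes an annotated
pointer in exactly one place, rather than being passed around as an
unsigned long.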

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-12 21:15         ` Ira Weiny
@ 2019-08-13 11:48           ` Jason Gunthorpe
  2019-08-13 17:41             ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-13 11:48 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 02:15:37PM -0700, Ira Weiny wrote:
> On Mon, Aug 12, 2019 at 02:56:15PM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> > > On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > 
> > > > > In order for MRs to be tracked against the open verbs context the ufile
> > > > > needs to have a pointer to hand to the GUP code.
> > > > > 
> > > > > No references need to be taken as this should be valid for the lifetime
> > > > > of the context.
> > > > > 
> > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > > >  drivers/infiniband/core/uverbs.h      | 1 +
> > > > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > > > >  2 files changed, 2 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > > > +++ b/drivers/infiniband/core/uverbs.h
> > > > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > > > >  	struct page *disassociate_page;
> > > > >  
> > > > >  	struct xarray		idr;
> > > > > +	struct file             *sys_file; /* backpointer to system file object */
> > > > >  };
> > > > 
> > > > The 'struct file' has a lifetime strictly shorter than the
> > > > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > > > pointer like this is confounding as it will be invalid for some of the
> > > > lifetime of the struct.
> > > 
> > > Ah...  ok.  I really thought it was the other way around.
> > > 
> > > __fput() should not call ib_uverbs_close() until the last reference on struct
> > > file is released...  What holds references to struct ib_uverbs_file past that?
> > 
> > Child fds hold onto the internal ib_uverbs_file until they are closed
> 
> The FDs hold the struct file, don't they?

Only dups, there are other 'child' FDs we can create

> > Now this has unlocked updates to that data.. you'd need some lock and
> > get not zero pattern
> 
> You can't call "get" here because I'm 99% sure we only get here when struct
> file has no references left...

Nope, like I said the other FDs hold the uverbs_file independent of
the struct file it is related to.

This is why having a back pointer like this is so ugly, it creates a
reference counting cycle
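
As a rough illustration of the "get not zero" half of the pattern
mentioned above, here is a userspace sketch using C11 atomics.  The
names sketch_obj, sketch_get_not_zero() and sketch_put() are
hypothetical and this is not the uverbs or struct file code; the lock
(or RCU) that would have to protect the back pointer lookup itself is
left out:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; not the real struct file. */
struct sketch_obj {
	atomic_long refs;
};

/*
 * Take a reference only if the object is still live (refs > 0).
 * Returns false when the count has already reached zero, meaning the
 * object is being torn down and must not be used.
 */
static bool sketch_get_not_zero(struct sketch_obj *obj)
{
	long old = atomic_load(&obj->refs);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&obj->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool sketch_put(struct sketch_obj *obj)
{
	return atomic_fetch_sub(&obj->refs, 1) == 1;
}
```

A reader that finds the object through a back pointer would call
sketch_get_not_zero() and give up if it returns false, instead of
assuming the pointed-to object is still alive.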

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-13 11:48           ` Jason Gunthorpe
@ 2019-08-13 17:41             ` Ira Weiny
  2019-08-13 18:00               ` Jason Gunthorpe
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-13 17:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 13, 2019 at 08:48:42AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2019 at 02:15:37PM -0700, Ira Weiny wrote:
> > On Mon, Aug 12, 2019 at 02:56:15PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 12, 2019 at 10:28:27AM -0700, Ira Weiny wrote:
> > > > On Mon, Aug 12, 2019 at 10:00:40AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Aug 09, 2019 at 03:58:30PM -0700, ira.weiny@intel.com wrote:
> > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > 
> > > > > > In order for MRs to be tracked against the open verbs context the ufile
> > > > > > needs to have a pointer to hand to the GUP code.
> > > > > > 
> > > > > > No references need to be taken as this should be valid for the lifetime
> > > > > > of the context.
> > > > > > 
> > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > > > >  drivers/infiniband/core/uverbs.h      | 1 +
> > > > > >  drivers/infiniband/core/uverbs_main.c | 1 +
> > > > > >  2 files changed, 2 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
> > > > > > index 1e5aeb39f774..e802ba8c67d6 100644
> > > > > > +++ b/drivers/infiniband/core/uverbs.h
> > > > > > @@ -163,6 +163,7 @@ struct ib_uverbs_file {
> > > > > >  	struct page *disassociate_page;
> > > > > >  
> > > > > >  	struct xarray		idr;
> > > > > > +	struct file             *sys_file; /* backpointer to system file object */
> > > > > >  };
> > > > > 
> > > > > The 'struct file' has a lifetime strictly shorter than the
> > > > > ib_uverbs_file, which is kref'd on its own lifetime. Having a back
> > > > > > pointer like this is confounding as it will be invalid for some of the
> > > > > lifetime of the struct.
> > > > 
> > > > Ah...  ok.  I really thought it was the other way around.
> > > > 
> > > > __fput() should not call ib_uverbs_close() until the last reference on struct
> > > > file is released...  What holds references to struct ib_uverbs_file past that?
> > > 
> > > Child fds hold onto the internal ib_uverbs_file until they are closed
> > 
> > The FDs hold the struct file, don't they?
> 
> Only dups, there are other 'child' FDs we can create
> 
> > > Now this has unlocked updates to that data.. you'd need some lock and
> > > get not zero pattern
> > 
> > You can't call "get" here because I'm 99% sure we only get here when struct
> > file has no references left...
> 
> Nope, like I said the other FDs hold the uverbs_file independent of
> the struct file it is related to.

<sigh>

We don't allow memory registrations to be created with those other FDs...

And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
that some other thread is) destroying all the MR's we have associated with this
FD.

I'll have to think on this more since uverbs_destroy_ufile_hw() does not
block...  Which means there is a window here within the GUP code...  :-/

> 
> This is why having a back pointer like this is so ugly, it creates a
> reference counting cycle

Yep...  I worked through this...  and it was giving me fits...

Anyway, the struct file is the only object in the core which was reasonable to
store this information in since that is what is passed around to other
processes...

Another idea I explored was to create a callback into the driver from the core
which put the responsibility of printing the pin information on the driver.

But that started to be (and is likely going to be) a pretty complicated "dance"
between the core and the drivers so I went this way...

I also thought about holding some other reference on struct file which would
allow release to be called while keeping struct file around.  But that seemed
crazy...

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-13 11:47       ` Jason Gunthorpe
@ 2019-08-13 17:46         ` Ira Weiny
  2019-08-13 17:56           ` John Hubbard
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-13 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 13, 2019 at 08:47:06AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2019 at 02:48:55PM -0700, Ira Weiny wrote:
> > On Mon, Aug 12, 2019 at 09:28:14AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2019 at 03:58:29PM -0700, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > The addition of FOLL_LONGTERM has taken on additional meaning for CMA
> > > > pages.
> > > > 
> > > > In addition subsystems such as RDMA require new information to be passed
> > > > to the GUP interface to track file owning information.  As such a simple
> > > > FOLL_LONGTERM flag is no longer sufficient for these users to pin pages.
> > > > 
> > > > Introduce a new GUP like call which takes the newly introduced vaddr_pin
> > > > information.  Failure to pass the vaddr_pin object back to a vaddr_put*
> > > > call will result in a failure if pins were created on files during the
> > > > pin operation.
> > > 
> > > Is this a 'vaddr' in the traditional sense, ie does it work with
> > > something returned by valloc?
> > 
> > ...or malloc in user space, yes.  I think the idea is that it is a user virtual
> > address.
> 
> valloc is a kernel call

Oh...  I thought you meant this: https://linux.die.net/man/3/valloc

> 
> > So I'm open to suggestions.  Jan gave me this one, so I figured it was safer to
> > suggest it...
> 
> Should have the word user in it, imho

Fair enough...

user_addr_pin_pages(void __user * addr, ...) ?

uaddr_pin_pages(void __user * addr, ...) ?

I think I like uaddr...

> 
> > > I also wish GUP like functions took in a 'void __user *' instead of
> > > the unsigned long to make this clear :\
> > 
> > Not a bad idea.  But I only see a couple of call sites who actually use a 'void
> > __user *' to pass into GUP...  :-/
> > 
> > For RDMA the address is _never_ a 'void __user *' AFAICS.
> 
> That is actually a bug, converting from u64 to a 'user VA' needs to go
> through u64_to_user_ptr().

Fair enough.

But there are a lot of call sites throughout the kernel which have the same
bug...  I'm ok with requiring u64_to_user_ptr() to use this new call if others
are.

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages()
  2019-08-13 17:46         ` Ira Weiny
@ 2019-08-13 17:56           ` John Hubbard
  0 siblings, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-08-13 17:56 UTC (permalink / raw)
  To: Ira Weiny, Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/13/19 10:46 AM, Ira Weiny wrote:
> On Tue, Aug 13, 2019 at 08:47:06AM -0300, Jason Gunthorpe wrote:
>> On Mon, Aug 12, 2019 at 02:48:55PM -0700, Ira Weiny wrote:
>>> On Mon, Aug 12, 2019 at 09:28:14AM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Aug 09, 2019 at 03:58:29PM -0700, ira.weiny@intel.com wrote:
>>>>> From: Ira Weiny <ira.weiny@intel.com>
...
>>> So I'm open to suggestions.  Jan gave me this one, so I figured it was safer to
>>> suggest it...
>>
>> Should have the word user in it, imho
> 
> Fair enough...
> 
> user_addr_pin_pages(void __user * addr, ...) ?
> 
> uaddr_pin_pages(void __user * addr, ...) ?
> 
> I think I like uaddr...
> 

Better to spell out "user". "u" prefixes are used for "unsigned" and it
is just too ambiguous here. Maybe:

    vaddr_pin_user_pages()

...which also sounds close enough to get_user_pages() that a bit of
history and continuity is preserved, too.



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-13 17:41             ` Ira Weiny
@ 2019-08-13 18:00               ` Jason Gunthorpe
  2019-08-13 20:38                 ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-13 18:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:

> And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> that some other thread is) destroying all the MR's we have associated with this
> FD.

fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
deletes any underlying HW resources, but the FD persists.
 
> > This is why having a back pointer like this is so ugly, it creates a
> > reference counting cycle
> 
> Yep...  I worked through this...  and it was giving me fits...
> 
> Anyway, the struct file is the only object in the core which was reasonable to
> store this information in since that is what is passed around to other
> processes...

It could be passed down in the uattr_bundle, once you are in file operations
handle the file is guaranteed to exist, and we've now arranged things
so the uattr_bundle flows into the umem, as umems can only be
established under a system call.

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-13 18:00               ` Jason Gunthorpe
@ 2019-08-13 20:38                 ` Ira Weiny
  2019-08-14 12:23                   ` Jason Gunthorpe
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-13 20:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> 
> > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > that some other thread is) destroying all the MR's we have associated with this
> > FD.
> 
> fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> deletes any underlying HW resources, but the FD persists.

I misspoke.  I should have said associated with this "context".  And of course
uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
struct file which had file_pins hanging off of it would be getting its file
pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
after uverbs_destroy_ufile_hw() is done.

But since it does not block it may be that the struct file is gone before the
MR is actually destroyed.  Which means I think the GUP code would blow up in
that case...  :-(

I was thinking it was the other way around.  And in fact most of the time I
think it is.  But we can't depend on that...

>  
> > > This is why having a back pointer like this is so ugly, it creates a
> > > reference counting cycle
> > 
> > Yep...  I worked through this...  and it was giving me fits...
> > 
> > Anyway, the struct file is the only object in the core which was reasonable to
> > store this information in since that is what is passed around to other
> > processes...
> 
> It could be passed down in the uattr_bundle, once you are in file operations

What is "It"?  The struct file?  Or the file pin information?

> handle the file is guaranteed to exist, and we've now arranged things

I don't understand what you mean by "... once you are in file operations handle... "?

> so the uattr_bundle flows into the umem, as umems can only be
> established under a system call.

"uattr_bundle" == uverbs_attr_bundle right?

The problem is that I don't think the core should be handling
uverbs_attr_bundles directly.  So, I think you are driving at the same idea I
had WRT callbacks into the driver.

The drivers could provide some generic object (in RDMA this could be the
uverbs_attr_bundle) which represents their "context".

The GUP code calls back into the driver with file pin information as it
encounters and pins pages.  The driver, RDMA in this case, associates this
information with the "context".

But for the procfs interface, that context then needs to be associated with any
file which points to it...  For RDMA, or any other "FD based pin mechanism", it
would be up to the driver to "install" a procfs handler into any struct file
which _may_ point to this context.  (before _or_ after memory pins).

Then the procfs code can walk the FD array and if this handler is installed it
knows there is file pin information associated with that struct file and it can
be printed...

This is not impossible.  But I think it is a lot harder for drivers to make
right...

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()
  2019-08-12 18:05     ` Ira Weiny
@ 2019-08-14  8:04       ` Dave Chinner
  0 siblings, 0 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-14  8:04 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 11:05:51AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:30:37AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > dax_layout_busy_page() can now operate on a sub-range of the
> > > address_space provided.
> > > 
> > > Have xfs specify the sub range to dax_layout_busy_page()
> > 
> > Hmmm. I've got patches that change all these XFS interfaces to
> > support range locks. I'm not sure the way the ranges are passed here
> > is the best way to do it, and I suspect they aren't correct in some
> > cases, either....
> > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index ff3c1fae5357..f0de5486f6c1 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
> > >  		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >  		iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > >  
> > > -		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> > > -		if (error) {
> > > -			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > -			return error;
> > > +		if (iattr->ia_size < inode->i_size) {
> > > +			loff_t                  off = iattr->ia_size;
> > > +			loff_t                  len = inode->i_size - iattr->ia_size;
> > > +
> > > +			error = xfs_break_layouts(inode, &iolock, off, len,
> > > +						  BREAK_UNMAP);
> > > +			if (error) {
> > > +				xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > +				return error;
> > > +			}
> > 
> > This isn't right - truncate up still needs to break the layout on
> > the last filesystem block of the file,
> 
> I'm not following this?  From a user perspective they can't have done anything
> with the data beyond the EOF.  So isn't it safe to allow EOF to grow without
> changing the layout of that last block?


You're looking at this from the perspective of RDMA page pinning,
not from the perspective of the guarantees a filesystem has to
provide to layout holders.

For example, truncate up has to zero the portion of the block beyond
EOF and that requires a data write. What happens if that block is a
shared extent and hence we have to do a copy on write and alter the
file layout?

Or perhaps that tail block still has dirty data over it that is
marked for delayed allocation? Truncate up will have to write that
data to zero the delayed allocation extent that spans EOF, and hence
the truncate modifies the layout because it triggers allocation.

i.e. just because an operation does not change user data, it does
not mean that it will not change the file layout. There is a chance
that truncate up will modify the layout and so we need to break the
layout leases that span the range from the old size to the new
size...
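
As a rough illustration only (this is not code from the patch set), the
range reasoning above, together with the review comment that truncate
down removes all extents beyond the new EOF, could be sketched in plain
C as below. The helper name, the struct, and DEMO_OFFSET_MAX are
hypothetical, and rounding the start down to the block containing the
old EOF is an assumption about how "the last filesystem block" would be
covered.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for "maximum file offset"; a hypothetical name for this sketch. */
#define DEMO_OFFSET_MAX INT64_MAX

struct break_range {
	int64_t start;	/* first byte whose layout may change */
	int64_t end;	/* end of the range, exclusive */
};

/*
 * Hypothetical helper: byte range over which layout leases would need
 * to be broken when the file size changes from old_size to new_size.
 */
struct break_range size_change_break_range(int64_t old_size,
					   int64_t new_size,
					   int64_t blksize)
{
	struct break_range r;

	assert(old_size >= 0 && new_size >= 0 && blksize > 0);

	if (new_size < old_size) {
		/*
		 * Truncate down: every extent beyond the new EOF goes
		 * away, including preallocation past the old EOF.
		 */
		r.start = new_size;
		r.end = DEMO_OFFSET_MAX;
	} else {
		/*
		 * Truncate up: the block holding the old EOF may be
		 * zeroed, copied on write, or allocated.
		 */
		r.start = old_size - (old_size % blksize);
		r.end = new_size;
	}
	return r;
}
```

With a 4096 byte block size, shrinking from 5000 to 1000 bytes gives a
range starting at 1000 and running to the maximum offset, while growing
from 5000 to 20000 bytes gives the range 4096 to 20000.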

> > and truncate down needs to
> > extend to "maximum file offset" because we remove all extents beyond
> > EOF on a truncate down.
> 
> Ok, I was trying to allow a user to extend the file without conflicts if they
> were to have a pin on the 'beginning' of the original file.

If we want to allow file extension under a layout lease, the lease
has to extend beyond EOF, otherwise the new section of the file is
not covered by a lease. If leases only extend to the existing
EOF, then once the new data is written and the file is extended,
then the lease owner needs to take a new lease on the range they
just wrote. So the application ends up having to do write - lease -
write - lease - .... so that it has leases covering the range of the
file it is extending into.

Much better is to define a lease that extends to max file offset,
such that it always covers the range past the existing EOF and
extending writes will automatically be covered. What this then does
is to trigger layout break notifications on file size change, either
by write, truncate, or fallocate, without having to actually know or
track the exact file size in the lease....

> This sounds like
> you are saying that a layout lease must be dropped to do that?  In some ways I
> think I understand what you are driving at and I think I see how I may have
> been playing "fast and loose" with the strictness of the layout lease.  But
> from a user perspective if there is a part of the file which "does not exist"
> (beyond EOF) does it matter that the layout there may change?

Yes, it does, because userspace can directly manipulate the layout
beyond EOF via fallocate(). e.g. we can preallocate beyond EOF
without changing the file size, such that when we then do an
extending write no layout change actually takes place. The only
thing that happens from a layout point of view is that the file size
changes.
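
A minimal userspace sketch of that preallocation, assuming a Linux
system with glibc and a filesystem that supports fallocate(). The helper
names are made up for illustration; the flag is spelled
FALLOC_FL_KEEP_SIZE in the headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the current file size, or -1 on error. */
off_t file_size(int fd)
{
	struct stat st;

	return fstat(fd, &st) == -1 ? (off_t)-1 : st.st_size;
}

/*
 * Preallocate len bytes starting at the current EOF without moving EOF.
 * Returns 0 on success, -1 on error with errno set (EOPNOTSUPP if the
 * filesystem cannot do this).
 */
int prealloc_beyond_eof(int fd, off_t len)
{
	off_t size = file_size(fd);

	assert(len > 0);
	if (size == (off_t)-1)
		return -1;
	return fallocate(fd, FALLOC_FL_KEEP_SIZE, size, len);
}
```

After a successful call the reported file size is unchanged, although
the extent map now reaches past EOF.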

This becomes /interesting/ when you start doing things like

	lseek(fd, offset, SEEK_END);
	write(fd, buf, len);

which will trigger a write way beyond EOF into allocated space.
That will also trigger block zeroing at the old tail, and there may
be block zeroing around the write() as well. We've effectively
changed the layout of the file at EOF, and potentially beyond EOF.
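
The two calls above can be wrapped into a small, self-contained helper
to see the size change from userspace. This is only a sketch; the
function name is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes starting gap bytes past the current EOF.
 * Returns the new file size, or -1 on error.
 */
off_t write_past_eof(int fd, off_t gap, const void *buf, size_t len)
{
	assert(gap >= 0);

	/* Position the file offset beyond the current end of file. */
	if (lseek(fd, gap, SEEK_END) == (off_t)-1)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		return -1;

	/* Seeking to the end reports the new size. */
	return lseek(fd, 0, SEEK_END);
}
```

For a 4 byte file, a 4 byte write at a 4096 byte gap yields a size of
4104, and the bytes in the gap read back as zeros.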

Indeed, the app might be expecting the preallocation beyond EOF to
remain, so it might register a layout over that range to be notified
if the preallocation is removed or the EOF extends beyond it. It
needs to be notified on truncate down (which removes the
preallocated range the lease sits over) and when EOF is moved beyond
it (layout range state has changed from inaccessible to valid file
data)....


> > i.e. when we use preallocation, the extent map extends beyond EOF,
> > and layout leases need to be able to extend beyond the current EOF
> > to allow the lease owner to do extending writes, extending truncate,
> > preallocation beyond EOF, etc safely without having to get a new
> > lease to cover the new region in the extended file...
> 
> I'm not following this.  What determines when preallocation is done?

The application can direct it via fallocate(FALLOC_FL_KEEP_SIZE).
It's typically used for workloads that do appending O_DSYNC or
direct IO writes to minimise file fragmentation.

The filesystem can also choose to do allocation beyond EOF
speculatively during writes. XFS does this extensively with delayed
allocation. And the filesystem can also remove this speculative
allocation beyond EOF, which it may do if no pages have been dirtied
on the inode for a period, the inode is reclaimed, the filesystem
is running low on space, the user/group is running low on quota
space, etc.

Again, just because user data does not change, it does not mean that
the file layout will not change....

> Forgive my ignorance on file systems but how can we have a layout for every
> file which is "maximum file offset" for every file even if a file is only 1
> page long?

The layout lease doesn't care what the file size is. It doesn't even
know what the file size is. The layout lease covers a range of the
logical file offset space with the intent that any change to the file
layout within that range will result in a notification. The layout
lease is not bound to the range of valid data in the file at all -
it doesn't matter if it points beyond EOF - if the file grows to
a size where it overlaps the layout lease, then that layout lease
needs to be notified by break_layouts....

I've had a stinking headache all day, so I'm struggling to make
sense right now. The best I can describe is that layout lease ranges
do not imply or require valid file data to exist within the range
they are taken over - they just cover a file offset range.
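
Put as a sketch in plain C (hypothetical names, not the kernel
implementation): whether a lease needs a break notification depends only
on whether the changed offset range overlaps the lease range. The file
size never enters the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A range of logical file offsets; both ends are inclusive. */
struct offset_range {
	int64_t start;
	int64_t end;
};

/*
 * True if a layout change over "change" touches the range covered by
 * "lease". Note that no file size is passed in.
 */
bool lease_needs_break(struct offset_range lease, struct offset_range change)
{
	assert(lease.start <= lease.end && change.start <= change.end);
	return lease.start <= change.end && change.start <= lease.end;
}
```

A lease taken over the range 1MiB to 2MiB of a 4096 byte file is not
affected by a change inside the first block, but it is hit as soon as an
extending write changes the layout up to and past the 1MiB offset.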

FWIW, the fcntl() locking interface uses a length of 0 to
indicate "to max file offset" rather than a specific length. e.g.
SETLK and friends:

	Specifying 0 for l_len has the special meaning: lock all
	bytes starting at the location specified by l_whence and
	l_start through to the end of file, no matter how large the
	file grows.

That's exactly the semantics I'm talking about here - layout leases
need to be able to specify an extent anywhere within the valid file
offset range, and also to specify a nebulous "through to the end of
the layout range" so that file growth can be done without needing
new leases to be taken as the file grows....
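
For reference, here is a small sketch of the existing l_len == 0
behaviour using ordinary POSIX record locks (not layout leases). The
helper names are made up. The probe runs in a child process because
record locks are not inherited across fork(), so the child sees the
parent's lock as a conflict.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Take a write lock from start through to max file offset.
 * Returns 0 on success, -1 on error with errno set.
 */
int lock_to_max_offset(int fd, off_t start)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = 0;	/* special value: to end of file, however large */

	return fcntl(fd, F_SETLK, &fl);
}

/*
 * Ask from a child process whether [start, start + len) is covered by
 * a lock held by the calling (parent) process.
 * Returns 1 if covered, 0 if not, -1 on error.
 */
int covered_by_parent_lock(int fd, off_t start, off_t len)
{
	struct flock fl;
	int status;
	pid_t pid;

	assert(len > 0);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		memset(&fl, 0, sizeof(fl));
		fl.l_type = F_WRLCK;
		fl.l_whence = SEEK_SET;
		fl.l_start = start;
		fl.l_len = len;
		if (fcntl(fd, F_GETLK, &fl) == -1)
			_exit(2);
		/* F_UNLCK means nothing would block this lock request. */
		_exit(fl.l_type == F_UNLCK ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

A lock taken this way on an empty file still covers a probe at the 1MiB
offset, which is the behaviour being asked for from layout leases.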

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-12 17:36     ` Ira Weiny
@ 2019-08-14  8:05       ` Dave Chinner
  2019-08-14 11:21         ` Jeff Layton
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-14  8:05 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.weiny@intel.com wrote:
> > > +	/*
> > > +	 * NOTE on F_LAYOUT lease
> > > +	 *
> > > +	 * LAYOUT lease types are taken on files which the user knows that
> > > +	 * they will be pinning in memory for some indeterminate amount of
> > > +	 * time.
> > 
> > Indeed, layout leases have nothing to do with pinning of memory.
> 
> Yep, Fair enough.  I'll rework the comment.
> 
> > That's something an application taht uses layout leases might do,
> > but it largely irrelevant to the functionality layout leases
> > provide. What needs to be done here is explain what the layout lease
> > API actually guarantees w.r.t. the physical file layout, not what
> > some application is going to do with a lease. e.g.
> > 
> > 	The layout lease F_RDLCK guarantees that the holder will be
> > 	notified that the physical file layout is about to be
> > 	changed, and that it needs to release any resources it has
> > 	over the range of this lease, drop the lease and then
> > 	request it again to wait for the kernel to finish whatever
> > 	it is doing on that range.
> > 
> > 	The layout lease F_RDLCK also allows the holder to modify
> > 	the physical layout of the file. If an operation from the
> > 	lease holder occurs that would modify the layout, that lease
> > 	holder does not get notification that a change will occur,
> > 	but it will block until all other F_RDLCK leases have been
> > 	released by their holders before going ahead.
> > 
> > 	If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > 	holder will fail any operation that may modify the physical
> > 	layout of the file. F_WRLCK provides exclusive physical
> > 	modification access to the holder, guaranteeing nothing else
> > 	will change the layout of the file while it holds the lease.
> > 
> > 	The F_WRLCK holder can change the physical layout of the
> > 	file if it so desires, this will block while F_RDLCK holders
> > 	are notified and release their leases before the
> > 	modification will take place.
> > 
> > We need to define the semantics we expose to userspace first.....
> 
> Agreed.  I believe I have implemented the semantics you describe above.  Do I
> have your permission to use your verbiage as part of reworking the comment and
> commit message?

Of course. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
                   ` (18 preceding siblings ...)
  2019-08-09 22:58 ` [RFC PATCH v2 19/19] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
@ 2019-08-14 10:17 ` Jan Kara
  2019-08-14 18:08   ` Ira Weiny
  19 siblings, 1 reply; 110+ messages in thread
From: Jan Kara @ 2019-08-14 10:17 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	Dave Chinner, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

Hello!

On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> Pre-requisites
> ==============
> 	Based on mmotm tree.
> 
> Based on the feedback from LSFmm, the LWN article, the RFC series since
> then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> 
> Solution summary
> ================
> 
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated.  So really any solution we present which:
> 
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
> 
> Should be an acceptable solution.
> 
> Because this is slightly new behavior.  And because this is going to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
> 
> The following patches implement the following solution.
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>    layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>    encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.

So I didn't fully grok the patch set yet but by "pinned DAX page" do you
mean a page which has corresponding file_pin covering it? Or do you mean a
page which has pincount increased? If the first then I'd rephrase this to
be less ambiguous, if the second then I think it is wrong. 

> 4) The user has the option of holding the lease or releasing it.  If they
>    release it no other pin calls will work on the file.

Last time we spoke the plan was that the lease is kept while the pages are
pinned (and an attempt to release the lease would block until the pages are
unpinned). That also makes it clear that the *lease* is what is making
truncate and hole punch fail with ETXTBSY and the file_pin structure is
just an implementation detail of how the existence is efficiently tracked (and
what keeps the backing file for the pages open so that the lease does not
get auto-destroyed). Why did you change this?

> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>    depending on the internal subsystem needs.  With RDMA there is an owning
>    file which is related to the pined file.
> 
> 8) Only RDMA is currently supported

If you currently only need "owning file" variant in your patch set, then
I'd just implement that and leave "owning mm" variant for later if it
proves to be necessary. The things are complex enough as is...

> 9) Truncation of pages which are not actively pinned nor covered by a lease
>    will succeed.

Otherwise I like the design.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-14  8:05       ` Dave Chinner
@ 2019-08-14 11:21         ` Jeff Layton
  2019-08-14 11:38           ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: Jeff Layton @ 2019-08-14 11:21 UTC (permalink / raw)
  To: Dave Chinner, Ira Weiny
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Jan Kara, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, 2019-08-14 at 18:05 +1000, Dave Chinner wrote:
> On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> > On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.weiny@intel.com wrote:
> > > > +	/*
> > > > +	 * NOTE on F_LAYOUT lease
> > > > +	 *
> > > > +	 * LAYOUT lease types are taken on files which the user knows that
> > > > +	 * they will be pinning in memory for some indeterminate amount of
> > > > +	 * time.
> > > 
> > > Indeed, layout leases have nothing to do with pinning of memory.
> > 
> > Yep, Fair enough.  I'll rework the comment.
> > 
> > > That's something an application taht uses layout leases might do,
> > > but it largely irrelevant to the functionality layout leases
> > > provide. What needs to be done here is explain what the layout lease
> > > API actually guarantees w.r.t. the physical file layout, not what
> > > some application is going to do with a lease. e.g.
> > > 
> > > 	The layout lease F_RDLCK guarantees that the holder will be
> > > 	notified that the physical file layout is about to be
> > > 	changed, and that it needs to release any resources it has
> > > 	over the range of this lease, drop the lease and then
> > > 	request it again to wait for the kernel to finish whatever
> > > 	it is doing on that range.
> > > 
> > > 	The layout lease F_RDLCK also allows the holder to modify
> > > 	the physical layout of the file. If an operation from the
> > > 	lease holder occurs that would modify the layout, that lease
> > > 	holder does not get notification that a change will occur,
> > > 	but it will block until all other F_RDLCK leases have been
> > > 	released by their holders before going ahead.
> > > 
> > > 	If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > > 	holder will fail any operation that may modify the physical
> > > 	layout of the file. F_WRLCK provides exclusive physical
> > > 	modification access to the holder, guaranteeing nothing else
> > > 	will change the layout of the file while it holds the lease.
> > > 
> > > 	The F_WRLCK holder can change the physical layout of the
> > > 	file if it so desires, this will block while F_RDLCK holders
> > > 	are notified and release their leases before the
> > > 	modification will take place.
> > > 
> > > We need to define the semantics we expose to userspace first.....

Absolutely.

> > 
> > Agreed.  I believe I have implemented the semantics you describe above.  Do I
> > have your permission to use your verbiage as part of reworking the comment and
> > commit message?
> 
> Of course. :)
> 
> Cheers,
> 

I'll review this in more detail soon, but subsequent postings of the set
should probably also go to linux-api mailing list. This is a significant
API change. It also might not hurt to get the glibc folks involved here
too since you'll probably want to add the constants to the headers there
as well.

Finally, consider going ahead and drafting a patch to the fcntl(2)
manpage if you think you have the API mostly nailed down. This API is a
little counterintuitive (i.e. you can change the layout with an F_RDLCK
lease), so it will need to be very clearly documented. I've also found
that when creating a new API, documenting it tends to help highlight its
warts and areas where the behavior is not clearly defined.

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space
  2019-08-14 11:21         ` Jeff Layton
@ 2019-08-14 11:38           ` Dave Chinner
  0 siblings, 0 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-14 11:38 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ira Weiny, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Jan Kara, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 07:21:34AM -0400, Jeff Layton wrote:
> On Wed, 2019-08-14 at 18:05 +1000, Dave Chinner wrote:
> > On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> > > On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.weiny@intel.com wrote:
> > > > > +	/*
> > > > > +	 * NOTE on F_LAYOUT lease
> > > > > +	 *
> > > > > +	 * LAYOUT lease types are taken on files which the user knows that
> > > > > +	 * they will be pinning in memory for some indeterminate amount of
> > > > > +	 * time.
> > > > 
> > > > Indeed, layout leases have nothing to do with pinning of memory.
> > > 
> > > Yep, Fair enough.  I'll rework the comment.
> > > 
> > > > That's something an application taht uses layout leases might do,
> > > > but it largely irrelevant to the functionality layout leases
> > > > provide. What needs to be done here is explain what the layout lease
> > > > API actually guarantees w.r.t. the physical file layout, not what
> > > > some application is going to do with a lease. e.g.
> > > > 
> > > > 	The layout lease F_RDLCK guarantees that the holder will be
> > > > 	notified that the physical file layout is about to be
> > > > 	changed, and that it needs to release any resources it has
> > > > 	over the range of this lease, drop the lease and then
> > > > 	request it again to wait for the kernel to finish whatever
> > > > 	it is doing on that range.
> > > > 
> > > > 	The layout lease F_RDLCK also allows the holder to modify
> > > > 	the physical layout of the file. If an operation from the
> > > > 	lease holder occurs that would modify the layout, that lease
> > > > 	holder does not get notification that a change will occur,
> > > > 	but it will block until all other F_RDLCK leases have been
> > > > 	released by their holders before going ahead.
> > > > 
> > > > 	If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > > > 	holder will fail any operation that may modify the physical
> > > > 	layout of the file. F_WRLCK provides exclusive physical
> > > > 	modification access to the holder, guaranteeing nothing else
> > > > 	will change the layout of the file while it holds the lease.
> > > > 
> > > > 	The F_WRLCK holder can change the physical layout of the
> > > > 	file if it so desires, this will block while F_RDLCK holders
> > > > 	are notified and release their leases before the
> > > > 	modification will take place.
> > > > 
> > > > We need to define the semantics we expose to userspace first.....
> 
> Absolutely.
> 
> > > 
> > > Agreed.  I believe I have implemented the semantics you describe above.  Do I
> > > have your permission to use your verbiage as part of reworking the comment and
> > > commit message?
> > 
> > Of course. :)
> > 
> > Cheers,
> > 
> 
> I'll review this in more detail soon, but subsequent postings of the set
> should probably also go to linux-api mailing list. This is a significant
> API change. It might not also hurt to get the glibc folks involved here
> too since you'll probably want to add the constants to the headers there
> as well.

Sure, but let's first get it to the point where we have something
that is actually workable, much more complete and somewhat validated
with unit tests before we start involving too many people. Wide
review of prototype code isn't really a good use of resources given
how much it's probably going to change from here...

> Finally, consider going ahead and drafting a patch to the fcntl(2)
> manpage if you think you have the API mostly nailed down. This API is a
> little counterintuitive (i.e. you can change the layout with an F_RDLCK
> lease), so it will need to be very clearly documented. I've also found
> that when creating a new API, documenting it tends to help highlight its
> warts and areas where the behavior is not clearly defined.

I find writing unit tests for xfstests to validate the new APIs work
as intended finds far more problems with the API than writing the
documentation. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-13 20:38                 ` Ira Weiny
@ 2019-08-14 12:23                   ` Jason Gunthorpe
  2019-08-14 17:50                     ` Ira Weiny
  2019-09-04 22:25                     ` Ira Weiny
  0 siblings, 2 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-14 12:23 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > 
> > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > that some other thread is) destroying all the MR's we have associated with this
> > > FD.
> > 
> > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > deletes any underlying HW resources, but the FD persists.
> 
> I misspoke.  I should have said associated with this "context".  And of course
> uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> struct file which had file_pins hanging off of it would be getting its file
> pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> after uverbs_destroy_ufile_hw() is done.
> 
> But since it does not block it may be that the struct file is gone before the
> MR is actually destroyed.  Which means I think the GUP code would blow up in
> that case...  :-(

Oh, yes, that is true, you also can't rely on the struct file living
longer than the HW objects either, that isn't how the lifetime model
works.

If GUP consumes the struct file it must allow the struct file to be
deleted before the GUP pin is released.

> The drivers could provide some generic object (in RDMA this could be the
> uverbs_attr_bundle) which represents their "context".

For RDMA the obvious context is the struct ib_mr *

> But for the procfs interface, that context then needs to be associated with any
> file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> would be up to the driver to "install" a procfs handler into any struct file
> which _may_ point to this context.  (before _or_ after memory pins).

Is this all just for debugging? Seems like a lot of complication just
to print a string

Generally, I think you'd be better to associate things with the
mm_struct not some struct file... The whole design is simpler as GUP
already has the mm_struct.

Jason

* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-09 22:58 ` [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease ira.weiny
@ 2019-08-14 14:15   ` Jeff Layton
  2019-08-14 21:56     ` Dave Chinner
  2019-09-04 23:12   ` John Hubbard
  1 sibling, 1 reply; 110+ messages in thread
From: Jeff Layton @ 2019-08-14 14:15 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, 2019-08-09 at 15:58 -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Add an exclusive lease flag which indicates that the layout mechanism
> can not be broken.
> 
> Exclusive layout leases allow the file system to know that pages may be
> GUP pined and that attempts to change the layout, ie truncate, should be
> failed.
> 
> A process which attempts to break it's own exclusive lease gets an
> EDEADLOCK return to help determine that this is likely a programming bug
> vs someone else holding a resource.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/locks.c                       | 23 +++++++++++++++++++++--
>  include/linux/fs.h               |  1 +
>  include/uapi/asm-generic/fcntl.h |  2 ++
>  3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index ad17c6ffca06..0c7359cdab92 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -626,6 +626,8 @@ static int lease_init(struct file *filp, long type, unsigned int flags,
>  	fl->fl_flags = FL_LEASE;
>  	if (flags & FL_LAYOUT)
>  		fl->fl_flags |= FL_LAYOUT;
> +	if (flags & FL_EXCLUSIVE)
> +		fl->fl_flags |= FL_EXCLUSIVE;
>  	fl->fl_start = 0;
>  	fl->fl_end = OFFSET_MAX;
>  	fl->fl_ops = NULL;
> @@ -1619,6 +1621,14 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  	list_for_each_entry_safe(fl, tmp, &ctx->flc_lease, fl_list) {
>  		if (!leases_conflict(fl, new_fl))
>  			continue;
> +		if (fl->fl_flags & FL_EXCLUSIVE) {
> +			error = -ETXTBSY;
> +			if (new_fl->fl_pid == fl->fl_pid) {
> +				error = -EDEADLOCK;
> +				goto out;
> +			}
> +			continue;
> +		}
>  		if (want_write) {
>  			if (fl->fl_flags & FL_UNLOCK_PENDING)
>  				continue;
> @@ -1634,6 +1644,13 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  			locks_delete_lock_ctx(fl, &dispose);
>  	}
>  
> +	/* We differentiate between -EDEADLOCK and -ETXTBSY so the above loop
> +	 * continues with -ETXTBSY looking for a potential deadlock instead.
> +	 * If deadlock is not found go ahead and return -ETXTBSY.
> +	 */
> +	if (error == -ETXTBSY)
> +		goto out;
> +
>  	if (list_empty(&ctx->flc_lease))
>  		goto out;
>  
> @@ -2044,9 +2061,11 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
>  	 * to revoke the lease in break_layout()  And this is done by using
>  	 * F_WRLCK in the break code.
>  	 */
> -	if (arg == F_LAYOUT) {
> +	if ((arg & F_LAYOUT) == F_LAYOUT) {
> +		if ((arg & F_EXCLUSIVE) == F_EXCLUSIVE)
> +			flags |= FL_EXCLUSIVE;
>  		arg = F_RDLCK;
> -		flags = FL_LAYOUT;
> +		flags |= FL_LAYOUT;
>  	}
>  
>  	fl = lease_alloc(filp, arg, flags);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index dd60d5be9886..2e41ce547913 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1005,6 +1005,7 @@ static inline struct file *get_file(struct file *f)
>  #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
>  #define FL_LAYOUT	2048	/* outstanding pNFS layout or user held pin */
> +#define FL_EXCLUSIVE	4096	/* Layout lease is exclusive */
>  
>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>  
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index baddd54f3031..88b175ceccbc 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -176,6 +176,8 @@ struct f_owner_ex {
>  
>  #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
>  				   RDMA */
> +#define F_EXCLUSIVE	32      /* layout lease is exclusive */
> +				/* FIXME or shoudl this be F_EXLCK??? */
>  
>  /* operations for bsd flock(), also used by the kernel implementation */
>  #define LOCK_SH		1	/* shared lock */

This interface just seems weird to me. The existing F_*LCK values aren't
really set up to be flags, but are enumerated values (even if there are
some gaps on some arches). For instance, on parisc and sparc:

/* for posix fcntl() and lockf() */
#define F_RDLCK         01
#define F_WRLCK         02
#define F_UNLCK         03

While your new flag values are well above these values, it's still a bit
sketchy to do what you're proposing from a cross-platform interface
standpoint.
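
To make the concern concrete, here is a small sketch using the
parisc/sparc values quoted above, with a DEMO_ prefix so they do not
clash with <fcntl.h>. The has_flag() helper is hypothetical and simply
mirrors the "(arg & F_LAYOUT) == F_LAYOUT" style of test in the patch.
On the asm-generic side F_RDLCK is 0, so it cannot be tested as a bit
at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock type values as quoted above for parisc and sparc. */
enum {
	DEMO_F_RDLCK = 01,
	DEMO_F_WRLCK = 02,
	DEMO_F_UNLCK = 03,
};

/* The "unlock" value is bit-for-bit the OR of the two lock values. */
static_assert((DEMO_F_RDLCK | DEMO_F_WRLCK) == DEMO_F_UNLCK,
	      "enumerated lock types overlap when treated as bits");

/* Flag-style membership test, valid only for independent bits. */
bool has_flag(long arg, long flag)
{
	return (arg & flag) == flag;
}
```

With these values a flag-style test reports that F_UNLCK "contains" both
F_RDLCK and F_WRLCK, which is why mixing new flag bits into the same
argument is fragile.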

I think this would be a lot cleaner if you weren't overloading the
F_SETLEASE command with new flags, and instead added new
F_SETLAYOUT/F_GETLAYOUT cmd values.

You'd then be free to define a new set of "arg" values for use with
layouts, and there'd be a clear distinction interface-wise between
setting a layout and a lease.

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-14 12:23                   ` Jason Gunthorpe
@ 2019-08-14 17:50                     ` Ira Weiny
  2019-08-14 18:15                       ` Jason Gunthorpe
  2019-09-04 22:25                     ` Ira Weiny
  1 sibling, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-14 17:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > 
> > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > that some other thread is) destroying all the MR's we have associated with this
> > > > FD.
> > > 
> > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > deletes any underlying HW resources, but the FD persists.
> > 
> > I misspoke.  I should have said associated with this "context".  And of course
> > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > struct file which had file_pins hanging off of it would be getting its file
> > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > after uverbs_destroy_ufile_hw() is done.
> > 
> > But since it does not block it may be that the struct file is gone before the
> > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > that case...  :-(
> 
> Oh, yes, that is true, you also can't rely on the struct file living
> longer than the HW objects either, that isn't how the lifetime model
> works.
> 
> If GUP consumes the struct file it must allow the struct file to be
> deleted before the GUP pin is released.

I may have to think about this a bit.  But I'm starting to lean toward my
callback method as a solution...

> 
> > The drivers could provide some generic object (in RDMA this could be the
> > uverbs_attr_bundle) which represents their "context".
> 
> For RDMA the obvious context is the struct ib_mr *

Not really, but maybe.  See below regarding tracking this across processes.

> 
> > But for the procfs interface, that context then needs to be associated with any
> > file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> > would be up to the driver to "install" a procfs handler into any struct file
> > which _may_ point to this context.  (before _or_ after memory pins).
> 
> Is this all just for debugging? Seems like a lot of complication just
> to print a string

No, this is a requirement to allow an admin to determine why their truncates
may be failing.  As per our discussion here:

https://lkml.org/lkml/2019/6/7/982

Looking back at the thread apparently no one confirmed my question (assertion).
But no one objected to it either!  :-D  From that post:

	"... if we can keep track of who has the pins in lsof can we agree no
	process needs to be SIGKILL'ed?  Admins can do this on their own
	"killing" if they really need to stop the use of these files, right?"

What I am trying to do here is ensure that no matter what the user does
(fork, munmap, SCM_RIGHTS, close on any FD), the underlying pin is
associated with any process which has access to those pins and is holding
references to those pages.  Then any user of the system who gets a failing
truncate can figure out which processes are holding this up.

> 
> Generally, I think you'd be better to associate things with the
> mm_struct not some struct file... The whole design is simpler as GUP
> already has the mm_struct.

I wish I _could_ do that...  And for some simple users I do that.  This is why
rdma_pin has the option to track against mm_struct _OR_ struct file.

At first it seemed like carrying over the mm_struct info during fork would
work...  but then there is SCM_RIGHTS where one can share the RDMA context with
any "random" process...  AFAICS struct file has no concept of mm_struct (nor
should it) so the dup for SCM_RIGHTS processing would not be able to do this.
A further complication was that when the RDMA FD is dup'ed the RDMA subsystem
does not know about it...  So it was not straightforward to have the RDMA
subsystem do this either.  Not to mention that would be yet another
complication the drivers would have to deal with...  I think you had similar
issues which led to the use of an "owning_mm" in the umem object.  So while
_some_ mm_struct is held it may not be visible to the user since that mm_struct
may belong to a process which is gone... Or even if not gone, killing it would not
fully remove the pin...

So keeping this tracked against struct file works (and seemed straightforward)
no matter where/how the RDMA FD is shared...  Even with the complication above
I still think it is easier to do this way.

If I am missing something WRT the mm_struct "I'm all ears".

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-14 10:17 ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Jan Kara
@ 2019-08-14 18:08   ` Ira Weiny
  2019-08-15 13:05     ` Jan Kara
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-14 18:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> Hello!
> 
> On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > Pre-requisites
> > ==============
> > 	Based on mmotm tree.
> > 
> > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > 
> > Solution summary
> > ================
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is going to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 0) Registrations to Device DAX char devs are not affected
> > 
> > 1) The user has to opt in to allowing page pins on a file with an exclusive
> >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > 
> > 2) page pins will fail if the lease is not active when the file back page is
> >    encountered.
> > 
> > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> mean a page which has corresponding file_pin covering it? Or do you mean a
> page which has pincount increased? If the first then I'd rephrase this to
> be less ambiguous, if the second then I think it is wrong. 

I mean the second.  But by "fail" I mean hang.  Right now the "normal" page
pincount processing will hang the truncate.  Given the discussion with John H,
we can make this a bit better if we use something like FOLL_PIN and the page
count bias to indicate this type of pin.  Then I could fail the truncate
outright.  But that is not done yet.

So... I used the word "fail" to be a bit more vague as the final implementation
may return ETXTBSY or hang as noted.

> 
> > 4) The user has the option of holding the lease or releasing it.  If they
> >    release it no other pin calls will work on the file.
> 
> Last time we spoke the plan was that the lease is kept while the pages are
> pinned (and an attempt to release the lease would block until the pages are
> unpinned). That also makes it clear that the *lease* is what is making
> truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> just an implementation detail how the existence is efficiently tracked (and
> what keeps the backing file for the pages open so that the lease does not
> get auto-destroyed). Why did you change this?

Closing the file _and_ unmapping it will cause the lease to be released
regardless of whether we allow this or not.

As we discussed, preventing the close seemed intractable.

I thought about failing the munmap but that seemed wrong as well.  But more
importantly AFAIK RDMA can pass its memory pins to other processes via FD
passing...  This means that one could pin this memory, pass it to another
process and exit.  The file lease on the pinned file is lost.

The file lease is just a key to get the memory pin.  Once unlocked the procfs
tracking keeps track of where that pin goes and which processes need to be
killed to get rid of it.

> 
> > 5) Closing the file is ok.
> > 
> > 6) Unmapping the file is ok
> > 
> > 7) Pins against the files are tracked back to an owning file or an owning mm
> >    depending on the internal subsystem needs.  With RDMA there is an owning
> >    file which is related to the pined file.
> > 
> > 8) Only RDMA is currently supported
> 
> If you currently only need "owning file" variant in your patch set, then
> I'd just implement that and leave "owning mm" variant for later if it
> proves to be necessary. The things are complex enough as is...

I can do that...  I was trying to get io_uring working as well with the
owning_mm but I should save that for later.

> 
> > 9) Truncation of pages which are not actively pinned nor covered by a lease
> >    will succeed.
> 
> Otherwise I like the design.

Thanks,
Ira

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-14 17:50                     ` Ira Weiny
@ 2019-08-14 18:15                       ` Jason Gunthorpe
  0 siblings, 0 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-14 18:15 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 10:50:45AM -0700, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > > 
> > > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > > that some other thread is) destroying all the MR's we have associated with this
> > > > > FD.
> > > > 
> > > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > > deletes any underlying HW resources, but the FD persists.
> > > 
> > > I misspoke.  I should have said associated with this "context".  And of course
> > > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > > struct file which had file_pins hanging off of it would be getting its file
> > > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > > after uverbs_destroy_ufile_hw() is done.
> > > 
> > > But since it does not block it may be that the struct file is gone before the
> > > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > > that case...  :-(
> > 
> > Oh, yes, that is true, you also can't rely on the struct file living
> > longer than the HW objects either, that isn't how the lifetime model
> > works.
> > 
> > If GUP consumes the struct file it must allow the struct file to be
> > deleted before the GUP pin is released.
> 
> I may have to think about this a bit.  But I'm starting to lean toward my
> callback method as a solution...
> 
> > 
> > > The drivers could provide some generic object (in RDMA this could be the
> > > uverbs_attr_bundle) which represents their "context".
> > 
> > For RDMA the obvious context is the struct ib_mr *
> 
> Not really, but maybe.  See below regarding tracking this across processes.
> 
> > 
> > > But for the procfs interface, that context then needs to be associated with any
> > > file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> > > would be up to the driver to "install" a procfs handler into any struct file
> > > which _may_ point to this context.  (before _or_ after memory pins).
> > 
> > Is this all just for debugging? Seems like a lot of complication just
> > to print a string
> 
> No, this is a requirement to allow an admin to determine why their truncates
> may be failing.  As per our discussion here:
> 
> https://lkml.org/lkml/2019/6/7/982

visibility/debugging..

I don't see any solution here with the struct file - we apparently
have a problem with deadlock if the uverbs close() waits, as mmput()
can trigger a call to close() - see the comment on top of
uverbs_destroy_ufile_hw()

However, I wonder if that is now old information since commit
4a9d4b024a31 ("switch fput to task_work_add") makes fput deferred, so
mmdrop() should not drop waiting on fput??

If you could unwrap this mystery, probably with some testing proof,
then we could make uverbs_destroy_ufile_hw() a fence even for close
and your task is much simpler.

The general flow to trigger is to have a process that has mmap'd
something from the uverbs fd, then trigger both device disassociate
and process exit with just the right race so that the process has
exited enough that the mmdrop on the disassociate thread does the
final cleanup triggering the VMAs inside the mm to do the final fput
on their FDs, triggering final fput() for uverbs inside the thread of
disassociate.

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-14 14:15   ` Jeff Layton
@ 2019-08-14 21:56     ` Dave Chinner
  2019-08-26 10:41       ` Jeff Layton
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-14 21:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ira.weiny, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Jan Kara, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> On Fri, 2019-08-09 at 15:58 -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Add an exclusive lease flag which indicates that the layout mechanism
> > can not be broken.
> > 
> > Exclusive layout leases allow the file system to know that pages may be
> > GUP pined and that attempts to change the layout, ie truncate, should be
> > failed.
> > 
> > A process which attempts to break it's own exclusive lease gets an
> > EDEADLOCK return to help determine that this is likely a programming bug
> > vs someone else holding a resource.
.....
> > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > index baddd54f3031..88b175ceccbc 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -176,6 +176,8 @@ struct f_owner_ex {
> >  
> >  #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
> >  				   RDMA */
> > +#define F_EXCLUSIVE	32      /* layout lease is exclusive */
> > +				/* FIXME or shoudl this be F_EXLCK??? */
> >  
> >  /* operations for bsd flock(), also used by the kernel implementation */
> >  #define LOCK_SH		1	/* shared lock */
> 
> This interface just seems weird to me. The existing F_*LCK values aren't
> really set up to be flags, but are enumerated values (even if there are
> some gaps on some arches). For instance, on parisc and sparc:

I don't think we need to worry about this - the F_WRLCK version of
the layout lease should have these exclusive access semantics (i.e
other ops fail rather than block waiting for lease recall) and hence
the API shouldn't need a new flag to specify them.

i.e. the primary difference between F_RDLCK and F_WRLCK layout
leases is that the F_RDLCK is a shared, co-operative lease model
where only delays in operations will be seen, while F_WRLCK is a
"guarantee exclusive access and I don't care what it breaks"
model... :)
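
A minimal userspace sketch of the two lease models described above, offered
only as an illustration.  It is not code from the patch set: the enum names,
the function names, and the choice of -ETXTBSY as the failure value are all
assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the two layout lease flavours discussed above. */
enum layout_lease { LEASE_NONE, LEASE_RDLCK, LEASE_WRLCK };

/* What an operation that wants to change the file layout would observe. */
enum layout_op_result { OP_PROCEEDS, OP_WAITS_FOR_RECALL, OP_FAILS };

/*
 * F_RDLCK-style lease: shared and co-operative, so a conflicting layout
 * change (e.g. truncate) is only delayed until the lease is recalled.
 * F_WRLCK-style lease: exclusive, so the conflicting operation fails
 * instead of blocking.
 */
static enum layout_op_result layout_change_outcome(enum layout_lease lease)
{
	assert(lease == LEASE_NONE || lease == LEASE_RDLCK ||
	       lease == LEASE_WRLCK);

	switch (lease) {
	case LEASE_RDLCK:
		return OP_WAITS_FOR_RECALL;
	case LEASE_WRLCK:
		return OP_FAILS;
	default:
		return OP_PROCEEDS;
	}
}

/*
 * errno-style view of the same decision: 0 means the operation runs
 * (possibly after a delay); -ETXTBSY is an assumed failure code.
 */
static int layout_change_errno(enum layout_lease lease)
{
	return layout_change_outcome(lease) == OP_FAILS ? -ETXTBSY : 0;
}
```

In this model the lease type alone selects between "delay" and "fail", which
is why no separate exclusive flag would be needed.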

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-14 18:08   ` Ira Weiny
@ 2019-08-15 13:05     ` Jan Kara
  2019-08-16 19:05       ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jan Kara @ 2019-08-15 13:05 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	Dave Chinner, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > Pre-requisites
> > > ==============
> > > 	Based on mmotm tree.
> > > 
> > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > 
> > > Solution summary
> > > ================
> > > 
> > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is going to be
> > > specific to DAX (because of the lack of a page cache) we have made the user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 0) Registrations to Device DAX char devs are not affected
> > > 
> > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > 
> > > 2) page pins will fail if the lease is not active when the file back page is
> > >    encountered.
> > > 
> > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > 
> > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > mean a page which has corresponding file_pin covering it? Or do you mean a
> > page which has pincount increased? If the first then I'd rephrase this to
> > be less ambiguous, if the second then I think it is wrong. 
> 
> I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> pincount processing will hang the truncate.  Given the discussion with John H
> we can make this a bit better if we use something like FOLL_PIN and the page
> count bias to indicate this type of pin.  Then I could fail the truncate
> outright.  but that is not done yet.
> 
> so... I used the word "fail" to be a bit more vague as the final implementation
> may return ETXTBUSY or hang as noted.

Ah, OK. Hanging is fine in principle but with longterm pins, your work
makes sure they actually fail with ETXTBSY, doesn't it? The thing is that
e.g. DIO will use page pins as well for its buffers and we must wait there
until the pin is released. So please just clarify your 'fail' here a bit
:).

> > > 4) The user has the option of holding the lease or releasing it.  If they
> > >    release it no other pin calls will work on the file.
> > 
> > Last time we spoke the plan was that the lease is kept while the pages are
> > pinned (and an attempt to release the lease would block until the pages are
> > unpinned). That also makes it clear that the *lease* is what is making
> > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > just an implementation detail how the existence is efficiently tracked (and
> > what keeps the backing file for the pages open so that the lease does not
> > get auto-destroyed). Why did you change this?
> 
> closing the file _and_ unmaping it will cause the lease to be released
> regardless of if we allow this or not.
> 
> As we discussed preventing the close seemed intractable.

Yes, preventing the application from closing the file is difficult. But
from a quick look at your patches it seemed to me that you actually hold a
backing file reference from the file_pin structure thus even though the
application closes its file descriptor, the struct file (and thus the
lease) lives further until the file_pin gets released. And that should last
as long as the pages are pinned. Am I missing something?
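
The lifetime argument above can be sketched with a toy reference-count model.
This is a userspace illustration under stated assumptions, not the kernel's
struct file or the patch set's file_pin: every name here (model_file,
model_get, model_put, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct file: a reference count plus the lease it carries. */
struct model_file {
	int refcount;     /* references held by FDs and by file_pins */
	bool lease_held;  /* torn down only when the last reference goes */
};

static void model_get(struct model_file *f)
{
	assert(f->refcount > 0);  /* a dead file cannot be revived */
	f->refcount++;
}

static void model_put(struct model_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->lease_held = false;  /* final put: the lease goes away */
}

/* Open the file and take the lease: one reference, owned by the FD. */
static struct model_file model_open_with_lease(void)
{
	struct model_file f = { .refcount = 1, .lease_held = true };
	return f;
}

/*
 * Scenario: pin, close the FD, later unpin.  Returns whether the lease
 * was still held while the pin was the only remaining reference.
 */
static bool lease_survives_close_while_pinned(void)
{
	struct model_file f = model_open_with_lease();
	bool held_after_close;

	model_get(&f);                   /* file_pin takes its own reference */
	model_put(&f);                   /* application close()s its FD */
	held_after_close = f.lease_held;
	model_put(&f);                   /* pages unpinned, file_pin released */
	assert(!f.lease_held);
	return held_after_close;
}
```

Under these assumptions, closing the descriptor alone does not end the lease;
only dropping the pin's reference does.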

> I thought about failing the munmap but that seemed wrong as well.  But more
> importantly AFAIK RDMA can pass its memory pins to other processes via FD
> passing...  This means that one could pin this memory, pass it to another
> process and exit.  The file lease on the pin'ed file is lost.

Not if file_pin grabs struct file reference as I mentioned above...
 
> The file lease is just a key to get the memory pin.  Once unlocked the procfs
> tracking keeps track of where that pin goes and which processes need to be
> killed to get rid of it.

I think having file lease being just a key to get the pin is conceptually
wrong. The lease is what expresses: "I'm accessing these blocks directly,
don't touch them without coordinating with me." So it would be only natural
if we maintained the lease while we are accessing blocks instead of
transferring this protection responsibility to another structure - namely
file_pin - and letting the lease go. But maybe I miss some technical reason
why maintaining file lease is difficult. If that's the case, I'd like to hear
what...
 
> > > 5) Closing the file is ok.
> > > 
> > > 6) Unmapping the file is ok
> > > 
> > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > >    file which is related to the pined file.
> > > 
> > > 8) Only RDMA is currently supported
> > 
> > If you currently only need "owning file" variant in your patch set, then
> > I'd just implement that and leave "owning mm" variant for later if it
> > proves to be necessary. The things are complex enough as is...
> 
> I can do that...  I was trying to get io_uring working as well with the
> owning_mm but I should save that for later.

Ah, OK. Yes, I guess io_uring can be next step.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-15 13:05     ` Jan Kara
@ 2019-08-16 19:05       ` Ira Weiny
  2019-08-16 23:20         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Ira Weiny
  2019-08-17  2:26         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Dave Chinner
  0 siblings, 2 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-16 19:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > Pre-requisites
> > > > ==============
> > > > 	Based on mmotm tree.
> > > > 
> > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > 
> > > > Solution summary
> > > > ================
> > > > 
> > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > memory which is then truncated.  So really any solution we present which:
> > > > 
> > > > A) Prevents file system corruption or data leaks
> > > > ...and...
> > > > B) Informs the user that they did something wrong
> > > > 
> > > > Should be an acceptable solution.
> > > > 
> > > > Because this is slightly new behavior.  And because this is going to be
> > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > "opt in" to this behavior.
> > > > 
> > > > The following patches implement the following solution.
> > > > 
> > > > 0) Registrations to Device DAX char devs are not affected
> > > > 
> > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > 
> > > > 2) page pins will fail if the lease is not active when the file back page is
> > > >    encountered.
> > > > 
> > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > 
> > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > page which has pincount increased? If the first then I'd rephrase this to
> > > be less ambiguous, if the second then I think it is wrong. 
> > 
> > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > pincount processing will hang the truncate.  Given the discussion with John H
> > we can make this a bit better if we use something like FOLL_PIN and the page
> > count bias to indicate this type of pin.  Then I could fail the truncate
> > outright.  but that is not done yet.
> > 
> > so... I used the word "fail" to be a bit more vague as the final implementation
> > may return ETXTBUSY or hang as noted.
> 
> Ah, OK. Hanging is fine in principle but with longterm pins, your work
> makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> e.g. DIO will use page pins as well for its buffers and we must wait there
> until the pin is released. So please just clarify your 'fail' here a bit
> :).

It will fail with ETXTBSY.  I've fixed a bug...  See below.

> 
> > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > >    release it no other pin calls will work on the file.
> > > 
> > > Last time we spoke the plan was that the lease is kept while the pages are
> > > pinned (and an attempt to release the lease would block until the pages are
> > > unpinned). That also makes it clear that the *lease* is what is making
> > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > just an implementation detail how the existence is efficiently tracked (and
> > > what keeps the backing file for the pages open so that the lease does not
> > > get auto-destroyed). Why did you change this?
> > 
> > closing the file _and_ unmaping it will cause the lease to be released
> > regardless of if we allow this or not.
> > 
> > As we discussed preventing the close seemed intractable.
> 
> Yes, preventing the application from closing the file is difficult. But
> from a quick look at your patches it seemed to me that you actually hold a
> backing file reference from the file_pin structure thus even though the
> application closes its file descriptor, the struct file (and thus the
> lease) lives further until the file_pin gets released. And that should last
> as long as the pages are pinned. Am I missing something?
> 
> > I thought about failing the munmap but that seemed wrong as well.  But more
> > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > passing...  This means that one could pin this memory, pass it to another
> > process and exit.  The file lease on the pin'ed file is lost.
> 
> Not if file_pin grabs struct file reference as I mentioned above...
>  
> > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > tracking keeps track of where that pin goes and which processes need to be
> > killed to get rid of it.
> 
> I think having file lease being just a key to get the pin is conceptually
> wrong. The lease is what expresses: "I'm accessing these blocks directly,
> don't touch them without coordinating with me." So it would be only natural
> if we maintained the lease while we are accessing blocks instead of
> transferring this protection responsibility to another structure - namely
> file_pin - and letting the lease go.

We do transfer that protection to the file_pin but we don't have to "let the
lease go".  We just keep the lease with the file_pin as you said.  See below...

> But maybe I miss some technical reason
> why maintaining file lease is difficult. If that's the case, I'd like to hear
> what...

Ok, I've thought a bit about what you said and indeed it should work that way.
The reason I had to think a bit is that I was not sure why I thought we needed
to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
but still not good enough to allow this...

1) I had a bug in the XFS code which should have failed rather than hanging...
   So this was not a good reason...  And I was able to find/fix it...  Thanks!

2) Second reason is that I thought I did not have a good way to tell if the
   lease was actually in use.  What I mean is that letting the lease go should
   be ok IFF we don't have any pins...  I was thinking that without John's code
   we don't have a way to know if there are any pins...  But that is wrong...
   All we have to do is check

	!list_empty(file->file_pins)

So now with this detail I think you are right, we should be able to hold the
lease through the struct file even if the process no longer has any
"references" to it (ie closes and munmaps the file).

I'm going to add a patch to fail releasing the lease and remove this (item 4)
as part of the overall solution.
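
A rough userspace sketch of that check, assuming the pins hang off the file on
a kernel-style circular list.  Nothing here is taken from the actual patch:
the list helpers only mimic the list_head idiom, the struct and function names
are hypothetical, and -EBUSY is a placeholder for whatever errno the real
patch would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, mimicking the kernel's list_head. */
struct list_node { struct list_node *next, *prev; };

static void list_init(struct list_node *h) { h->next = h->prev = h; }
static bool list_is_empty(const struct list_node *h) { return h->next == h; }

static void list_add_node(struct list_node *n, struct list_node *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);  /* leave the removed node self-linked */
}

/* Hypothetical stand-ins for struct file and its file_pin entries. */
struct leased_file { struct list_node file_pins; bool lease_held; };
struct pin_entry { struct list_node link; };

static void leased_file_init(struct leased_file *f)
{
	list_init(&f->file_pins);
	f->lease_held = true;
}

/* Refuse to drop the lease while any pin is still tracked on the file. */
static int leased_file_release_lease(struct leased_file *f)
{
	assert(f != NULL);
	if (!list_is_empty(&f->file_pins))
		return -EBUSY;  /* placeholder errno, see note above */
	f->lease_held = false;
	return 0;
}
```

With a pin on the list the release is refused and the lease stays in place;
once the pin is removed the same call succeeds.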

>  
> > > > 5) Closing the file is ok.
> > > > 
> > > > 6) Unmapping the file is ok
> > > > 
> > > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > > >    file which is related to the pined file.
> > > > 
> > > > 8) Only RDMA is currently supported
> > > 
> > > If you currently only need "owning file" variant in your patch set, then
> > > I'd just implement that and leave "owning mm" variant for later if it
> > > proves to be necessary. The things are complex enough as is...
> > 
> > I can do that...  I was trying to get io_uring working as well with the
> > owning_mm but I should save that for later.
> 
> Ah, OK. Yes, I guess io_uring can be next step.

FWIW I have split the mm_struct stuff out.  I can keep it as a follow-on series
for other users later.  At this point I have to solve the issue Jason brought
up WRT the RDMA file reference counting.

Thanks!
Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-16 19:05       ` Ira Weiny
@ 2019-08-16 23:20         ` Ira Weiny
  2019-08-19  6:36           ` Jan Kara
  2019-08-17  2:26         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Dave Chinner
  1 sibling, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-16 23:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, Theodore Ts'o, linux-nvdimm, linux-rdma,
	John Hubbard, Dave Chinner, linux-kernel, Matthew Wilcox,
	linux-xfs, Jason Gunthorpe, linux-mm, linux-fsdevel,
	Andrew Morton, linux-ext4

On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > Hello!
> > > > 
> > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > > Pre-requisites
> > > > > ==============
> > > > > 	Based on mmotm tree.
> > > > > 
> > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > > 
> > > > > Solution summary
> > > > > ================
> > > > > 
> > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > > memory which is then truncated.  So really any solution we present which:
> > > > > 
> > > > > A) Prevents file system corruption or data leaks
> > > > > ...and...
> > > > > B) Informs the user that they did something wrong
> > > > > 
> > > > > Should be an acceptable solution.
> > > > > 
> > > > > Because this is slightly new behavior.  And because this is going to be
> > > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > > "opt in" to this behavior.
> > > > > 
> > > > > The following patches implement the following solution.
> > > > > 
> > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > 
> > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > > 
> > > > > 2) page pins will fail if the lease is not active when the file back page is
> > > > >    encountered.
> > > > > 
> > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > > 
> > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > > page which has pincount increased? If the first then I'd rephrase this to
> > > > be less ambiguous, if the second then I think it is wrong. 
> > > 
> > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > > pincount processing will hang the truncate.  Given the discussion with John H
> > > we can make this a bit better if we use something like FOLL_PIN and the page
> > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > outright.  but that is not done yet.
> > > 
> > > so... I used the word "fail" to be a bit more vague as the final implementation
> > > may return ETXTBSY or hang as noted.
> > 
> > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > makes sure they actually fail with ETXTBSY, doesn't it? The thing is that
> > e.g. DIO will use page pins as well for its buffers and we must wait there
> > until the pin is released. So please just clarify your 'fail' here a bit
> > :).
> 
> It will fail with ETXTBSY.  I've fixed a bug...  See below.
> 
> > 
> > > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > > >    release it no other pin calls will work on the file.
> > > > 
> > > > Last time we spoke the plan was that the lease is kept while the pages are
> > > > pinned (and an attempt to release the lease would block until the pages are
> > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > truncate and hole punch fail with ETXTBSY and the file_pin structure is
> > > > just an implementation detail how the existence is efficiently tracked (and
> > > > what keeps the backing file for the pages open so that the lease does not
> > > > get auto-destroyed). Why did you change this?
> > > 
> > > closing the file _and_ unmaping it will cause the lease to be released
> > > regardless of if we allow this or not.
> > > 
> > > As we discussed preventing the close seemed intractable.
> > 
> > Yes, preventing the application from closing the file is difficult. But
> > from a quick look at your patches it seemed to me that you actually hold a
> > backing file reference from the file_pin structure thus even though the
> > application closes its file descriptor, the struct file (and thus the
> > lease) lives further until the file_pin gets released. And that should last
> > as long as the pages are pinned. Am I missing something?
> > 
> > > I thought about failing the munmap but that seemed wrong as well.  But more
> > > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > > passing...  This means that one could pin this memory, pass it to another
> > > process and exit.  The file lease on the pin'ed file is lost.
> > 
> > Not if file_pin grabs struct file reference as I mentioned above...
> >  
> > > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > > tracking keeps track of where that pin goes and which processes need to be
> > > killed to get rid of it.
> > 
> > I think having file lease being just a key to get the pin is conceptually
> > wrong. The lease is what expresses: "I'm accessing these blocks directly,
> > don't touch them without coordinating with me." So it would be only natural
> > if we maintained the lease while we are accessing blocks instead of
> > transferring this protection responsibility to another structure - namely
> > file_pin - and letting the lease go.
> 
> We do transfer that protection to the file_pin but we don't have to "let the
> lease" go.  We just keep the lease with the file_pin as you said.  See below...
> 
> > But maybe I miss some technical reason
> > why maintaining file lease is difficult. If that's the case, I'd like to hear
> > what...
> 
> Ok, I've thought a bit about what you said and indeed it should work that way.
> The reason I had to think a bit is that I was not sure why I thought we needed
> to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
> but still not good enough to allow this...
> 
> 1) I had a bug in the XFS code which should have failed rather than hanging...
>    So this was not a good reason...  And I was able to find/fix it...  Thanks!
> 
> 2) Second reason is that I thought I did not have a good way to tell if the
>    lease was actually in use.  What I mean is that letting the lease go should
>    be ok IFF we don't have any pins...  I was thinking that without John's code
>    we don't have a way to know if there are any pins...  But that is wrong...
>    All we have to do is check
> 
> 	!list_empty(file->file_pins)

Oops...  I got my "struct files" mixed up...  The RDMA struct file has the
file_pins hanging off it...  This will not work.

I'll have to try something else to prevent this.  However, I don't want to walk
all the pages of the inode.

Also I'm concerned about just failing if they happen to be pinned.  They need
to be LONGTERM pinned...  Otherwise we might have a transient failure of an
unlock based on some internal kernel transient pin...  :-/

Ira

> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).
> 
> I'm going to add a patch to fail releasing the lease and remove this (item 4)
> as part of the overall solution.
> 
> >  
> > > > > 5) Closing the file is ok.
> > > > > 
> > > > > 6) Unmapping the file is ok
> > > > > 
> > > > > 7) Pins against the files are tracked back to an owning file or an owning mm
> > > > >    depending on the internal subsystem needs.  With RDMA there is an owning
> > > > >    file which is related to the pined file.
> > > > > 
> > > > > 8) Only RDMA is currently supported
> > > > 
> > > > If you currently only need "owning file" variant in your patch set, then
> > > > I'd just implement that and leave "owning mm" variant for later if it
> > > > proves to be necessary. The things are complex enough as is...
> > > 
> > > I can do that...  I was trying to get io_uring working as well with the
> > > owning_mm but I should save that for later.
> > 
> > Ah, OK. Yes, I guess io_uring can be next step.
> 
> FWIW I have split the mm_struct stuff out.  I can keep it as a follow on series
> for other users later.  At this point I have to solve the issue Jason brought
> up WRT the RDMA file reference counting.
> 
> Thanks!
> Ira
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-16 19:05       ` Ira Weiny
  2019-08-16 23:20         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -) Ira Weiny
@ 2019-08-17  2:26         ` Dave Chinner
  2019-08-19  6:34           ` Jan Kara
  1 sibling, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-17  2:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> 2) Second reason is that I thought I did not have a good way to tell if the
>    lease was actually in use.  What I mean is that letting the lease go should
>    be ok IFF we don't have any pins...  I was thinking that without John's code
>    we don't have a way to know if there are any pins...  But that is wrong...
>    All we have to do is check
> 
> 	!list_empty(file->file_pins)
> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).

I really, really dislike the idea of zombie layout leases. It's a
nasty hack for poor application behaviour. This is a "we allow use
after layout lease release" API, and I think encoding largely
untraceable zombie objects into an API is very poor design.

From the fcntl man page:

LEASES
	Leases are associated with an open file description (see
	open(2)).  This means that duplicate file descriptors
	(created by, for example, fork(2) or dup(2)) refer to
	the same lease, and this lease may be modified or
	released using any of these descriptors.  Furthermore, the
	lease is released by either an explicit F_UNLCK operation on
	any of these duplicate file descriptors, or when all such
	file descriptors have been closed.

Leases are associated with *open* file descriptors, not the
lifetime of the struct file in the kernel. If the application closes
the open fds that refer to the lease, then the kernel does not
guarantee, and the application has no right to expect, that the
lease remains active in any way once the application closes all
direct references to the lease.

IOWs, applications using layout leases need to hold the lease fd
open for as long as they want access to the physical file layout. It
is also a requirement of the layout lease that the holder releases
the resources it holds on the layout before it releases the layout
lease, exclusive lease or not. Closing the fd indicates they do not
need access to the file any more, and so the lease should be
reclaimed at that point.

I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released.
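
The man page semantics quoted above can be observed from userspace. The
following is a minimal sketch, not part of this series: it uses an ordinary
F_WRLCK lease from fcntl(2) (the F_LAYOUT flag proposed here is not assumed
to be available), and the function name is made up for illustration. It takes
a lease on one descriptor, checks that a dup()'d descriptor sees the same
lease, and releases it through the duplicate.

```c
#define _GNU_SOURCE	/* F_SETLEASE / F_GETLEASE are Linux extensions */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name).
 *
 * Returns 1 if a lease taken on one descriptor is visible through, and
 * releasable from, a dup()'d descriptor; 0 if it is not; -1 if the lease
 * could not be set up at all (for example, the filesystem or sandbox does
 * not support leases).
 */
int lease_follows_open_file_description(void)
{
	char path[] = "/tmp/lease-demo-XXXXXX";
	int fd, fd2, result = -1;

	fd = mkstemp(path);	/* sole opener, file owned by the caller */
	if (fd < 0)
		return -1;

	if (fcntl(fd, F_SETLEASE, F_WRLCK) == 0) {
		fd2 = dup(fd);	/* same open file description as fd */
		if (fd2 >= 0) {
			/* The duplicate refers to the same lease ... */
			int seen = fcntl(fd2, F_GETLEASE) == F_WRLCK;
			/* ... and releasing via the duplicate drops it for fd too. */
			int dropped = fcntl(fd2, F_SETLEASE, F_UNLCK) == 0 &&
				      fcntl(fd, F_GETLEASE) == F_UNLCK;

			result = seen && dropped;
			close(fd2);
		}
	}

	close(fd);
	unlink(path);
	return result;
}
```

The point of the sketch is only that the lease belongs to the open file
description: any descriptor referring to it can inspect or release the lease,
and nothing in the documented API promises that the lease outlives the last
such descriptor.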

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-17  2:26         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Dave Chinner
@ 2019-08-19  6:34           ` Jan Kara
  2019-08-19  9:24             ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: Jan Kara @ 2019-08-19  6:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, Jan Kara, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >    lease was actually in use.  What I mean is that letting the lease go should
> >    be ok IFF we don't have any pins...  I was thinking that without John's code
> >    we don't have a way to know if there are any pins...  But that is wrong...
> >    All we have to do is check
> > 
> > 	!list_empty(file->file_pins)
> > 
> > So now with this detail I think you are right, we should be able to hold the
> > lease through the struct file even if the process no longer has any
> > "references" to it (ie closes and munmaps the file).
> 
> I really, really dislike the idea of zombie layout leases. It's a
> nasty hack for poor application behaviour. This is a "we allow use
> after layout lease release" API, and I think encoding largely
> untraceable zombie objects into an API is very poor design.
> 
> From the fcntl man page:
> 
> LEASES
> 	Leases are associated with an open file description (see
> 	open(2)).  This means that duplicate file descriptors
> 	(created by, for example, fork(2) or dup(2)) refer to
> 	the same lease, and this lease may be modified or
> 	released using any of these descriptors.  Furthermore, the
> 	lease is released by either an explicit F_UNLCK operation on
> 	any of these duplicate file descriptors, or when all such
> 	file descriptors have been closed.
> 
> Leases are associated with *open* file descriptors, not the
> lifetime of the struct file in the kernel. If the application closes
> the open fds that refer to the lease, then the kernel does not
> guarantee, and the application has no right to expect, that the
> lease remains active in any way once the application closes all
> direct references to the lease.
> 
> IOWs, applications using layout leases need to hold the lease fd
> open for as long as they want access to the physical file layout. It
> is also a requirement of the layout lease that the holder releases
> the resources it holds on the layout before it releases the layout
> lease, exclusive lease or not. Closing the fd indicates they do not
> need access to the file any more, and so the lease should be
> reclaimed at that point.
> 
> I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released.

Yeah, so this was my initial thought as well [1]. But as the discussion in
that thread revealed, the problem with blocking last close is that the kernel
does not really expect close to block. You could easily deadlock e.g. if
the process gets SIGKILL, file with lease has fd 10, and the RDMA context
holding pages pinned has fd 15 (fds are closed in ascending order on exit,
so the lease fd would block before the RDMA fd is ever released). Or you
could wait for another process to release page pins and blocking SIGKILL on
that is also bad. So in the end the least bad solution we've come up with
was these "zombie" leases as you call them and tracking them in /proc so
that userspace at least has a way of seeing them. But if you can come up
with a different solution, I'm certainly not attached to the current one...

								Honza

[1] https://lore.kernel.org/lkml/20190606104203.GF7433@quack2.suse.cz
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)
  2019-08-16 23:20         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -) Ira Weiny
@ 2019-08-19  6:36           ` Jan Kara
  0 siblings, 0 replies; 110+ messages in thread
From: Jan Kara @ 2019-08-19  6:36 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Michal Hocko, Theodore Ts'o, linux-nvdimm,
	linux-rdma, John Hubbard, Dave Chinner, linux-kernel,
	Matthew Wilcox, linux-xfs, Jason Gunthorpe, linux-mm,
	linux-fsdevel, Andrew Morton, linux-ext4

On Fri 16-08-19 16:20:07, Ira Weiny wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > Hello!
> > > > > 
> > > > > On Fri 09-08-19 15:58:14, ira.weiny@intel.com wrote:
> > > > > > Pre-requisites
> > > > > > ==============
> > > > > > 	Based on mmotm tree.
> > > > > > 
> > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > > > 
> > > > > > Solution summary
> > > > > > ================
> > > > > > 
> > > > > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > > > > memory which is then truncated.  So really any solution we present which:
> > > > > > 
> > > > > > A) Prevents file system corruption or data leaks
> > > > > > ...and...
> > > > > > B) Informs the user that they did something wrong
> > > > > > 
> > > > > > Should be an acceptable solution.
> > > > > > 
> > > > > > Because this is slightly new behavior.  And because this is going to be
> > > > > > specific to DAX (because of the lack of a page cache) we have made the user
> > > > > > "opt in" to this behavior.
> > > > > > 
> > > > > > The following patches implement the following solution.
> > > > > > 
> > > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > > 
> > > > > > 1) The user has to opt in to allowing page pins on a file with an exclusive
> > > > > >    layout lease.  Both exclusive and layout lease flags are user visible now.
> > > > > > 
> > > > > > 2) page pins will fail if the lease is not active when the file back page is
> > > > > >    encountered.
> > > > > > 
> > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > > > 
> > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > > > page which has pincount increased? If the first then I'd rephrase this to
> > > > > be less ambiguous, if the second then I think it is wrong. 
> > > > 
> > > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > > > pincount processing will hang the truncate.  Given the discussion with John H
> > > > we can make this a bit better if we use something like FOLL_PIN and the page
> > > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > > outright.  but that is not done yet.
> > > > 
> > > > so... I used the word "fail" to be a bit more vague as the final implementation
> > > > may return ETXTBSY or hang as noted.
> > > 
> > > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > > makes sure they actually fail with ETXTBSY, doesn't it? The thing is that
> > > e.g. DIO will use page pins as well for its buffers and we must wait there
> > > until the pin is released. So please just clarify your 'fail' here a bit
> > > :).
> > 
> > It will fail with ETXTBSY.  I've fixed a bug...  See below.
> > 
> > > 
> > > > > > 4) The user has the option of holding the lease or releasing it.  If they
> > > > > >    release it no other pin calls will work on the file.
> > > > > 
> > > > > Last time we spoke the plan was that the lease is kept while the pages are
> > > > > pinned (and an attempt to release the lease would block until the pages are
> > > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > > truncate and hole punch fail with ETXTBSY and the file_pin structure is
> > > > > just an implementation detail how the existence is efficiently tracked (and
> > > > > what keeps the backing file for the pages open so that the lease does not
> > > > > get auto-destroyed). Why did you change this?
> > > > 
> > > > closing the file _and_ unmaping it will cause the lease to be released
> > > > regardless of if we allow this or not.
> > > > 
> > > > As we discussed preventing the close seemed intractable.
> > > 
> > > Yes, preventing the application from closing the file is difficult. But
> > > from a quick look at your patches it seemed to me that you actually hold a
> > > backing file reference from the file_pin structure thus even though the
> > > application closes its file descriptor, the struct file (and thus the
> > > lease) lives further until the file_pin gets released. And that should last
> > > as long as the pages are pinned. Am I missing something?
> > > 
> > > > I thought about failing the munmap but that seemed wrong as well.  But more
> > > > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > > > passing...  This means that one could pin this memory, pass it to another
> > > > process and exit.  The file lease on the pin'ed file is lost.
> > > 
> > > Not if file_pin grabs struct file reference as I mentioned above...
> > >  
> > > > The file lease is just a key to get the memory pin.  Once unlocked the procfs
> > > > tracking keeps track of where that pin goes and which processes need to be
> > > > killed to get rid of it.
> > > 
> > > I think having file lease being just a key to get the pin is conceptually
> > > wrong. The lease is what expresses: "I'm accessing these blocks directly,
> > > don't touch them without coordinating with me." So it would be only natural
> > > if we maintained the lease while we are accessing blocks instead of
> > > transferring this protection responsibility to another structure - namely
> > > file_pin - and letting the lease go.
> > 
> > We do transfer that protection to the file_pin but we don't have to "let the
> > lease" go.  We just keep the lease with the file_pin as you said.  See below...
> > 
> > > But maybe I miss some technical reason
> > > why maintaining file lease is difficult. If that's the case, I'd like to hear
> > > what...
> > 
> > Ok, I've thought a bit about what you said and indeed it should work that way.
> > The reason I had to think a bit is that I was not sure why I thought we needed
> > to hang...  Turns out there were a couple of reasons...  1 not so good and 1 ok
> > but still not good enough to allow this...
> > 
> > 1) I had a bug in the XFS code which should have failed rather than hanging...
> >    So this was not a good reason...  And I was able to find/fix it...  Thanks!
> > 
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >    lease was actually in use.  What I mean is that letting the lease go should
> >    be ok IFF we don't have any pins...  I was thinking that without John's code
> >    we don't have a way to know if there are any pins...  But that is wrong...
> >    All we have to do is check
> > 
> > 	!list_empty(file->file_pins)
> 
> Oops...  I got my "struct files" mixed up...  The RDMA struct file has the
> file_pins hanging off it...  This will not work.
> 
> I'll have to try something else to prevent this.  However, I don't want to walk
> all the pages of the inode.
> 
> Also I'm concerned about just failing if they happen to be pinned.  They need
> to be LONGTERM pinned...  Otherwise we might have a transient failure of an
> unlock based on some internal kernel transient pin...  :-/

My solution for this was that file_pin would contain a counter of pinned
pages which vaddr_pin_pages() would increment and vaddr_unpin_pages() would
decrement. Checking whether there's any outstanding pinned page attached to
the file_pin is then trivial...
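
The counter idea can be sketched as a small userspace model. This is only an
illustration under assumptions: the struct layout, field name and helper
names below are hypothetical and are not taken from the patch set; only the
idea of incrementing in the pin path and decrementing in the unpin path comes
from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-file pin tracking structure. */
struct file_pin_model {
	atomic_long nr_pinned;	/* pages currently pinned via this file_pin */
};

/* Models the accounting the pin path (vaddr_pin_pages()) would do. */
void model_pin_pages(struct file_pin_model *fp, long nr_pages)
{
	atomic_fetch_add(&fp->nr_pinned, nr_pages);
}

/* Models the accounting the unpin path (vaddr_unpin_pages()) would do. */
void model_unpin_pages(struct file_pin_model *fp, long nr_pages)
{
	long old = atomic_fetch_sub(&fp->nr_pinned, nr_pages);

	/* Unpinning more pages than were pinned would be an accounting bug. */
	assert(old >= nr_pages);
}

/* Releasing the lease is only safe once no pinned pages remain. */
bool model_lease_release_allowed(struct file_pin_model *fp)
{
	return atomic_load(&fp->nr_pinned) == 0;
}
```

Because only the long-term pin path would touch such a counter, a check like
this should not trip over transient kernel-internal pins, which is the concern
raised in the quoted mail; whether that holds in the real code depends on
every long-term pin going through these helpers.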

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-19  6:34           ` Jan Kara
@ 2019-08-19  9:24             ` Dave Chinner
  2019-08-19 12:38               ` Jason Gunthorpe
  2019-08-20  0:05               ` John Hubbard
  0 siblings, 2 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-19  9:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ira Weiny, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > 2) Second reason is that I thought I did not have a good way to tell if the
> > >    lease was actually in use.  What I mean is that letting the lease go should
> > >    be ok IFF we don't have any pins...  I was thinking that without John's code
> > >    we don't have a way to know if there are any pins...  But that is wrong...
> > >    All we have to do is check
> > > 
> > > 	!list_empty(file->file_pins)
> > > 
> > > So now with this detail I think you are right, we should be able to hold the
> > > lease through the struct file even if the process no longer has any
> > > "references" to it (ie closes and munmaps the file).
> > 
> > I really, really dislike the idea of zombie layout leases. It's a
> > nasty hack for poor application behaviour. This is a "we allow use
> > after layout lease release" API, and I think encoding largely
> > untraceable zombie objects into an API is very poor design.
> > 
> > From the fcntl man page:
> > 
> > LEASES
> > 	Leases are associated with an open file description (see
> > 	open(2)).  This means that duplicate file descriptors
> > 	(created by, for example, fork(2) or dup(2)) refer to
> > 	the same lease, and this lease may be modified or
> > 	released using any of these descriptors.  Furthermore, the
> > 	lease is released by either an explicit F_UNLCK operation on
> > 	any of these duplicate file descriptors, or when all such
> > 	file descriptors have been closed.
> > 
> > Leases are associated with *open* file descriptors, not the
> > lifetime of the struct file in the kernel. If the application closes
> > the open fds that refer to the lease, then the kernel does not
> > guarantee, and the application has no right to expect, that the
> > lease remains active in any way once the application closes all
> > direct references to the lease.
> > 
> > IOWs, applications using layout leases need to hold the lease fd
> > open for as long as they want access to the physical file layout. It
> > is also a requirement of the layout lease that the holder releases
> > the resources it holds on the layout before it releases the layout
> > lease, exclusive lease or not. Closing the fd indicates they do not
> > need access to the file any more, and so the lease should be
> > reclaimed at that point.
> > 
> > I'm of a mind to make the last close() on a file block if there's an
> > active layout lease to prevent processes from zombie-ing layout
> > leases like this. i.e. you can't close the fd until resources that
> > pin the lease have been released.
> 
> Yeah, so this was my initial thought as well [1]. But as the discussion in
> that thread revealed, the problem with blocking last close is that the kernel
> does not really expect close to block. You could easily deadlock e.g. if
> the process gets SIGKILL, file with lease has fd 10, and the RDMA context
> holding pages pinned has fd 15.

Sure, I did think about this a bit before suggesting it :)

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_work_add() and
task_work_run().

task_work_add() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
    task_work_run()
      __fput()
	locks_remove_file()
	  locks_remove_lease()
	    ....

It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases....

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
    get_signal()
      task_work_run()
        __fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal....

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
  exit_task_work()
    task_work_run()
      __fput()

and so it's easy to avoid blocking in this case, too.

So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this
context.  Blocking in an interruptible state will allow fatal signal
delivery to wake us, and then we fall into the
fatal_signal_pending() case if we get a SIGKILL while blocking.

Hence I think blocking in this case would be OK - it indicates an
application bug (releasing a lease before releasing the resources)
but leaves SIGKILL available to administrators to resolve situations
involving buggy applications.

This requires applications to follow the rules: any process
that pins physical resources must have an active reference to a
layout lease, either via a duplicated fd or its own private lease.
If the app doesn't play by the rules, it hangs in close() until it
is killed.
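
The call paths walked through above can be condensed into a small decision
model. This is a userspace sketch only, with hypothetical type and function
names; the three booleans stand in for the kernel-side checks named in the
text (PF_EXITING, fatal_signal_pending(), and whether resources still pin the
layout lease), and it is not code from any posted patch.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state seen when the last fd reference drops. */
struct close_ctx {
	bool task_exiting;	/* models current->flags & PF_EXITING */
	bool fatal_signal;	/* models fatal_signal_pending(current) */
	bool lease_has_pins;	/* resources still pin the layout lease */
};

enum close_action {
	CLOSE_DONT_BLOCK,	/* tear the lease down and carry on */
	CLOSE_BLOCK_KILLABLE,	/* sleep until unpinned or SIGKILL arrives */
};

enum close_action last_close_policy(const struct close_ctx *c)
{
	/* Nothing pins the lease: an ordinary close, never blocks. */
	if (!c->lease_has_pins)
		return CLOSE_DONT_BLOCK;

	/* do_exit() path or pending SIGKILL: the task is being reaped. */
	if (c->task_exiting || c->fatal_signal)
		return CLOSE_DONT_BLOCK;

	/* Plain close() with pins outstanding: an application bug, so block. */
	return CLOSE_BLOCK_KILLABLE;
}
```

A killable wait matters here: if SIGKILL arrives while the task sleeps in
CLOSE_BLOCK_KILLABLE, re-evaluating the policy with fatal_signal set yields
CLOSE_DONT_BLOCK, which is the escape hatch left to administrators.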

> Or you could wait for another process to
> release page pins and blocking SIGKILL on that is also bad.

Again, each individual process that pins pages from the layout must
have its own active layout lease reference.

> So in the end
> the least bad solution we've come up with was these "zombie" leases as you
> call them and tracking them in /proc so that userspace at least has a way
> of seeing them. But if you can come up with a different solution, I'm
> certainly not attached to the current one...

It might be the "least bad" solution, but it's still a pretty bad
one. And one that I don't think is necessary if we simply enforce
the "process must have active references for the entire time the
process uses the resource" rule. That's the way file access has
always worked, I don't see why we should be doing anything different
for access to the physical layout of files...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-19  9:24             ` Dave Chinner
@ 2019-08-19 12:38               ` Jason Gunthorpe
  2019-08-19 21:53                 ` Ira Weiny
  2019-08-20  1:12                 ` Dave Chinner
  2019-08-20  0:05               ` John Hubbard
  1 sibling, 2 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-19 12:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:

> So that leaves just the normal close() syscall exit case, where the
> application has full control of the order in which resources are
> released. We've already established that we can block in this
> context.  Blocking in an interruptible state will allow fatal signal
> delivery to wake us, and then we fall into the
> fatal_signal_pending() case if we get a SIGKILL while blocking.

The major problem with RDMA is that it doesn't always wait on close() for the
MR holding the page pins to be destroyed. This is done to avoid a
deadlock of the form:

   uverbs_destroy_ufile_hw()
      mutex_lock()
       [..]
        mmput()
         exit_mmap()
          remove_vma()
           fput();
            file_operations->release()
             ib_uverbs_close()
              uverbs_destroy_ufile_hw()
               mutex_lock()   <-- Deadlock

But, as I said to Ira earlier, I wonder if this is now impossible on
modern kernels and we can switch to making the whole thing
synchronous. That would resolve RDMA's main problem with this.

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-19 12:38               ` Jason Gunthorpe
@ 2019-08-19 21:53                 ` Ira Weiny
  2019-08-20  1:12                 ` Dave Chinner
  1 sibling, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-19 21:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destroyed. This is done to avoid a
> deadlock of the form:
> 
>    uverbs_destroy_ufile_hw()
>       mutex_lock()
>        [..]
>         mmput()
>          exit_mmap()
>           remove_vma()
>            fput();
>             file_operations->release()
>              ib_uverbs_close()
>               uverbs_destroy_ufile_hw()
>                mutex_lock()   <-- Deadlock
> 
> But, as I said to Ira earlier, I wonder if this is now impossible on
> modern kernels and we can switch to making the whole thing
> synchronous. That would resolve RDMA's main problem with this.

I'm still looking into this...  but my bigger concern is that the RDMA FD can
be passed to other processes via SCM_RIGHTS.  Which means the process holding
the pin may _not_ be the one with the open file and layout lease...
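
For reference, a minimal userspace sketch of the SCM_RIGHTS hand-off, not
taken from the series.  It assumes a connected AF_UNIX socket; the helper
names send_fd(), recv_fd() and demo_fd_outlives_sender() are made up for
illustration, and a pipe stands in for the RDMA context FD.  It shows that
the receiver's descriptor keeps the underlying open file alive after the
sender has closed its own copy:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/*
 * The "sender" passes the write end of a pipe and then closes its own
 * copy; the "receiver" can still write through the fd it was handed.
 * Returns 0 if the data arrived, -1 otherwise.
 */
int demo_fd_outlives_sender(void)
{
	int sv[2], pfd[2], received = -1, ok = 0;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(pfd) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	if (send_fd(sv[0], pfd[1]) == 0) {
		close(pfd[1]);		/* sender drops its reference */
		pfd[1] = -1;
		received = recv_fd(sv[1]);
	}
	if (received >= 0) {
		ok = write(received, "x", 1) == 1 &&
		     read(pfd[0], &c, 1) == 1 && c == 'x';
		close(received);
	}

	if (pfd[1] >= 0)
		close(pfd[1]);
	close(pfd[0]);
	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

The same holds across processes: whoever received the FD keeps whatever
that open file pins, regardless of what the original opener does.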

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-19  9:24             ` Dave Chinner
  2019-08-19 12:38               ` Jason Gunthorpe
@ 2019-08-20  0:05               ` John Hubbard
  2019-08-20  1:20                 ` Dave Chinner
  1 sibling, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-20  0:05 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Ira Weiny, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/19/19 2:24 AM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
...
> The last close is an interesting case because the __fput() call
> actually runs from task_work() context, not where the last reference
> is actually dropped. So it already has certain specific interactions
> with signals and task exit processing via task_work_add() and
> task_work_run().
> 
> task_work_add() calls set_notify_resume(task), so if nothing else
> triggers when returning to userspace we run this path:
> 
> exit_to_usermode_loop()
>    tracehook_notify_resume()
>      task_work_run()
>        __fput()
> 	locks_remove_file()
> 	  locks_remove_lease()
> 	    ....
> 
> It's worth noting that locks_remove_lease() does a
> percpu_down_read() which means we can already block in this context
> removing leases....
> 
> If there is a signal pending, the task work is run this way (before
> the above notify path):
> 
> exit_to_usermode_loop()
>    do_signal()
>      get_signal()
>        task_work_run()
>          __fput()
> 
> We can detect this case via signal_pending() and even SIGKILL via
> fatal_signal_pending(), and so we can decide not to block based on
> the fact the process is about to be reaped and so the lease largely
> doesn't matter anymore. I'd argue that it is close and we can't
> easily back out, so we'd only break the block on a fatal signal....
> 
> And then, of course, is the call path through do_exit(), which has
> the PF_EXITING task flag set:
> 
> do_exit()
>    exit_task_work()
>      task_work_run()
>        __fput()
> 
> and so it's easy to avoid blocking in this case, too.

Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
memory with FOLL_LONGTERM, and wondering how to make that work here.

These are close to files, in how they're handled, but just different
enough that it's not clear to me how to make them work with this system.


> 
> So that leaves just the normal close() syscall exit case, where the
> application has full control of the order in which resources are
> released. We've already established that we can block in this
> context.  Blocking in an interruptible state will allow fatal signal
> delivery to wake us, and then we fall into the
> fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> Hence I think blocking in this case would be OK - it indicates an
> application bug (releasing a lease before releasing the resources)
> but leaves SIGKILL available to administrators to resolve situations
> involving buggy applications.
> 
> This requires applications to follow the rules: any process
> that pins physical resources must have an active reference to a
> layout lease, either via a duplicated fd or its own private lease.
> If the app doesn't play by the rules, it hangs in close() until it
> is killed.

+1 for these rules, assuming that we can make them work. They are
easy to explain and intuitive.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-19 12:38               ` Jason Gunthorpe
  2019-08-19 21:53                 ` Ira Weiny
@ 2019-08-20  1:12                 ` Dave Chinner
  2019-08-20 11:55                   ` Jason Gunthorpe
  1 sibling, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-20  1:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destroyed. This is done to avoid a
> deadlock of the form:
> 
>    uverbs_destroy_ufile_hw()
>       mutex_lock()
>        [..]
>         mmput()
>          exit_mmap()
>           remove_vma()
>            fput();
>             file_operations->release()

I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.

		fput()
		  fput_many()
		    task_work_add(f, __fput())

and the call chain ends there.

Before the syscall returns to userspace, it then runs the __fput()
call through the task_work_run() interfaces, and hence the call
chain is just:

	task_work_run
	  __fput
>             file_operations->release()
>              ib_uverbs_close()
>               uverbs_destroy_ufile_hw()
>                mutex_lock()   <-- Deadlock

And there is no deadlock because nothing holds the mutex at this
point.
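
A toy userspace model of that argument, under loud assumptions: a bool
stands in for the ufile mutex, a single-slot pointer stands in for the
task_work list, and every model_* name is hypothetical rather than a kernel
function.  Where a real mutex_lock() would hang, the model only records
that it found the lock already held:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*work_fn)(void);

static bool lock_held;		/* stands in for the ufile mutex */
static bool deadlock_seen;	/* set when release finds the lock taken */
static work_fn pending;		/* single-slot stand-in for task_work */

/* Models file_operations->release() wanting the same mutex. */
static void model_release(void)
{
	if (lock_held) {
		/* A real mutex_lock() would block forever here. */
		deadlock_seen = true;
		return;
	}
	lock_held = true;
	/* ... teardown would happen here ... */
	lock_held = false;
}

/* Models the deferred work running on the way back to userspace. */
static void model_task_work_run(void)
{
	work_fn fn = pending;

	pending = NULL;
	if (fn)
		fn();
}

/*
 * Models the destroy path dropping the last file reference while it
 * holds the mutex.  With deferred == false the release runs inline
 * (the old deadlock); with deferred == true it is queued and only
 * runs after the mutex has been dropped.
 */
bool model_destroy_deadlocks(bool deferred)
{
	assert(!lock_held);	/* the model does not support nesting */
	deadlock_seen = false;
	pending = NULL;

	lock_held = true;
	if (deferred)
		pending = model_release;	/* fput() only queues */
	else
		model_release();		/* fput() releases inline */
	lock_held = false;

	model_task_work_run();
	return deadlock_seen;
}
```

model_destroy_deadlocks(false) reports the deadlock and
model_destroy_deadlocks(true) does not, which is the whole difference
between the two call chains.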

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20  0:05               ` John Hubbard
@ 2019-08-20  1:20                 ` Dave Chinner
  2019-08-20  3:09                   ` John Hubbard
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-20  1:20 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> On 8/19/19 2:24 AM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> ...
> > The last close is an interesting case because the __fput() call
> > actually runs from task_work() context, not where the last reference
> > is actually dropped. So it already has certain specific interactions
> > with signals and task exit processing via task_work_add() and
> > task_work_run().
> > 
> > task_work_add() calls set_notify_resume(task), so if nothing else
> > triggers when returning to userspace we run this path:
> > 
> > exit_to_usermode_loop()
> >    tracehook_notify_resume()
> >      task_work_run()
> >        __fput()
> > 	locks_remove_file()
> > 	  locks_remove_lease()
> > 	    ....
> > 
> > It's worth noting that locks_remove_lease() does a
> > percpu_down_read() which means we can already block in this context
> > removing leases....
> > 
> > If there is a signal pending, the task work is run this way (before
> > the above notify path):
> > 
> > exit_to_usermode_loop()
> >    do_signal()
> >      get_signal()
> >        task_work_run()
> >          __fput()
> > 
> > We can detect this case via signal_pending() and even SIGKILL via
> > fatal_signal_pending(), and so we can decide not to block based on
> > the fact the process is about to be reaped and so the lease largely
> > doesn't matter anymore. I'd argue that it is close and we can't
> > easily back out, so we'd only break the block on a fatal signal....
> > 
> > And then, of course, is the call path through do_exit(), which has
> > the PF_EXITING task flag set:
> > 
> > do_exit()
> >    exit_task_work()
> >      task_work_run()
> >        __fput()
> > 
> > and so it's easy to avoid blocking in this case, too.
> 
> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> memory with FOLL_LONGTERM, and wondering how to make that work here.

I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?

> These are close to files, in how they're handled, but just different
> enough that it's not clear to me how to make them work with this system.

I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20  1:20                 ` Dave Chinner
@ 2019-08-20  3:09                   ` John Hubbard
  2019-08-20  3:36                     ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-20  3:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/19/19 6:20 PM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
>> On 8/19/19 2:24 AM, Dave Chinner wrote:
>>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
>> ...
>>
>> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
>> memory with FOLL_LONGTERM, and wondering how to make that work here.
> 
> I'm not sure how this interacts with file mappings? I mean, this
> is just pinning anonymous pages for direct data placement into
> userspace, right?
> 
> Are you asking "what if this pinned memory was a file mapping?",
> or something else?

Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
already there in xdp_umem_pin_pages(), unconditionally. So the
simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
not set) are not going to work here.


> 
>> These are close to files, in how they're handled, but just different
>> enough that it's not clear to me how to make them work with this system.
> 
> I'm guessing that if they are pinning a file backed mapping, they
> are trying to dma direct to the file (zero copy into page cache?)
> and so they'll need to either play by ODP rules or take layout
> leases, too....
> 

OK. I was just wondering if there was some simple way to dig up a
struct file associated with a socket (I don't think so), but it sounds
like this is an exercise that's potentially different for each subsystem.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20  3:09                   ` John Hubbard
@ 2019-08-20  3:36                     ` Dave Chinner
  2019-08-21 18:43                       ` John Hubbard
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-20  3:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> On 8/19/19 6:20 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > ...
> > > 
> > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> > > memory with FOLL_LONGTERM, and wondering how to make that work here.
> > 
> > I'm not sure how this interacts with file mappings? I mean, this
> > is just pinning anonymous pages for direct data placement into
> > userspace, right?
> > 
> > Are you asking "what if this pinned memory was a file mapping?",
> > or something else?
> 
> Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
> already there in xdp_umem_pin_pages(), unconditionally. So the
> simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
> not set) are not going to work here.
> 
> 
> > 
> > > These are close to files, in how they're handled, but just different
> > > enough that it's not clear to me how to make them work with this system.
> > 
> > I'm guessing that if they are pinning a file backed mapping, they
> > are trying to dma direct to the file (zero copy into page cache?)
> > and so they'll need to either play by ODP rules or take layout
> > leases, too....
> > 
> 
> OK. I was just wondering if there was some simple way to dig up a
> struct file associated with a socket (I don't think so), but it sounds
> like this is an exercise that's potentially different for each subsystem.

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is the equivalent of the RDMA handle
that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of what the mapping represents, hence the need to
hold a layout lease if it's a file mapping.

So from the filesystem perspective, there's no difference between
XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20  1:12                 ` Dave Chinner
@ 2019-08-20 11:55                   ` Jason Gunthorpe
  2019-08-21 18:02                     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-20 11:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > 
> > > So that leaves just the normal close() syscall exit case, where the
> > > application has full control of the order in which resources are
> > > released. We've already established that we can block in this
> > > context.  Blocking in an interruptible state will allow fatal signal
> > > delivery to wake us, and then we fall into the
> > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > 
> > The major problem with RDMA is that it doesn't always wait on close() for the
> > MR holding the page pins to be destroyed. This is done to avoid a
> > deadlock of the form:
> > 
> >    uverbs_destroy_ufile_hw()
> >       mutex_lock()
> >        [..]
> >         mmput()
> >          exit_mmap()
> >           remove_vma()
> >            fput();
> >             file_operations->release()
> 
> I think this is wrong, and I'm pretty sure it's an example of why
> the final __fput() call is moved out of line.

Yes, I think so too, all I can say is this *used* to happen, as we
have special code avoiding it, which is the code that is messing up
Ira's lifetime model.

Ira, you could try unraveling the special locking, that solves your
lifetime issues?

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20 11:55                   ` Jason Gunthorpe
@ 2019-08-21 18:02                     ` Ira Weiny
  2019-08-21 18:13                       ` Jason Gunthorpe
  2019-08-23  0:59                       ` Dave Chinner
  0 siblings, 2 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-21 18:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > 
> > > > So that leaves just the normal close() syscall exit case, where the
> > > > application has full control of the order in which resources are
> > > > released. We've already established that we can block in this
> > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > delivery to wake us, and then we fall into the
> > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > 
> > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > MR holding the page pins to be destroyed. This is done to avoid a
> > > deadlock of the form:
> > > 
> > >    uverbs_destroy_ufile_hw()
> > >       mutex_lock()
> > >        [..]
> > >         mmput()
> > >          exit_mmap()
> > >           remove_vma()
> > >            fput();
> > >             file_operations->release()
> > 
> > I think this is wrong, and I'm pretty sure it's an example of why
> > the final __fput() call is moved out of line.
> 
> Yes, I think so too, all I can say is this *used* to happen, as we
> have special code avoiding it, which is the code that is messing up
> Ira's lifetime model.
> 
> Ira, you could try unraveling the special locking, that solves your
> lifetime issues?

Yes I will try to prove this out...  But I'm still not sure this fully solves
the problem.

This only ensures that the process which has the RDMA context (RDMA FD) is safe
with regard to hanging the close for the "data file FD" (the file which has
pinned pages) in that _same_ process.  But what about this scenario:

Process A has the RDMA context FD and data file FD (with lease) open.

Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Process A attempts to exit (hangs because data file FD is pinned).

Admin kills process A.  kill works because we have allowed for it...

Process B _still_ has the RDMA context FD open _and_ therefore still holds the
file pins.

Truncation still fails.

Admin does not know which process is holding the pin.

What am I missing?

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:02                     ` Ira Weiny
@ 2019-08-21 18:13                       ` Jason Gunthorpe
  2019-08-21 18:22                         ` John Hubbard
  2019-08-21 18:57                         ` Ira Weiny
  2019-08-23  0:59                       ` Dave Chinner
  1 sibling, 2 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-21 18:13 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >    uverbs_destroy_ufile_hw()
> > > >       mutex_lock()
> > > >        [..]
> > > >         mmput()
> > > >          exit_mmap()
> > > >           remove_vma()
> > > >            fput();
> > > >             file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.

Oh, I didn't think we were talking about that. Hanging the close of
the datafile fd contingent on some other FD's closure is a recipe for
deadlock..

IMHO the pin refcnt is held by the driver char dev FD, that is the
object you need to make it visible against.

Why not just have a single table someplace of all the layout leases
with the file they are held on and the FD/socket/etc that is holding
the pin? Make it independent of processes and FDs?
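
One possible shape for such a table, purely as a sketch: nothing below
exists in the kernel, and struct pin_entry, enum holder_kind and the
pin_table_*() helpers are invented names.  A fixed-size array keeps it
self-contained; file_id stands in for whatever identifies the leased file
and holder_id for the FD/socket/etc that holds the pin:

```c
#include <assert.h>
#include <stddef.h>

enum holder_kind { HOLDER_RDMA_CTX, HOLDER_SOCKET };

struct pin_entry {
	unsigned long file_id;	/* which file the layout lease is on */
	enum holder_kind kind;	/* what sort of object holds the pin */
	int holder_id;		/* which one of them */
	int in_use;
};

#define PIN_TABLE_SIZE 64
static struct pin_entry pin_table[PIN_TABLE_SIZE];

/* Record that a holder pins pages of a file; -1 if the table is full. */
int pin_table_add(unsigned long file_id, enum holder_kind kind, int holder_id)
{
	for (size_t i = 0; i < PIN_TABLE_SIZE; i++) {
		if (pin_table[i].in_use)
			continue;
		pin_table[i] = (struct pin_entry){
			.file_id = file_id, .kind = kind,
			.holder_id = holder_id, .in_use = 1,
		};
		return 0;
	}
	return -1;
}

/* Drop one matching record; -1 if there was none. */
int pin_table_remove(unsigned long file_id, enum holder_kind kind,
		     int holder_id)
{
	for (size_t i = 0; i < PIN_TABLE_SIZE; i++) {
		struct pin_entry *e = &pin_table[i];

		if (e->in_use && e->file_id == file_id &&
		    e->kind == kind && e->holder_id == holder_id) {
			e->in_use = 0;
			return 0;
		}
	}
	return -1;
}

/*
 * Answer "who holds pins on this file?": copy up to max matching
 * entries into out and return the total number of holders found.
 */
size_t pin_table_holders(unsigned long file_id, struct pin_entry *out,
			 size_t max)
{
	size_t found = 0;

	assert(out != NULL || max == 0);
	for (size_t i = 0; i < PIN_TABLE_SIZE; i++) {
		if (!pin_table[i].in_use || pin_table[i].file_id != file_id)
			continue;
		if (found < max)
			out[found] = pin_table[i];
		found++;
	}
	return found;
}
```

Because the lookup is keyed on the file rather than on a process, an admin
facing a failed truncate could ask for the holders of that file directly.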

Jason

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:13                       ` Jason Gunthorpe
@ 2019-08-21 18:22                         ` John Hubbard
  2019-08-21 18:57                         ` Ira Weiny
  1 sibling, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-08-21 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Ira Weiny
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/21/19 11:13 AM, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
>> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
>>> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
>>>> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
>>>>>
>>>>>> So that leaves just the normal close() syscall exit case, where the
>>>>>> application has full control of the order in which resources are
>>>>>> released. We've already established that we can block in this
>>>>>> context.  Blocking in an interruptible state will allow fatal signal
>>>>>> delivery to wake us, and then we fall into the
>>>>>> fatal_signal_pending() case if we get a SIGKILL while blocking.
>>>>>
>>>>> The major problem with RDMA is that it doesn't always wait on close() for the
>>>>> MR holding the page pins to be destroyed. This is done to avoid a
>>>>> deadlock of the form:
>>>>>
>>>>>     uverbs_destroy_ufile_hw()
>>>>>        mutex_lock()
>>>>>         [..]
>>>>>          mmput()
>>>>>           exit_mmap()
>>>>>            remove_vma()
>>>>>             fput();
>>>>>              file_operations->release()
>>>>
>>>> I think this is wrong, and I'm pretty sure it's an example of why
>>>> the final __fput() call is moved out of line.
>>>
>>> Yes, I think so too, all I can say is this *used* to happen, as we
>>> have special code avoiding it, which is the code that is messing up
>>> Ira's lifetime model.
>>>
>>> Ira, you could try unraveling the special locking, that solves your
>>> lifetime issues?
>>
>> Yes I will try to prove this out...  But I'm still not sure this fully solves
>> the problem.
>>
>> This only ensures that the process which has the RDMA context (RDMA FD) is safe
>> with regard to hanging the close for the "data file FD" (the file which has
>> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..
> 
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.


If you do that, it might make it a lot simpler to add lease support
to drivers like XDP, which is otherwise looking pretty difficult to
set up with an fd. (It's socket-based, and not immediately clear where
to connect up the fd.)


thanks,
-- 
John Hubbard
NVIDIA

> 
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-20  3:36                     ` Dave Chinner
@ 2019-08-21 18:43                       ` John Hubbard
  2019-08-21 19:09                         ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-21 18:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ira Weiny, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/19/19 8:36 PM, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
>> On 8/19/19 6:20 PM, Dave Chinner wrote:
>>> On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
>>>> On 8/19/19 2:24 AM, Dave Chinner wrote:
>>>>> On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
>>>>>> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
>>>>>>> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
>>>>>>>> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
>>>>>>>>> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
>>>>>>>>>> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
>>>> ...
> AFAIA, there is no struct file here - the memory that has been pinned
> is just something mapped into the application's address space.
> 
> It seems to me that the socket here is the equivalent of the RDMA handle
> that owns the hardware that pins the pages. Again, that RDMA
> handle is not aware of what the mapping represents, hence the need to
> hold a layout lease if it's a file mapping.
> 
> So from the filesystem perspective, there's no difference between
> XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
> into the filesystem's backing store and that will require use of
> layout leases to perform safely.
> 

OK, got it! Makes perfect sense.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:13                       ` Jason Gunthorpe
  2019-08-21 18:22                         ` John Hubbard
@ 2019-08-21 18:57                         ` Ira Weiny
  2019-08-21 19:06                           ` Ira Weiny
  2019-08-21 19:48                           ` Jason Gunthorpe
  1 sibling, 2 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-21 18:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >    uverbs_destroy_ufile_hw()
> > > > >       mutex_lock()
> > > > >        [..]
> > > > >         mmput()
> > > > >          exit_mmap()
> > > > >           remove_vma()
> > > > >            fput();
> > > > >             file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> 
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..

The discussion between Jan and Dave was concerning what happens when a user
calls

fd = open()
fcntl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)

Dave suggested:

"I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released."

	-- Dave https://lkml.org/lkml/2019/8/16/994

> 
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.

I'm sorry but what do you mean by "driver char dev FD"?

> 
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?

If it is independent of processes how will we know which process is blocking
the truncate?  Using a global table is an interesting idea but I still believe
the users are going to want to track this to specific processes.  It's not
clear to me how that would be done with a global table.

I agree the XDP/socket case is bothersome...  I was thinking that somewhere the
fd of the socket could be hooked up in this case.  But taking a look at it
reveals that is not going to be easy.  And I assume XDP has the same issue WRT
SCM_RIGHTS and the ability to share the xdp context?

Ira

> 
> Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:57                         ` Ira Weiny
@ 2019-08-21 19:06                           ` Ira Weiny
  2019-08-21 19:48                           ` Jason Gunthorpe
  1 sibling, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-21 19:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 11:57:03AM -0700, 'Ira Weiny' wrote:
> On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >    uverbs_destroy_ufile_hw()
> > > > > >       mutex_lock()
> > > > > >        [..]
> > > > > >         mmput()
> > > > > >          exit_mmap()
> > > > > >           remove_vma()
> > > > > >            fput();
> > > > > >             file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > > with regard to hanging the close for the "data file FD" (the file which has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fcntl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)
> 
> Dave suggested:
> 
> "I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released."
> 
> 	-- Dave https://lkml.org/lkml/2019/8/16/994

I think this may all be easier if there was a way to block a dup() if it comes
from an SCM_RIGHTS.  Does anyone know if that is easy?  I assume it would just
mean passing some flag through the dup() call chain.

Jason, if we did that would it break RDMA use cases?  I personally don't know
of any.  We could pass data back from vaddr_pin indicating that a file pin has
been taken and predicate the blocking of SCM_RIGHTS on that? 

Of course if the user called:

fd = open()
fcntl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() <pin>
munmap(addr...)
close(fd)
fork() <=== in another thread because close is hanging

Would that dup() "fd" above into the child?  Or maybe that would be part of the
work to make close() hang?  Ensure the fd/file is still in the FD table so it
gets dupped???

Ira


> 
> > 
> > IMHO the pin refcnt is held by the driver char dev FD, that is the
> > object you need to make it visible against.
> 
> I'm sorry but what do you mean by "driver char dev FD"?
> 
> > 
> > Why not just have a single table someplace of all the layout leases
> > with the file they are held on and the FD/socket/etc that is holding
> > the pin? Make it independent of processes and FDs?
> 
> If it is independent of processes how will we know which process is blocking
> the truncate?  Using a global table is an interesting idea but I still believe
> the users are going to want to track this to specific processes.  It's not
> clear to me how that would be done with a global table.
> 
> I agree the XDP/socket case is bothersome...  I was thinking that somewhere the
> fd of the socket could be hooked up in this case.  But taking a look at it
> reveals that is not going to be easy.  And I assume XDP has the same issue WRT
> SCM_RIGHTS and the ability to share the xdp context?
> 
> Ira
> 
> > 
> > Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:43                       ` John Hubbard
@ 2019-08-21 19:09                         ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-21 19:09 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Jason Gunthorpe,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 11:43:30AM -0700, John Hubbard wrote:
> On 8/19/19 8:36 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> > > On 8/19/19 6:20 PM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > ...
> > AFAIA, there is no struct file here - the memory that has been pinned
> > is just something mapped into the application's address space.
> > 
> > It seems to me that the socket here is equivalent of the RDMA handle
> > that owns the hardware that pins the pages. Again, that RDMA
> > handle is not aware of what the mapping represents, hence need to
> > hold a layout lease if it's a file mapping.
> > 
> > So from the filesystem perspective, there's no difference between
> > XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
> > into the filesystem's backing store and that will require use of
> > layout leases to perform safely.
> > 
> 
> OK, got it! Makes perfect sense.

Just to chime in here... Yea from the FS perspective it is the same.

But on the driver side it is more complicated because of how the references to
the pins can be shared among other processes.

See the other branch of this thread

https://lkml.org/lkml/2019/8/21/828

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:57                         ` Ira Weiny
  2019-08-21 19:06                           ` Ira Weiny
@ 2019-08-21 19:48                           ` Jason Gunthorpe
  2019-08-21 20:44                             ` Ira Weiny
  1 sibling, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-21 19:48 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:

> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fcntl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() <pin>
> munmap(addr...)
> close(fd)

I don't see how blocking close(fd) could work. Write it like this:

 fd = open()
 uverbs = open(/dev/uverbs)
 fcntl(...getlease...)
 addr = mmap(fd...)
 ib_reg_mr() <pin>
 munmap(addr...)
  <sigkill>

The order FD's are closed during sigkill is not deterministic, so when
all the fputs happen during a kill'd exit we could end up blocking in
close(fd) as close(uverbs) will come after in the close
list. close(uverbs) is the thing that does the dereg_mr and releases
the pin.
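
The ordering hazard described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions, not kernel
code: the names fd_kind and exit_would_block() are hypothetical, and the
model assumes that close() on the data file blocks while the pin is held
and that only close(uverbs) releases it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the two descriptor kinds from the example above. */
enum fd_kind { FD_DATA_FILE, FD_UVERBS };

/*
 * Walk the exit-time close order. Returns true if close() on the data
 * file is reached while the MR pin is still held, i.e. the exiting task
 * would block waiting for a release that only a later close(uverbs)
 * performs.
 */
bool exit_would_block(const enum fd_kind *order, size_t n)
{
	bool pin_held = true;	/* ib_reg_mr() took the pin before the kill */

	for (size_t i = 0; i < n; i++) {
		if (order[i] == FD_UVERBS)
			pin_held = false;	/* close(uverbs) does the dereg_mr */
		else if (order[i] == FD_DATA_FILE && pin_held)
			return true;	/* blocked; uverbs is never reached */
	}
	return false;
}
```

With the order { FD_DATA_FILE, FD_UVERBS } the function reports a block;
with the reverse order it does not, which is why a non-deterministic
close order is enough to cause the problem.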

We don't need complexity with dup to create problems.

Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 19:48                           ` Jason Gunthorpe
@ 2019-08-21 20:44                             ` Ira Weiny
  2019-08-21 23:49                               ` Jason Gunthorpe
  2019-08-23  3:23                               ` Dave Chinner
  0 siblings, 2 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-21 20:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> 
> > > Oh, I didn't think we were talking about that. Hanging the close of
> > > the datafile fd contingent on some other FD's closure is a recipe for
> > > deadlock..
> > 
> > The discussion between Jan and Dave was concerning what happens when a user
> > calls
> > 
> > fd = open()
> > fcntl(...getlease...)
> > addr = mmap(fd...)
> > ib_reg_mr() <pin>
> > munmap(addr...)
> > close(fd)
> 
> I don't see how blocking close(fd) could work.

Well Dave was saying this _could_ work.  FWIW I'm not 100% sure it will but I
can't prove it won't..  Maybe we are all just touching a different part of this
elephant[1] but the above scenario or one without munmap is very reasonably
something a user would do.  So we can either allow the close to complete (my
current patches) or try to make it block like Dave is suggesting.

I don't disagree with Dave with the semantics being nice and clean for the
filesystem.  But the fact that RDMA, and potentially others, can "pass the
pins" to other processes is something I spent a lot of time trying to work out.

>
> Write it like this:
> 
>  fd = open()
>  uverbs = open(/dev/uverbs)
>  fcntl(...getlease...)
>  addr = mmap(fd...)
>  ib_reg_mr() <pin>
>  munmap(addr...)
>   <sigkill>
> 
> The order FD's are closed during sigkill is not deterministic, so when
> all the fputs happen during a kill'd exit we could end up blocking in
> close(fd) as close(uverbs) will come after in the close
> list. close(uverbs) is the thing that does the dereg_mr and releases
> the pin.

Of course, that is a different scenario which needs to be fixed in my patch
set.  Now that my servers are back up I can hopefully make progress.  (Power
was down for them yesterday).

> 
> We don't need complexity with dup to create problems.

No but that complexity _will_ come unless we "zombie" layout leases.

Ira

[1] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

> 
> Jason
> 

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 20:44                             ` Ira Weiny
@ 2019-08-21 23:49                               ` Jason Gunthorpe
  2019-08-23  3:23                               ` Dave Chinner
  1 sibling, 0 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-21 23:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:

> > The order FD's are closed during sigkill is not deterministic, so when
> > all the fputs happen during a kill'd exit we could end up blocking in
> > close(fd) as close(uverbs) will come after in the close
> > list. close(uverbs) is the thing that does the dereg_mr and releases
> > the pin.
> 
> Of course, that is a different scenario which needs to be fixed in my patch
> set.  Now that my servers are back up I can hopefully make progress.  (Power
> was down for them yesterday).

It isn't really a different scenario, the problem is that the
filesystem fd must be closable independently of fencing the MR to avoid
deadlock cycles. Once you resolve that, the issue of the uverbs FD
outliving it won't matter one bit if it is in the same process or
another.

Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 18:02                     ` Ira Weiny
  2019-08-21 18:13                       ` Jason Gunthorpe
@ 2019-08-23  0:59                       ` Dave Chinner
  2019-08-23 17:15                         ` Ira Weiny
  1 sibling, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-23  0:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >    uverbs_destroy_ufile_hw()
> > > >       mutex_lock()
> > > >        [..]
> > > >         mmput()
> > > >          exit_mmap()
> > > >           remove_vma()
> > > >            fput();
> > > >             file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Process A has the RDMA context FD and data file FD (with lease) open.
> 
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
> 
> Admin kills process A.  kill works because we have allowed for it...
> 
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
> 
> Truncation still fails.
> 
> Admin does not know which process is holding the pin.
> 
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-21 20:44                             ` Ira Weiny
  2019-08-21 23:49                               ` Jason Gunthorpe
@ 2019-08-23  3:23                               ` Dave Chinner
  2019-08-23 12:04                                 ` Jason Gunthorpe
  2019-08-24  4:49                                 ` Ira Weiny
  1 sibling, 2 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-23  3:23 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > 
> > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > deadlock..
> > > 
> > > The discussion between Jan and Dave was concerning what happens when a user
> > > calls
> > > 
> > > fd = open()
> > > fcntl(...getlease...)
> > > addr = mmap(fd...)
> > > ib_reg_mr() <pin>
> > > munmap(addr...)
> > > close(fd)
> > 
> > I don't see how blocking close(fd) could work.
> 
> Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> can't prove it won't..

Right, I proposed it as a possible way of making sure application
developers don't do this. It _could_ be made to work (e.g. recording
longterm page pins on the vma->file), but this is tangential to 
the discussion of requiring active references to all resources
covered by the layout lease.

I think allowing applications to behave like the above is simply
poor system level design, regardless of the interaction with
filesystems and layout leases.

> Maybe we are all just touching a different part of this
> elephant[1] but the above scenario or one without munmap is very reasonably
> something a user would do.  So we can either allow the close to complete (my
> current patches) or try to make it block like Dave is suggesting.
> 
> I don't disagree with Dave with the semantics being nice and clean for the
> filesystem.

I'm not trying to make it "nice and clean for the filesystem".

The problem is not just RDMA/DAX - anything that is directly
accessing the block device under the filesystem has the same set of
issues. That is, the filesystem controls the life cycle of the
blocks in the block device, so direct access to the blocks by any
means needs to be co-ordinated with the filesystem. Pinning direct
access to a file via page pins attached to a hardware context that
the filesystem knows nothing about is not an access model that the
filesystems can support.

IOWs, anyone looking at this problem just from the RDMA POV of page
pins is not seeing all the other direct storage access mechanisms
that we need to support in the filesystems. RDMA on DAX is just one
of them.  pNFS is another. Remote access via NVMeOF is another. XDP
-> DAX (direct file data placement from the network hardware) is
another. There are /lots/ of different direct storage access
mechanisms that filesystems need to support and we sure as hell do
not want to have to support special case semantics for every single
one of them.

Hence if we don't start with a sane model for arbitrating direct
access to the storage at the filesystem level we'll never get this
stuff to work reliably, let alone work together coherently.  An
application that wants a direct data path to storage should have a
single API that enables them to safely access the storage,
regardless of how they are accessing the storage.

From that perspective, what we are talking about here with RDMA
doing "mmap, page pin, unmap, close" and "pass page pins via
SCM_RIGHTS" are fundamentally unworkable from the filesystem
perspective. They are use-after-free situations from the filesystem
perspective - they do not hold direct references to anything in the
filesystem, and so the filesystem is completely unaware of them.

The filesystem needs to be aware of /all users/ of its resources if
it's going to manage them sanely.  It needs to be able to correctly
coordinate modifications to ownership of the underlying storage with
all the users directly accessing that physical storage regardless of
the mechanism being used to access the storage.  IOWs, access
control must be independent of the mechanism used to gain access to
the storage hardware.

That's what file layout leases are defining - the filesystem support
model for allowing direct storage access from userspace. It's not
defining "RDMA/FSDAX" access rules, it's defining a generic direct
access model. And one of the rules in this model is "if you don't
have an active reference to the file layout, you are not allowed to
directly access the layout.".

Anything else is unsupportable from the filesystem perspective -
designing an access mechanism that allows userspace to retain access
indefinitely by relying on hidden details of kernel subsystem
implementations is a terrible architecture.  Not only does it bleed
kernel implementation into the API and the behavioural model, it
means we can't ever change that internal kernel behaviour because
userspace may be dependent on it. I shouldn't be having to point out
how bad this is from a system design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on our radar?" and working from
there.  So I'm not just looking at what we are doing right now, I'm
looking at 15 years down the track when we still have to support
layout leases and we've got hardware we haven't dreamed of yet.  If
the model is clean, simple, robust, implementation independent and
has well defined semantics, then it should stand the test of time.
i.e. the "nice and clean" semantics have nothing to do with the
filesystem per se, but everything to do with ensuring the mechanism
is generic and workable for direct storage access for a long time
into the future.

We can't force people to use layout leases - at all, let alone
correctly - but if you want filesystems and enterprise distros to
support direct access to filesystem controlled storage, then direct
access applications need to follow a sane set of rules that are
supportable at the filesystem level.

> But the fact that RDMA, and potentially others, can "pass the
> pins" to other processes is something I spent a lot of time trying to work out.

There's nothing in file layout lease architecture that says you
can't "pass the pins" to another process.  All the file layout lease
requirements say is that if you are going to pass a resource to
another process, one for which the layout lease guarantees access,
then the destination process must already have a valid, active layout
lease that covers the range of the pins being passed to it via the
RDMA handle.

i.e. as the pins pass from one process to another, they pass from
the protection of the lease process A holds to the protection that
the lease process B holds. This can probably even be done by
duplicating the lease fd and passing it by SCM_RIGHTS first.....
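
The rule above (the receiving process needs a valid, active lease that
covers the range of the pins) can be sketched as a simple range check.
This is an illustrative userspace model only; struct layout_lease,
struct pin_range, lease_covers() and may_receive_pins() are hypothetical
names, not anything from the patch set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a layout lease over a byte range of a file. */
struct layout_lease {
	bool active;
	uint64_t start;
	uint64_t len;
};

/* Hypothetical model of the byte range covered by the passed pins. */
struct pin_range {
	uint64_t start;
	uint64_t len;
};

/* True if the lease is active and fully contains the pinned range. */
bool lease_covers(const struct layout_lease *l, const struct pin_range *p)
{
	uint64_t rel;

	if (l == NULL || !l->active)
		return false;
	if (p->start < l->start)
		return false;
	/* Compare relative to the lease start so the sums cannot overflow. */
	rel = p->start - l->start;
	return rel <= l->len && p->len <= l->len - rel;
}

/* The destination may take over the pins only under its own lease. */
bool may_receive_pins(const struct layout_lease *dst,
		      const struct pin_range *p)
{
	return lease_covers(dst, p);
}
```

A destination with no lease at all (a NULL pointer in this model) is
refused, which corresponds to the "Process B does not have an active
layout lease" case discussed earlier in the thread.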

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-23  3:23                               ` Dave Chinner
@ 2019-08-23 12:04                                 ` Jason Gunthorpe
  2019-08-24  0:11                                   ` Dave Chinner
  2019-08-24  4:49                                 ` Ira Weiny
  1 sibling, 1 reply; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-23 12:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, Jan Kara, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:

> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work out.
> 
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process.  All the file layout lease
> requirements say is that if you are going to pass a resource to
> another process, one for which the layout lease guarantees access,
> then the destination process must already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.

How would the kernel detect and enforce this? There are many ways to
pass a FD.

IMHO it is wrong to try and create a model where the file lease exists
independently from the kernel object relying on it. In other words the
IB MR object itself should hold a reference to the lease it relies
upon to function properly.

Then we don't have to wreck the unix FD model to fit this in.
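
A minimal sketch of that ownership model, assuming a plain reference
count on the lease. All names here (struct layout_lease, struct mr,
lease_get(), lease_put(), mr_register(), mr_deregister()) are
hypothetical and only illustrate the lifetime; they are not the RDMA
core's API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lease object with a plain reference count. */
struct layout_lease {
	int refcount;
};

/* Hypothetical MR that remembers which lease it depends on. */
struct mr {
	struct layout_lease *lease;
};

void lease_get(struct layout_lease *l)
{
	l->refcount++;
}

/* Drops one reference; returns true when the last one went away. */
bool lease_put(struct layout_lease *l)
{
	return --l->refcount == 0;
}

/* Registering the MR takes its own reference on the lease. */
void mr_register(struct mr *mr, struct layout_lease *l)
{
	lease_get(l);
	mr->lease = l;
}

/* Deregistering drops it; returns true if the lease is now unused. */
bool mr_deregister(struct mr *mr)
{
	struct layout_lease *l = mr->lease;

	mr->lease = NULL;
	return lease_put(l);
}
```

In this model, closing the data file fd is just one lease_put(); the
lease stays alive until the MR is deregistered, wherever the uverbs FD
ended up.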

Jason

* Re: [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  2019-08-09 22:58 ` [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
@ 2019-08-23 15:18   ` Vivek Goyal
  2019-08-29 18:52     ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Vivek Goyal @ 2019-08-23 15:18 UTC (permalink / raw)
  To: ira.weiny
  Cc: Andrew Morton, Michal Hocko, Jan Kara, linux-nvdimm, linux-rdma,
	John Hubbard, Dave Chinner, linux-kernel, Matthew Wilcox,
	linux-xfs, Jason Gunthorpe, linux-fsdevel, Theodore Ts'o,
	linux-ext4, linux-mm

On Fri, Aug 09, 2019 at 03:58:20PM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Callers of dax_layout_busy_page() are only rarely operating on the
> entire file of concern.
> 
> Teach dax_layout_busy_page() to operate on a sub-range of the
> address_space provided.  Specifying 0 - ULONG_MAX however, will continue
> to operate on the "entire file" and XFS is split out to a separate patch
> by this method.
> 
> This could potentially speed up dax_layout_busy_page() as well.

I need this functionality as well for virtio_fs and posted a patch for
this.

https://lkml.org/lkml/2019/8/21/825

Given this is an optimization which existing users can benefit from already,
this patch could probably be pushed upstream independently.

> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from RFC v1
> 	Fix 0-day build errors
> 
>  fs/dax.c            | 15 +++++++++++----
>  fs/ext4/ext4.h      |  2 +-
>  fs/ext4/extents.c   |  6 +++---
>  fs/ext4/inode.c     | 19 ++++++++++++-------
>  fs/xfs/xfs_file.c   |  3 ++-
>  include/linux/dax.h |  6 ++++--
>  6 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a14ec32255d8..3ad19c384454 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -573,8 +573,11 @@ bool dax_mapping_is_dax(struct address_space *mapping)
>  EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
>  
>  /**
> - * dax_layout_busy_page - find first pinned page in @mapping
> + * dax_layout_busy_page - find first pinned page in @mapping within
> + *                        the range @off - @off + @len
>   * @mapping: address space to scan for a page with ref count > 1
> + * @off: offset to start at
> + * @len: length to scan through
>   *
>   * DAX requires ZONE_DEVICE mapped pages. These pages are never
>   * 'onlined' to the page allocator so they are considered idle when
> @@ -587,9 +590,13 @@ EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
>   * to be able to run unmap_mapping_range() and subsequently not race
>   * mapping_mapped() becoming true.
>   */
> -struct page *dax_layout_busy_page(struct address_space *mapping)
> +struct page *dax_layout_busy_page(struct address_space *mapping,
> +				  loff_t off, loff_t len)
>  {
> -	XA_STATE(xas, &mapping->i_pages, 0);
> +	unsigned long start_idx = off >> PAGE_SHIFT;
> +	unsigned long end_idx = (len == ULONG_MAX) ? ULONG_MAX
> +				: start_idx + (len >> PAGE_SHIFT);
> +	XA_STATE(xas, &mapping->i_pages, start_idx);
>  	void *entry;
>  	unsigned int scanned = 0;
>  	struct page *page = NULL;
> @@ -612,7 +619,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
>  	unmap_mapping_range(mapping, 0, 0, 1);

Should we unmap only those pages which fall in the range specified by the
caller?  Unmapping the whole file seems less efficient.
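
As a rough illustration of the range arithmetic in the hunk above, here is a
userspace sketch (not kernel code) of how a byte offset and length map to the
page indices the scan starts and stops at.  The names byte_range_to_idx(),
struct idx_range, SKETCH_PAGE_SHIFT and SKETCH_WHOLE_FILE are made up for this
example, and 4 KiB pages are assumed.  For comparison, the quoted
unmap_mapping_range(mapping, 0, 0, 1) call passes a hole length of 0, which
that function treats as "to the end of the file".

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define SKETCH_WHOLE_FILE UINT64_MAX    /* stand-in for the "entire file" length */

/* Hypothetical helper type: page-index bounds derived from a byte range. */
struct idx_range {
	uint64_t start;   /* index of the page containing the first byte */
	uint64_t end;     /* start + whole pages in len, or UINT64_MAX */
};

/*
 * Mirror the arithmetic of the quoted hunk: shift the offset down to a
 * page index, and add the number of whole pages covered by len unless
 * the caller asked for the entire file.
 */
static struct idx_range byte_range_to_idx(uint64_t off, uint64_t len)
{
	struct idx_range r;

	r.start = off >> SKETCH_PAGE_SHIFT;
	if (len == SKETCH_WHOLE_FILE)
		r.end = UINT64_MAX;
	else
		r.end = r.start + (len >> SKETCH_PAGE_SHIFT);
	return r;
}
```

With off = 8192 and len = 16384 this gives start = 2 and end = 6.  A len
shorter than one page leaves end equal to start, because the shift rounds
down.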

Thanks
Vivek

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-23  0:59                       ` Dave Chinner
@ 2019-08-23 17:15                         ` Ira Weiny
  2019-08-24  0:18                           ` Dave Chinner
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-23 17:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >    uverbs_destroy_ufile_hw()
> > > > >       mutex_lock()
> > > > >        [..]
> > > > >         mmput()
> > > > >          exit_mmap()
> > > > >           remove_vma()
> > > > >            fput();
> > > > >             file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Process A has the RDMA context FD and data file FD (with lease) open.
> > 
> > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> 
> Passing the RDMA context dependent on a file layout lease to another
> process that doesn't have a file layout lease or a reference to the
> original lease should be considered a violation of the layout lease.
> Process B does not have an active layout lease, and so by the rules
> of layout leases, it is not allowed to pin the layout of the file.
> 

I don't disagree with the semantics of this.  I just don't know how to enforce
it.

> > Process A attempts to exit (hangs because data file FD is pinned).
> > 
> > Admin kills process A.  kill works because we have allowed for it...
> > 
> > Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> > file pins.
> > 
> > Truncation still fails.
> > 
> > Admin does not know which process is holding the pin.
> > 
> > What am I missing?
> 
> Application does not hold the correct file layout lease references.
> Passing the fd via SCM_RIGHTS to a process without a layout lease
> is equivalent to not using layout leases in the first place.

OK, so if I understand you correctly, you would support a failure of SCM_RIGHTS
in this case?  I'm OK with that, but I'm not sure how to implement it right now.

To that end, I would like to simplify this slightly because I'm not convinced
that SCM_RIGHTS is a problem we need to solve right now, i.e. I don't know of a
user who wants to do this.

For now, duplication via SCM_RIGHTS could simply be made to fail if _any_ file
pins (and by definition leases) exist underneath the "RDMA FD" (or other direct
access FD, like XDP, etc.) being duplicated.  Later, if this becomes a use case,
we will need to code up the proper checks, potentially within each of the
subsystems.  This is because, with RDMA at least, there are potentially large
numbers of MRs and file leases which may have to be checked.
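
For readers less familiar with the mechanism under discussion, this is roughly
what passing a descriptor with SCM_RIGHTS looks like from user space.  It is a
minimal sketch: send_fd() and recv_fd() are helper names chosen for this
example, and any open descriptor stands in for the "RDMA FD".  The receiver
gets a new descriptor number that refers to the same open file description,
which is why anything held under that description can outlive the sender.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd_to_send over the Unix-domain socket sock.  Returns 0 or -1. */
static int send_fd(int sock, int fd_to_send)
{
	char byte = 0;   /* at least one byte of real data must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock.  Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In practice the two ends would sit in different processes connected by an
AF_UNIX socket; a socketpair() within one process is enough to try it out.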

Ira


* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-23 12:04                                 ` Jason Gunthorpe
@ 2019-08-24  0:11                                   ` Dave Chinner
  2019-08-24  5:08                                     ` Ira Weiny
  2019-08-25 19:39                                     ` Jason Gunthorpe
  0 siblings, 2 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-24  0:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ira Weiny, Jan Kara, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> 
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to work out.
> > 
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process.  All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
> 
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and its
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference-counted objects - they are attached to a struct
file but, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications.  i.e. the layout break signal gets delivered to
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with
the new fd in the new process, and so layout break signals can't be
directed at the lease fd....

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that's where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-23 17:15                         ` Ira Weiny
@ 2019-08-24  0:18                           ` Dave Chinner
  0 siblings, 0 replies; 110+ messages in thread
From: Dave Chinner @ 2019-08-24  0:18 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote:
> On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on close() for the
> > > > > > MR holding the page pins to be destroyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >    uverbs_destroy_ufile_hw()
> > > > > >       mutex_lock()
> > > > > >        [..]
> > > > > >         mmput()
> > > > > >          exit_mmap()
> > > > > >           remove_vma()
> > > > > >            fput();
> > > > > >             file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) is safe
> > > with regard to hanging the close for the "data file FD" (the file which has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > > 
> > > Process A has the RDMA context FD and data file FD (with lease) open.
> > > 
> > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> > 
> > Passing the RDMA context dependent on a file layout lease to another
> > process that doesn't have a file layout lease or a reference to the
> > original lease should be considered a violation of the layout lease.
> > Process B does not have an active layout lease, and so by the rules
> > of layout leases, it is not allowed to pin the layout of the file.
> > 
> 
> I don't disagree with the semantics of this.  I just don't know how to enforce
> it.
> 
> > > Process A attempts to exit (hangs because data file FD is pinned).
> > > 
> > > Admin kills process A.  kill works because we have allowed for it...
> > > 
> > > Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> > > file pins.
> > > 
> > > Truncation still fails.
> > > 
> > > Admin does not know which process is holding the pin.
> > > 
> > > What am I missing?
> > 
> > Application does not hold the correct file layout lease references.
> > Passing the fd via SCM_RIGHTS to a process without a layout lease
> > is equivalent to not using layout leases in the first place.
> 
> Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
> in this case?  I'm ok with that but not sure how to implement it right now.
> 
> To that end, I would like to simplify this slightly because I'm not convinced
> that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
> user who wants to do this.

I don't think we can support it, let alone want to. SCM_RIGHTS was a
mistake made years ago that has been causing bugs, and complexity to
try and avoid those bugs, ever since.  I'm only talking about it
because someone else raised it and I assumed they raised it because
they want it to "work".

> Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
> definition leases) exist underneath the "RDMA FD" (or other direct access FD,
> like XDP etc) being duplicated.

Sounds like a fine idea to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-23  3:23                               ` Dave Chinner
  2019-08-23 12:04                                 ` Jason Gunthorpe
@ 2019-08-24  4:49                                 ` Ira Weiny
  2019-08-25 19:40                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-24  4:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > > 
> > > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > > deadlock..
> > > > 
> > > > The discussion between Jan and Dave was concerning what happens when a user
> > > > calls
> > > > 
> > > > fd = open()
> > > > fcntl(...getlease...)
> > > > addr = mmap(fd...)
> > > > ib_reg_mr() <pin>
> > > > munmap(addr...)
> > > > close(fd)
> > > 
> > > I don't see how blocking close(fd) could work.
> > 
> > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> > can't prove it won't..
> 
> Right, I proposed it as a possible way of making sure application
> developers don't do this. It _could_ be made to work (e.g. recording
> longterm page pins on the vma->file), but this is tangential to 
> the discussion of requiring active references to all resources
> covered by the layout lease.
> 
> I think allowing applications to behave like the above is simply
> poor system level design, regardless of the interaction with
> filesystems and layout leases.
> 
> > Maybe we are all just touching a different part of this
> > elephant[1] but the above scenario or one without munmap is very reasonably
> > something a user would do.  So we can either allow the close to complete (my
> > current patches) or try to make it block like Dave is suggesting.

My belief when writing the current series was that hanging the close would
cause deadlock.  But it seems I was wrong because of the delayed __fput().

So far, I have not been able to get RDMA to have an issue like Jason suggested
would happen (or used to happen).  So from that perspective it may be ok to
hang the close.

> > 
> > I don't disagree with Dave with the semantics being nice and clean for the
> > filesystem.
> 
> I'm not trying to make it "nice and clean for the filesystem".
> 
> The problem is not just RDMA/DAX - anything that is directly
> accessing the block device under the filesystem has the same set of
> issues. That is, the filesystem controls the life cycle of the
> blocks in the block device, so direct access to the blocks by any
> means needs to be co-ordinated with the filesystem. Pinning direct
> access to a file via page pins attached to a hardware context that
> the filesystem knows nothing about is not an access model that the
> filesystems can support.
> 
> IOWs, anyone looking at this problem just from the RDMA POV of page
> pins is not seeing all the other direct storage access mechanisms
> that we need to support in the filesystems. RDMA on DAX is just one
> of them.  pNFS is another. Remote access via NVMeOF is another. XDP
> -> DAX (direct file data placement from the network hardware) is
> another. There are /lots/ of different direct storage access
> mechanisms that filesystems need to support and we sure as hell do
> not want to have to support special case semantics for every single
> one of them.

My use of struct file was based on the fact that FDs are a primary interface
for Linux, and my thought was that they would be more universal than having file
pin information stored in an RDMA-specific structure.

XDP is not as direct; it uses sockets.  But sockets also have a struct file
which I believe could be used in a similar manner.  I'm not 100% sure of the
xdp_umem lifetime yet but it seems that my choice of using struct file was a
good one in this respect.

> 
> Hence if we don't start with a sane model for arbitrating direct
> access to the storage at the filesystem level we'll never get this
> stuff to work reliably, let alone work together coherently.  An
> application that wants a direct data path to storage should have a
> single API that enables them to safely access the storage,
> regardless of how they are accessing the storage.
> 
> From that perspective, what we are talking about here with RDMA
> doing "mmap, page pin, unmap, close" and "pass page pins via
> SCM_RIGHTS" are fundamentally unworkable from the filesystem
> perspective. They are use-after-free situations from the filesystem
> perspective - they do not hold direct references to anything in the
> filesystem, and so the filesystem is completely unaware of them.

I see your point of view but looking at it from a different point of view I
don't see this as a "use after free".

The user has explicitly registered this memory (and layout) with another direct
access subsystem (RDMA for example) so why do they need to keep the FD around?

> 
> The filesystem needs to be aware of /all users/ of its resources if
> it's going to manage them sanely.

From the way I look at it, the underlying filesystem _is_ aware of the leases
with my patch set.  And so too is the user.  It is just not through the original
"data file fd".

And the owner of the lease becomes the subsystem object ("RDMA FD" in this
case) which is holding the pins.  Furthermore, the lease is maintained and
transferred automatically through the normal FD processing.

(Furthermore, tracking of these pins is available for whatever subsystem by
tracking them with struct file; _not_ just RDMA).  When those subsystem objects
are released the "data file lease" will be released as well.  That was the
design.

> 
> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work out.
> 
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process.  All the file layout lease
> requirements say is that if you are going to pass a resource for
> which the layout lease guarantees access for to another process,
> then the destination process already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.
> 
> i.e. as the pins pass from one process to another, they pass from
> the protection of the lease process A holds to the protection that
> the lease process B holds. This can probably even be done by
> duplicating the lease fd and passing it by SCM_RIGHTS first.....

My worry with this is how to enforce it.  As I said in the other thread I think
we could potentially block SCM_RIGHTS use in the short term.  But I'm not sure
about blocking every call which may "dup()" an FD to random processes.

Ira


* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-24  0:11                                   ` Dave Chinner
@ 2019-08-24  5:08                                     ` Ira Weiny
  2019-08-26  5:55                                       ` Dave Chinner
  2019-08-25 19:39                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-24  5:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and its
> resources.
> 
> > IMHO it is wrong to try and create a model where the file lease exists
> > independently from the kernel object relying on it. In other words the
> > IB MR object itself should hold a reference to the lease it relies
> > upon to function properly.
> 
> That still doesn't work. Leases are not individually trackable or
> reference-counted objects - they are attached to a struct
> file but, in reality, they are far more restricted than a struct
> file.
> 
> That is, a lease specifically tracks the pid and the _open fd_ it
> was obtained for, so it is essentially owned by a specific process
> context.  Hence a lease is not able to be passed to a separate
> process context and have it still work correctly for lease break
> notifications.  i.e. the layout break signal gets delivered to
> original process that created the struct file, if it still exists
> and has the original fd still open. It does not get sent to the
> process that currently holds a reference to the IB context.
>

The fcntl man page says:

"Leases are associated with an open file description (see open(2)).  This means
that duplicate file descriptors (created by, for example, fork(2) or dup(2))
refer to the same lease, and this lease may be modified or released using any
of these descriptors.  Furthermore,  the lease is released by either an
explicit F_UNLCK operation on any of these duplicate file descriptors, or when
all such file descriptors have been closed."

From this I took it that the child process FD would have the lease as well
_and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
seem to work the same way as dup() so I'm not so sure.
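
A small sketch of the dup(2) case the man page describes, assuming a local
filesystem that supports leases and a file owned by the caller.
lease_shared_by_dup() is a name invented for this example, and it uses an
ordinary F_RDLCK lease rather than the layout lease from this series.  It does
not cover the SCM_RIGHTS case.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* F_SETLEASE and F_GETLEASE are Linux-specific */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Take a read lease through one descriptor and query it through a
 * dup()ed one.
 *
 * Returns  1 if the duplicate reports the same lease,
 *          0 if it does not,
 *         -1 if the file could not be opened or the lease could not
 *            be taken (e.g. the filesystem does not support leases).
 */
static int lease_shared_by_dup(const char *path)
{
	int fd, fd2, seen, ret = -1;

	/* A read lease needs a file with no writers, so open read-only. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	fd2 = dup(fd);   /* same open file description, new descriptor */
	if (fd2 < 0) {
		close(fd);
		return -1;
	}

	if (fcntl(fd, F_SETLEASE, F_RDLCK) == 0) {
		seen = fcntl(fd2, F_GETLEASE);
		ret = (seen == F_RDLCK) ? 1 : 0;
		/* Release through the duplicate, as the man page allows. */
		fcntl(fd2, F_SETLEASE, F_UNLCK);
	}

	close(fd2);
	close(fd);
	return ret;
}
```

A return of 1 means the lease taken via the first descriptor was reported by
F_GETLEASE on the duplicate; -1 means no lease could be taken at all.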

Ira

> 
> So while a struct file passed to another process might still have
> an active lease, and you can change the owner of the struct file
> via fcntl(F_SETOWN), you can't associate the existing lease with
> the new fd in the new process, and so layout break signals can't be
> directed at the lease fd....
> 
> This really means that a lease can only be owned by a single process
> context - it can't be shared across multiple processes (so I was
> wrong about dup/pass as being a possible way of passing them)
> because there's only one process that can "own" a struct file, and
> that's where signals are sent when the lease needs to be broken.
> 
> So, fundamentally, if you want to pass a resource that pins a file
> layout between processes, both processes need to hold a layout lease
> on that file range. And that means exclusive leases and passing
> layouts between processes are fundamentally incompatible because you
> can't hold two exclusive leases on the same file range....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-24  0:11                                   ` Dave Chinner
  2019-08-24  5:08                                     ` Ira Weiny
@ 2019-08-25 19:39                                     ` Jason Gunthorpe
  1 sibling, 0 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-25 19:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, Jan Kara, Andrew Morton, Dan Williams, Matthew Wilcox,
	Theodore Ts'o, John Hubbard, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and its
> resources.

It is a kernel problem: the MR exists and is doing DMA. That relies on
the lease to prevent data corruption.

The sanest outcome I could suggest is that when the kernel detects the
MR has outlived the lease it needs, we forcibly abort the entire
RDMA state, i.e. the application has malfunctioned and gets whacked with
a very big hammer.

> That still doesn't work. Leases are not individually trackable or
> reference-counted objects - they are attached to a struct
> file but, in reality, they are far more restricted than a struct
> file.

This is the problem. How to link something that is not refcounted to
the refcounted world of file descriptors is not at all obvious.

There are too many places where struct file relies on its refcounting
to try and plug them.

Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-24  4:49                                 ` Ira Weiny
@ 2019-08-25 19:40                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-08-25 19:40 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 09:49:12PM -0700, Ira Weiny wrote:

> So far, I have not been able to get RDMA to have an issue like Jason suggested
> would happen (or used to happen).  So from that perspective it may be ok to
> hang the close.

No, it is not OK to hang the close. You will deadlock on process
destruction when the 'lease fd' hangs waiting for the 'uverbs fd',
which is later in the single-threaded destruction sequence.

This is different from the uverbs deadlock I outlined.
Jason

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-24  5:08                                     ` Ira Weiny
@ 2019-08-26  5:55                                       ` Dave Chinner
  2019-08-29  2:02                                         ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-08-26  5:55 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > > 
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to work out.
> > > > 
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process.  All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access for to another process,
> > > > then the destination process already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > > 
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> > 
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and its
> > resources.
> > 
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> > 
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects - they are attached to a struct
> > file but, in reality, they are far more restricted than a struct
> > file.
> > 
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context.  Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications.  i.e. the layout break signal gets delivered to
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
> 
> The fcntl man page says:
> 
> "Leases are associated with an open file description (see open(2)).  This means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors.  Furthermore,  the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
where-ever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/
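
The "lease follows the struct file" part can be checked from userspace.
What follows is only a minimal sketch: the helper name lease_follows_dup()
and the temp file path are made up for illustration, and it uses a plain
F_WRLCK lease since the layout lease flags from this RFC are not in any
released kernel.  It assumes a filesystem that supports leases and returns
-1 where that does not hold.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* F_SETLEASE/F_GETLEASE are Linux-specific */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_SETLEASE		/* fallback if _GNU_SOURCE was defined too late */
#define F_SETLEASE 1024
#define F_GETLEASE 1025
#endif

/*
 * Returns 1 if a lease taken on one fd is visible and releasable through
 * a dup()'d fd, 0 if it is not, and -1 if the demo could not run here
 * (no temp file, or leases not supported on this filesystem).
 */
int lease_follows_dup(void)
{
	char path[] = "/tmp/lease-demo-XXXXXX";	/* hypothetical location */
	int fd = mkstemp(path);
	int fd2, seen, after, ret = -1;

	if (fd < 0)
		return -1;

	if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0)
		goto out;		/* e.g. no lease support here */

	fd2 = dup(fd);			/* same struct file, new fd number */
	if (fd2 < 0)
		goto out;

	seen = fcntl(fd2, F_GETLEASE);		/* lease seen via the duplicate */
	fcntl(fd2, F_SETLEASE, F_UNLCK);	/* release via the duplicate */
	after = fcntl(fd, F_GETLEASE);		/* state seen via the original */
	close(fd2);

	ret = (seen == F_WRLCK && after == F_UNLCK);
out:
	close(fd);
	unlink(path);
	return ret;
}
```

This only shows that the lease state is shared between duplicates of one
open file description.  It says nothing about who gets notified on a
lease break.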

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

	send_sigio(fown, fa->fa_fd, band);

And it does for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.
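
To make the single-owner limitation concrete, here is a toy userspace
model.  It is not kernel code: struct toy_lease, toy_lease_add_fd(),
toy_lease_break() and the pid/fd numbers are all invented for
illustration.  It only mimics the shape of the walk described above,
where every recorded fd produces a signal but all of them go to the one
recorded owner.

```c
#define TOY_MAX_FDS 4

/* Toy stand-in for the lease's owner and its list of fasync entries. */
struct toy_lease {
	int owner_pid;			/* one owner, like file->f_owner */
	int fa_fd[TOY_MAX_FDS];		/* fd numbers recorded at F_SETLEASE */
	int nr_fds;
};

struct toy_signal {
	int pid;			/* process that is signalled */
	int fd;				/* fd number reported to it */
};

/* Record one more fd on the lease; the owner does not change. */
int toy_lease_add_fd(struct toy_lease *l, int fd)
{
	if (l->nr_fds >= TOY_MAX_FDS)
		return -1;
	l->fa_fd[l->nr_fds++] = fd;
	return 0;
}

/* Model of the break walk: one signal per recorded fd, all to owner_pid. */
int toy_lease_break(const struct toy_lease *l, struct toy_signal *out, int max)
{
	int i, n = 0;

	for (i = 0; i < l->nr_fds && n < max; i++) {
		out[n].pid = l->owner_pid;
		out[n].fd = l->fa_fd[i];
		n++;
	}
	return n;
}

/*
 * Scenario: pid 100 takes the lease on fd 3 and adds fd 7, then passes
 * the struct file to pid 200.  Returns how many signals @pid receives
 * when the lease is broken.
 */
int toy_signals_sent_to(int pid)
{
	struct toy_lease l = { .owner_pid = 100 };
	struct toy_signal sig[TOY_MAX_FDS];
	int i, n, hits = 0;

	toy_lease_add_fd(&l, 3);
	toy_lease_add_fd(&l, 7);
	/* pid 200 now holds the file too, but nothing updates the lease */

	n = toy_lease_break(&l, sig, TOY_MAX_FDS);
	for (i = 0; i < n; i++)
		if (sig[i].pid == pid)
			hits++;
	return hits;
}
```

In this model the owner gets one signal per recorded fd and the
receiving process gets none, which is the gap described above.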

Yes, you can change the owning process via F_SETOWN, but that's
still only a single process context, and you can't change the fd in
the fasync list. You can add new fd to an existing lease by calling
F_SETLEASE on the new fd, but you still only have a single process
owner context for signal delivery.

As such, leases that require callbacks to userspace are currently
only valid within the process context the lease was taken in.
Indeed, even closing the fd the lease was taken on without
F_UNLCKing it first doesn't mean the lease has been torn down if
there is some other reference to the struct file. That means the
original lease owner will still get SIGIO delivered to that fd on a
lease break regardless of whether it is open or not. And if we
implement "layout lease not released within SIGIO response timeout"
then that process will get killed, despite the fact it may not even
have a reference to that file anymore.

So, AFAICT, leases that require userspace callbacks only work within
their original process context while the original fd is still open.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-14 21:56     ` Dave Chinner
@ 2019-08-26 10:41       ` Jeff Layton
  2019-08-29 23:34         ` Ira Weiny
  0 siblings, 1 reply; 110+ messages in thread
From: Jeff Layton @ 2019-08-26 10:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: ira.weiny, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Jan Kara, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Thu, 2019-08-15 at 07:56 +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> > On Fri, 2019-08-09 at 15:58 -0700, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Add an exclusive lease flag which indicates that the layout mechanism
> > > can not be broken.
> > > 
> > > Exclusive layout leases allow the file system to know that pages may be
> > > GUP pinned and that attempts to change the layout, ie truncate, should be
> > > failed.
> > > 
> > > A process which attempts to break its own exclusive lease gets an
> > > EDEADLOCK return to help determine that this is likely a programming bug
> > > vs someone else holding a resource.
> .....
> > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > > index baddd54f3031..88b175ceccbc 100644
> > > --- a/include/uapi/asm-generic/fcntl.h
> > > +++ b/include/uapi/asm-generic/fcntl.h
> > > @@ -176,6 +176,8 @@ struct f_owner_ex {
> > >  
> > >  #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
> > >  				   RDMA */
> > > +#define F_EXCLUSIVE	32      /* layout lease is exclusive */
> > > +				/* FIXME or shoudl this be F_EXLCK??? */
> > >  
> > >  /* operations for bsd flock(), also used by the kernel implementation */
> > >  #define LOCK_SH		1	/* shared lock */
> > 
> > This interface just seems weird to me. The existing F_*LCK values aren't
> > really set up to be flags, but are enumerated values (even if there are
> > some gaps on some arches). For instance, on parisc and sparc:
> 
> I don't think we need to worry about this - the F_WRLCK version of
> the layout lease should have these exclusive access semantics (i.e
> other ops fail rather than block waiting for lease recall) and hence
> the API shouldn't need a new flag to specify them.
> 
> i.e. the primary difference between F_RDLCK and F_WRLCK layout
> leases is that the F_RDLCK is a shared, co-operative lease model
> where only delays in operations will be seen, while F_WRLCK is a
> "guarantee exclusive access and I don't care what it breaks"
> model... :)
> 

Not exactly...

F_WRLCK and F_RDLCK leases can both be broken, and will eventually time
out if there is conflicting access. The F_EXCLUSIVE flag on the other
hand is there to prevent any sort of lease break from 

I'm guessing what Ira really wants with the F_EXCLUSIVE flag is
something akin to what happens when we set fl_break_time to 0 in the
nfsd code. nfsd never wants the locks code to time out a lease of any
sort, since it handles that timeout itself.

If you're going to add this functionality, it'd be good to also convert
knfsd to use it as well, so we don't end up with multiple ways to deal
with that situation.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-26  5:55                                       ` Dave Chinner
@ 2019-08-29  2:02                                         ` Ira Weiny
  2019-08-29  3:27                                           ` John Hubbard
  2019-09-02 22:26                                           ` Dave Chinner
  0 siblings, 2 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-29  2:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > >
> > > > IMHO it is wrong to try and create a model where the file lease exists
> > > > independently from the kernel object relying on it. In other words the
> > > > IB MR object itself should hold a reference to the lease it relies
> > > > upon to function properly.
> > > 
> > > That still doesn't work. Leases are not individually trackable or
> > > reference counted objects - they are attached to a struct
> > > file but, in reality, they are far more restricted than a struct
> > > file.
> > > 
> > > That is, a lease specifically tracks the pid and the _open fd_ it
> > > was obtained for, so it is essentially owned by a specific process
> > > context.  Hence a lease is not able to be passed to a separate
> > > process context and have it still work correctly for lease break
> > > notifications.  i.e. the layout break signal gets delivered to
> > > original process that created the struct file, if it still exists
> > > and has the original fd still open. It does not get sent to the
> > > process that currently holds a reference to the IB context.

But this is an exclusive layout lease which does not send a signal.  There is
no way to break it.

> > >
> > 
> > The fcntl man page says:
> > 
> > "Leases are associated with an open file description (see open(2)).  This means
> > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > refer to the same lease, and this lease may be modified or released using any
> > of these descriptors.  Furthermore,  the lease is released by either an
> > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > all such file descriptors have been closed."
> 
> Right, the lease is attached to the struct file, so it follows
> where-ever the struct file goes. That doesn't mean it's actually
> useful when the struct file is duplicated and/or passed to another
> process. :/
> 
> AFAICT, the problem is that when we take another reference to the
> struct file, or when the struct file is passed to a different
> process, nothing updates the lease or lease state attached to that
> struct file.

Ok, I probably should have made this more clear in the cover letter but _only_
the process which took the lease can actually pin memory.

That pinned memory _can_ be passed to another process but those sub-processes can
_not_ use the original lease to pin _more_ of the file.  They would need to
take their own lease to do that.

Sorry for not being clear on that.

> 
> > From this I took it that the child process FD would have the lease as well
> > _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
> > seem to work the same way as dup() so I'm not so sure.
> 
> Sure, that part works because the struct file is passed. It doesn't
> end up with the same fd number in the other process, though.
> 
> The issue is that layout leases need to notify userspace when they
> are broken by the kernel, so a lease stores the owner pid/tid in the
> file->f_owner field via __f_setown(). It also keeps a struct fasync
> attached to the file_lock that records the fd that the lease was
> created on.  When a signal needs to be sent to userspace for that
> lease, we call kill_fasync() and that walks the list of fasync
> structures on the lease and calls:
> 
> 	send_sigio(fown, fa->fa_fd, band);
> 
> And it does for every fasync struct attached to a lease. Yes, a
> lease can track multiple fds, but it can only track them in a single
> process context. The moment the struct file is shared with another
> process, the lease is no longer capable of sending notifications to
> all the lease holders.
> 
> Yes, you can change the owning process via F_SETOWN, but that's
> still only a single process context, and you can't change the fd in
> the fasync list. You can add new fd to an existing lease by calling
> F_SETLEASE on the new fd, but you still only have a single process
> owner context for signal delivery.
> 
> As such, leases that require callbacks to userspace are currently
> only valid within the process context the lease was taken in.

But for long term pins we are not requiring callbacks.

> Indeed, even closing the fd the lease was taken on without
> F_UNLCKing it first doesn't mean the lease has been torn down if
> there is some other reference to the struct file. That means the
> original lease owner will still get SIGIO delivered to that fd on a
> lease break regardless of whether it is open or not. And if we
> implement "layout lease not released within SIGIO response timeout"
> then that process will get killed, despite the fact it may not even
> have a reference to that file anymore.

I'm not seeing that as a problem.  This is all a result of the application
failing to do the right thing.  The code here is simply keeping the kernel
consistent and safe so that an admin or the user themselves can unwind the
badness without damage to the file system.

> 
> So, AFAICT, leases that require userspace callbacks only work within
> their original process context while the original fd is still open.

But they _work_ IFF the application actually expects to do something with the
SIGIO.  The application could just as well choose to ignore the SIGIO without
closing the FD which would do the same thing.

If the application expected to do something with the SIGIO but closed the FD
then it's really just the application's fault.

So after thinking on this for a day I don't think we have a serious issue.

Even the "zombie" lease is just an application error and it is already possible
to get something like this.  If the application passes the FD to another
process and closes their FD then SIGIO's don't get delivered but there is a
lease hanging off the struct file until it is destroyed.  No harm, no foul.

In the case of close it is _not_ true that users don't have a way to release
the lease.  It is just that they can't call F_UNLCK to do so.  Once they have
"zombie'ed" the lease (again an application error) the only recourse is to
unpin the file through the subsystem which pinned the page.  Probably through
killing the process.

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-29  2:02                                         ` Ira Weiny
@ 2019-08-29  3:27                                           ` John Hubbard
  2019-08-29 16:16                                             ` Ira Weiny
  2019-09-02 22:26                                           ` Dave Chinner
  1 sibling, 1 reply; 110+ messages in thread
From: John Hubbard @ 2019-08-29  3:27 UTC (permalink / raw)
  To: Ira Weiny, Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, Michal Hocko, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/28/19 7:02 PM, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
>> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
>>> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
>>>> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
...
>>
>> Sure, that part works because the struct file is passed. It doesn't
>> end up with the same fd number in the other process, though.
>>
>> The issue is that layout leases need to notify userspace when they
>> are broken by the kernel, so a lease stores the owner pid/tid in the
>> file->f_owner field via __f_setown(). It also keeps a struct fasync
>> attached to the file_lock that records the fd that the lease was
>> created on.  When a signal needs to be sent to userspace for that
>> lease, we call kill_fasync() and that walks the list of fasync
>> structures on the lease and calls:
>>
>> 	send_sigio(fown, fa->fa_fd, band);
>>
>> And it does for every fasync struct attached to a lease. Yes, a
>> lease can track multiple fds, but it can only track them in a single
>> process context. The moment the struct file is shared with another
>> process, the lease is no longer capable of sending notifications to
>> all the lease holders.
>>
>> Yes, you can change the owning process via F_SETOWN, but that's
>> still only a single process context, and you can't change the fd in
>> the fasync list. You can add new fd to an existing lease by calling
>> F_SETLEASE on the new fd, but you still only have a single process
>> owner context for signal delivery.
>>
>> As such, leases that require callbacks to userspace are currently
>> only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.
> 

Hi Ira,

If "require callbacks to userspace" means sending SIGIO, then actually
FOLL_LONGTERM *does* require those callbacks. Because we've been, so
far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.

What am I missing here?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-29  3:27                                           ` John Hubbard
@ 2019-08-29 16:16                                             ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-29 16:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dave Chinner, Jason Gunthorpe, Jan Kara, Andrew Morton,
	Dan Williams, Matthew Wilcox, Theodore Ts'o, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 28, 2019 at 08:27:23PM -0700, John Hubbard wrote:
> On 8/28/19 7:02 PM, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> ...
> > > 
> > > Sure, that part works because the struct file is passed. It doesn't
> > > end up with the same fd number in the other process, though.
> > > 
> > > The issue is that layout leases need to notify userspace when they
> > > are broken by the kernel, so a lease stores the owner pid/tid in the
> > > file->f_owner field via __f_setown(). It also keeps a struct fasync
> > > attached to the file_lock that records the fd that the lease was
> > > created on.  When a signal needs to be sent to userspace for that
> > > lease, we call kill_fasync() and that walks the list of fasync
> > > structures on the lease and calls:
> > > 
> > > 	send_sigio(fown, fa->fa_fd, band);
> > > 
> > > And it does for every fasync struct attached to a lease. Yes, a
> > > lease can track multiple fds, but it can only track them in a single
> > > process context. The moment the struct file is shared with another
> > > process, the lease is no longer capable of sending notifications to
> > > all the lease holders.
> > > 
> > > Yes, you can change the owning process via F_SETOWN, but that's
> > > still only a single process context, and you can't change the fd in
> > > the fasync list. You can add new fd to an existing lease by calling
> > > F_SETLEASE on the new fd, but you still only have a single process
> > > owner context for signal delivery.
> > > 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> > 
> 
> Hi Ira,
> 
> If "require callbacks to userspace" means sending SIGIO, then actually
> FOLL_LONGTERM *does* require those callbacks. Because we've been, so
> far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.
> 
> What am I missing here?

We agreed back in June that the layout lease would have 2 "levels".  The
"normal" layout lease would cause SIGIO and could be broken and another
"exclusive" level which could _not_ be broken.

Because we _can't_ _trust_ user space to react to the SIGIO properly the
"exclusive" lease is required to take the longterm pins.  Also this is the
lease which causes the truncate to fail (return ETXTBSY) because the kernel
can't break the lease.

The vaddr_pin struct in the current RFC is there for a couple of reasons.

1) To ensure that we have a way to correlate the long term pin user with the
   file if the data file FD's are closed.  (ie the application has zombie'd the
   lease).

2) And more importantly as a token the vaddr_pin*() callers use to be able to
   properly ref count the file itself while in use.
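
As a rough illustration of the two lease levels, here is a toy decision
table in userspace C.  It is a sketch, not the patch code: the enum, the
function names and the TOY_TRUNC_* values are invented here.  Only the
ETXTBSY return for truncate under an exclusive lease comes from the
description above, and "truncate proceeds once a normal lease is broken"
is my reading of the breakable case.

```c
#include <errno.h>

enum toy_lease_level {
	TOY_LEASE_NONE,
	TOY_LEASE_LAYOUT,	/* "normal": SIGIO is sent, lease can be broken */
	TOY_LEASE_EXCLUSIVE,	/* can _not_ be broken */
};

#define TOY_TRUNC_OK		0	/* nothing in the way */
#define TOY_TRUNC_AFTER_BREAK	1	/* proceeds once the lease is broken */

/* Long term pins are only allowed under the exclusive level. */
int toy_can_longterm_pin(enum toy_lease_level lease)
{
	return lease == TOY_LEASE_EXCLUSIVE;
}

/* What a truncate by someone else runs into for each lease level. */
int toy_truncate(enum toy_lease_level lease)
{
	switch (lease) {
	case TOY_LEASE_EXCLUSIVE:
		return -ETXTBSY;	/* the kernel can't break the lease */
	case TOY_LEASE_LAYOUT:
		return TOY_TRUNC_AFTER_BREAK;
	default:
		return TOY_TRUNC_OK;
	}
}
```

So toy_truncate(TOY_LEASE_EXCLUSIVE) fails with -ETXTBSY, while the same
call against a normal layout lease goes ahead after the break.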

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
  2019-08-23 15:18   ` Vivek Goyal
@ 2019-08-29 18:52     ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-08-29 18:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, Michal Hocko, Jan Kara, linux-nvdimm, linux-rdma,
	John Hubbard, Dave Chinner, linux-kernel, Matthew Wilcox,
	linux-xfs, Jason Gunthorpe, linux-fsdevel, Theodore Ts'o,
	linux-ext4, linux-mm

On Fri, Aug 23, 2019 at 11:18:26AM -0400, Vivek Goyal wrote:
> On Fri, Aug 09, 2019 at 03:58:20PM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Callers of dax_layout_busy_page() are only rarely operating on the
> > entire file of concern.
> > 
> > Teach dax_layout_busy_page() to operate on a sub-range of the
> > address_space provided.  Specifying 0 - ULONG_MAX however, will continue
> > to operate on the "entire file" and XFS is split out to a separate patch
> > by this method.
> > 
> > This could potentially speed up dax_layout_busy_page() as well.
> 
> I need this functionality as well for virtio_fs and posted a patch for
> this.
> 
> https://lkml.org/lkml/2019/8/21/825
> 
> Given this is an optimization which existing users can benefit from already,
> this patch could probably be pushed upstream independently.

I'm ok with that.

However, this patch does not apply cleanly to head as I had some other
additions to dax.h.

> 
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from RFC v1
> > 	Fix 0-day build errors
> > 
> >  fs/dax.c            | 15 +++++++++++----
> >  fs/ext4/ext4.h      |  2 +-
> >  fs/ext4/extents.c   |  6 +++---
> >  fs/ext4/inode.c     | 19 ++++++++++++-------
> >  fs/xfs/xfs_file.c   |  3 ++-
> >  include/linux/dax.h |  6 ++++--
> >  6 files changed, 33 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index a14ec32255d8..3ad19c384454 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -573,8 +573,11 @@ bool dax_mapping_is_dax(struct address_space *mapping)
> >  EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
> >  
> >  /**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > + * dax_layout_busy_page - find first pinned page in @mapping within
> > + *                        the range @off - @off + @len
> >   * @mapping: address space to scan for a page with ref count > 1
> > + * @off: offset to start at
> > + * @len: length to scan through
> >   *
> >   * DAX requires ZONE_DEVICE mapped pages. These pages are never
> >   * 'onlined' to the page allocator so they are considered idle when
> > @@ -587,9 +590,13 @@ EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
> >   * to be able to run unmap_mapping_range() and subsequently not race
> >   * mapping_mapped() becoming true.
> >   */
> > -struct page *dax_layout_busy_page(struct address_space *mapping)
> > +struct page *dax_layout_busy_page(struct address_space *mapping,
> > +				  loff_t off, loff_t len)
> >  {
> > -	XA_STATE(xas, &mapping->i_pages, 0);
> > +	unsigned long start_idx = off >> PAGE_SHIFT;
> > +	unsigned long end_idx = (len == ULONG_MAX) ? ULONG_MAX
> > +				: start_idx + (len >> PAGE_SHIFT);
> > +	XA_STATE(xas, &mapping->i_pages, start_idx);
> >  	void *entry;
> >  	unsigned int scanned = 0;
> >  	struct page *page = NULL;
> > @@ -612,7 +619,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> >  	unmap_mapping_range(mapping, 0, 0, 1);
> 
> Should we unmap only those pages which fall in the range specified by caller.
> Unmapping whole file seems to be less efficient.

Seems reasonable to me.  I was focused on getting pages which were busy not
necessarily on what got unmapped.  So I did not consider this.  Thanks for the
suggestion.

However, I don't understand the math you do for length?  Is this comment/code
correct?

+  /* length is being calculated from lstart and not start.
+   * This is due to behavior of unmap_mapping_range(). If
+   * start is say 4094 and end is on 4093 then want to
+   * unamp two pages, idx 0 and 1. But unmap_mapping_range()
+   * will unmap only page at idx 0. If we calculate len
+   * from the rounded down start, this problem should not
+   * happen.
+   */
+  len = end - lstart + 1;


How can end (4093) be < start (4094)?  Is that valid?  And why would a start of
4094 unmap idx 0?
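
For reference, here is the page arithmetic as I currently read it, as a
small userspace sketch.  Assumptions: 4 KiB pages, an inclusive byte
range, and a page count obtained by rounding the length up to whole
pages (which is the behaviour the comment attributes to
unmap_mapping_range()).  The toy_* names and the numbers start = 4094,
end = 4097 are mine, not from the patch.

```c
#define TOY_PAGE_SHIFT	12			/* assumes 4 KiB pages */
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)

/* Round a byte offset down to the start of its page. */
unsigned long toy_round_down(unsigned long off)
{
	return off & ~(TOY_PAGE_SIZE - 1);
}

/* Pages actually touched by the inclusive byte range [start, end]. */
unsigned long toy_pages_spanned(unsigned long start, unsigned long end)
{
	return (end >> TOY_PAGE_SHIFT) - (start >> TOY_PAGE_SHIFT) + 1;
}

/* Page count derived from a byte length alone, rounded up. */
unsigned long toy_pages_from_len(unsigned long len)
{
	return (len + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}
```

With start = 4094 and end = 4097 the range touches pages 0 and 1.  A
length of end - start + 1 = 4 rounds up to a single page, so only idx 0
would be covered.  A length of end - toy_round_down(start) + 1 = 4098
rounds up to two pages.  If that is the intent, then "end is on 4093" in
the comment looks like a typo, but I'd like that confirmed.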

Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-26 10:41       ` Jeff Layton
@ 2019-08-29 23:34         ` Ira Weiny
  2019-09-04 12:52           ` Jeff Layton
  0 siblings, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-08-29 23:34 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Jan Kara, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

Missed this.  Sorry.

On Mon, Aug 26, 2019 at 06:41:07AM -0400, Jeff Layton wrote:
> On Thu, 2019-08-15 at 07:56 +1000, Dave Chinner wrote:
> > On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> > > On Fri, 2019-08-09 at 15:58 -0700, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > Add an exclusive lease flag which indicates that the layout mechanism
> > > > can not be broken.
> > > > 
> > > > Exclusive layout leases allow the file system to know that pages may be
> > > > GUP pinned and that attempts to change the layout, ie truncate, should be
> > > > failed.
> > > > 
> > > > A process which attempts to break its own exclusive lease gets an
> > > > EDEADLOCK return to help determine that this is likely a programming bug
> > > > vs someone else holding a resource.
> > .....
> > > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > > > index baddd54f3031..88b175ceccbc 100644
> > > > --- a/include/uapi/asm-generic/fcntl.h
> > > > +++ b/include/uapi/asm-generic/fcntl.h
> > > > @@ -176,6 +176,8 @@ struct f_owner_ex {
> > > >  
> > > >  #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
> > > >  				   RDMA */
> > > > +#define F_EXCLUSIVE	32      /* layout lease is exclusive */
> > > > +				/* FIXME or shoudl this be F_EXLCK??? */
> > > >  
> > > >  /* operations for bsd flock(), also used by the kernel implementation */
> > > >  #define LOCK_SH		1	/* shared lock */
> > > 
> > > This interface just seems weird to me. The existing F_*LCK values aren't
> > > really set up to be flags, but are enumerated values (even if there are
> > > some gaps on some arches). For instance, on parisc and sparc:
> > 
> > I don't think we need to worry about this - the F_WRLCK version of
> > the layout lease should have these exclusive access semantics (i.e
> > other ops fail rather than block waiting for lease recall) and hence
> > the API shouldn't need a new flag to specify them.
> > 
> > i.e. the primary difference between F_RDLCK and F_WRLCK layout
> > leases is that the F_RDLCK is a shared, co-operative lease model
> > where only delays in operations will be seen, while F_WRLCK is a
> > "guarantee exclusive access and I don't care what it breaks"
> > model... :)
> > 
> 
> Not exactly...
> 
> F_WRLCK and F_RDLCK leases can both be broken, and will eventually time
> out if there is conflicting access. The F_EXCLUSIVE flag on the other
> hand is there to prevent any sort of lease break from 

Right, EXCLUSIVE will not break for any reason.  It will fail truncate and hole
punch as we discussed back in June.  This is for the use case where the user
has handed this file/pages off to some hardware for which removing the lease
would be impossible.  _And_ we don't anticipate any valid use case in which
someone needs to truncate, short of killing the process to free up file
system space.

> 
> I'm guessing what Ira really wants with the F_EXCLUSIVE flag is
> something akin to what happens when we set fl_break_time to 0 in the
> nfsd code. nfsd never wants the locks code to time out a lease of any
> sort, since it handles that timeout itself.
> 
> If you're going to add this functionality, it'd be good to also convert
> knfsd to use it as well, so we don't end up with multiple ways to deal
> with that situation.

Could you point me at the source for knfsd?  I looked in 

git://git.linux-nfs.org/projects/steved/nfs-utils.git

but I don't see anywhere leases are used in that source?

Thanks,
Ira


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-08-29  2:02                                         ` Ira Weiny
  2019-08-29  3:27                                           ` John Hubbard
@ 2019-09-02 22:26                                           ` Dave Chinner
  2019-09-04 16:54                                             ` Ira Weiny
  1 sibling, 1 reply; 110+ messages in thread
From: Dave Chinner @ 2019-09-02 22:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > "Leases are associated with an open file description (see open(2)).  This means
> > > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > > refer to the same lease, and this lease may be modified or released using any
> > > of these descriptors.  Furthermore,  the lease is released by either an
> > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > > all such file descriptors have been closed."
> > 
> > Right, the lease is attached to the struct file, so it follows
> > where-ever the struct file goes. That doesn't mean it's actually
> > useful when the struct file is duplicated and/or passed to another
> > process. :/
> > 
> > AFAICT, the problem is that when we take another reference to the
> > struct file, or when the struct file is passed to a different
> > process, nothing updates the lease or lease state attached to that
> > struct file.
> 
> Ok, I probably should have made this more clear in the cover letter but _only_
> the process which took the lease can actually pin memory.

Sure, no question about that.

> That pinned memory _can_ be passed to another process but those sub-process' can
> _not_ use the original lease to pin _more_ of the file.  They would need to
> take their own lease to do that.

Yes, they would need a new lease to extend it. But that ignores the
fact they don't have a lease on the existing pins they are using and
have no control over the lease those pins originated under.  e.g.
the originating process dies (for whatever reason) and now we have
pins without a valid lease holder.

If something else now takes an exclusive lease on the file (because
the original exclusive lease no longer exists), it's not going to
work correctly because of the zombied page pins caused by closing
the exclusive lease they were gained under. IOWs, pages pinned under
an exclusive lease are no longer "exclusive" the moment the original
exclusive lease is dropped, and pins passed to another process are
no longer covered by the original lease they were created under.

> Sorry for not being clear on that.

I know exactly what you are saying. What I'm failing to get across
is that file layout leases don't actually allow the behaviour you
want to have.
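
For reference, the existing fcntl(2) behaviour quoted above (a lease belongs
to the open file description, so any duplicate descriptor can drop it) can be
checked from userspace with something like the sketch below.  It only
exercises today's F_RDLCK leases, not the proposed layout leases, and
lease_shared_across_dup() plus the scratch file under /tmp are names made up
for this illustration.

```c
#define _GNU_SOURCE		/* F_SETLEASE / F_GETLEASE are Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Fallback in case _GNU_SOURCE was defined too late to take effect. */
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#define F_GETLEASE 1025
#endif

/*
 * Returns 1 if a lease taken on one descriptor is visible on, and released
 * through, a dup(2) of that descriptor; 0 if it is not; -1 if the lease
 * could not be set up at all (e.g. the filesystem refuses leases).
 */
int lease_shared_across_dup(void)
{
	char path[] = "/tmp/lease-demo-XXXXXX";	/* hypothetical scratch file */
	int fd, fd2, ret = -1;

	fd = mkstemp(path);		/* file is owned by the caller */
	if (fd < 0)
		return -1;
	close(fd);			/* a read lease requires no writers */

	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto out_unlink;

	fd2 = dup(fd);			/* same open file description as fd */
	if (fd2 < 0)
		goto out_close;

	if (fcntl(fd, F_SETLEASE, F_RDLCK) != 0)
		goto out_close2;

	ret = 0;
	/* The duplicate sees the lease, can drop it, and fd then sees none. */
	if (fcntl(fd2, F_GETLEASE) == F_RDLCK &&
	    fcntl(fd2, F_SETLEASE, F_UNLCK) == 0 &&
	    fcntl(fd, F_GETLEASE) == F_UNLCK)
		ret = 1;

out_close2:
	close(fd2);
out_close:
	close(fd);
out_unlink:
	unlink(path);
	return ret;
}
```

Nothing in this sequence records which process or descriptor took the lease,
which is the gap being pointed out for pins handed to another process.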

> > As such, leases that require callbacks to userspace are currently
> > only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.

Regardless, we still require an active lease for long term pins so
that other lease holders fail operations appropriately. And that
exclusive lease must follow the process that pins the pages so that
the life cycle is the same...

> > Indeed, even closing the fd the lease was taken on without
> > F_UNLCKing it first doesn't mean the lease has been torn down if
> > there is some other reference to the struct file. That means the
> > original lease owner will still get SIGIO delivered to that fd on a
> > lease break regardless of whether it is open or not. ANd if we
> > implement "layout lease not released within SIGIO response timeout"
> > then that process will get killed, despite the fact it may not even
> > have a reference to that file anymore.
> 
> I'm not seeing that as a problem.  This is all a result of the application
> failing to do the right thing.

How is that not a problem?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-29 23:34         ` Ira Weiny
@ 2019-09-04 12:52           ` Jeff Layton
  0 siblings, 0 replies; 110+ messages in thread
From: Jeff Layton @ 2019-09-04 12:52 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Andrew Morton, Jason Gunthorpe, Dan Williams,
	Matthew Wilcox, Jan Kara, Theodore Ts'o, John Hubbard,
	Michal Hocko, linux-xfs, linux-rdma, linux-kernel, linux-fsdevel,
	linux-nvdimm, linux-ext4, linux-mm

On Thu, 2019-08-29 at 16:34 -0700, Ira Weiny wrote:
> Missed this.  sorry.
> 
> On Mon, Aug 26, 2019 at 06:41:07AM -0400, Jeff Layton wrote:
> > On Thu, 2019-08-15 at 07:56 +1000, Dave Chinner wrote:
> > > On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> > > > On Fri, 2019-08-09 at 15:58 -0700, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > 
> > > > > Add an exclusive lease flag which indicates that the layout mechanism
> > > > > can not be broken.
> > > > > 
> > > > > Exclusive layout leases allow the file system to know that pages may be
> > > > > GUP pined and that attempts to change the layout, ie truncate, should be
> > > > > failed.
> > > > > 
> > > > > A process which attempts to break it's own exclusive lease gets an
> > > > > EDEADLOCK return to help determine that this is likely a programming bug
> > > > > vs someone else holding a resource.
> > > .....
> > > > > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > > > > index baddd54f3031..88b175ceccbc 100644
> > > > > --- a/include/uapi/asm-generic/fcntl.h
> > > > > +++ b/include/uapi/asm-generic/fcntl.h
> > > > > @@ -176,6 +176,8 @@ struct f_owner_ex {
> > > > >  
> > > > >  #define F_LAYOUT	16      /* layout lease to allow longterm pins such as
> > > > >  				   RDMA */
> > > > > +#define F_EXCLUSIVE	32      /* layout lease is exclusive */
> > > > > +				/* FIXME or shoudl this be F_EXLCK??? */
> > > > >  
> > > > >  /* operations for bsd flock(), also used by the kernel implementation */
> > > > >  #define LOCK_SH		1	/* shared lock */
> > > > 
> > > > This interface just seems weird to me. The existing F_*LCK values aren't
> > > > really set up to be flags, but are enumerated values (even if there are
> > > > some gaps on some arches). For instance, on parisc and sparc:
> > > 
> > > I don't think we need to worry about this - the F_WRLCK version of
> > > the layout lease should have these exclusive access semantics (i.e
> > > other ops fail rather than block waiting for lease recall) and hence
> > > the API shouldn't need a new flag to specify them.
> > > 
> > > i.e. the primary difference between F_RDLCK and F_WRLCK layout
> > > leases is that the F_RDLCK is a shared, co-operative lease model
> > > where only delays in operations will be seen, while F_WRLCK is a
> > > "guarantee exclusive access and I don't care what it breaks"
> > > model... :)
> > > 
> > 
> > Not exactly...
> > 
> > F_WRLCK and F_RDLCK leases can both be broken, and will eventually time
> > out if there is conflicting access. The F_EXCLUSIVE flag on the other
> > hand is there to prevent any sort of lease break from 
> 
> Right EXCLUSIVE will not break for any reason.  It will fail truncate and hole
> punch as we discussed back in June.  This is for the use case where the user
> has handed this file/pages off to some hardware for which removing the lease
> would be impossible.  _And_ we don't anticipate any valid use case that someone
> will need to truncate short of killing the process to free up file system
> space.
> 
> > I'm guessing what Ira really wants with the F_EXCLUSIVE flag is
> > something akin to what happens when we set fl_break_time to 0 in the
> > nfsd code. nfsd never wants the locks code to time out a lease of any
> > sort, since it handles that timeout itself.
> > 
> > If you're going to add this functionality, it'd be good to also convert
> > knfsd to use it as well, so we don't end up with multiple ways to deal
> > with that situation.
> 
> Could you point me at the source for knfsd?  I looked in 
> 
> git://git.linux-nfs.org/projects/steved/nfs-utils.git
> 
> but I don't see anywhere leases are used in that source?
> 

Ahh, sorry, that wasn't clear. It's the fs/nfsd directory in the Linux
kernel sources. See nfsd4_layout_lm_break and nfsd_break_deleg_cb in
particular.
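
To illustrate the "break time of 0 means the lease never times out" idea
mentioned above, here is a small userspace model.  It is only a sketch:
struct lease_model and lease_break_expired() are made-up names rather than
kernel symbols, and the comparison is a simplified stand-in for the kernel's
wraparound-safe jiffies arithmetic.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the break deadline kept on a lease. */
struct lease_model {
	unsigned long break_time;	/* 0 means "never expires" */
};

/*
 * Returns true once 'now' has passed the break deadline.  A deadline of 0
 * is treated as special: such a lease is never timed out by this check,
 * leaving any timeout handling to the lease owner.
 */
bool lease_break_expired(const struct lease_model *l, unsigned long now)
{
	if (l->break_time == 0)
		return false;

	/* Wraparound-safe "now is after break_time" comparison. */
	return (long)(l->break_time - now) < 0;
}
```

With break_time left at 0 the function returns false for every value of
'now', which is the property an unbreakable layout lease would also need.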

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
  2019-09-02 22:26                                           ` Dave Chinner
@ 2019-09-04 16:54                                             ` Ira Weiny
  0 siblings, 0 replies; 110+ messages in thread
From: Ira Weiny @ 2019-09-04 16:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jason Gunthorpe, Jan Kara, Andrew Morton, Dan Williams,
	Matthew Wilcox, Theodore Ts'o, John Hubbard, Michal Hocko,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > > "Leases are associated with an open file description (see open(2)).  This means
> > > > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > > > refer to the same lease, and this lease may be modified or released using any
> > > > of these descriptors.  Furthermore,  the lease is released by either an
> > > > explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> > > > all such file descriptors have been closed."
> > > 
> > > Right, the lease is attached to the struct file, so it follows
> > > where-ever the struct file goes. That doesn't mean it's actually
> > > useful when the struct file is duplicated and/or passed to another
> > > process. :/
> > > 
> > > AFAICT, the problem is that when we take another reference to the
> > > struct file, or when the struct file is passed to a different
> > > process, nothing updates the lease or lease state attached to that
> > > struct file.
> > 
> > Ok, I probably should have made this more clear in the cover letter but _only_
> > the process which took the lease can actually pin memory.
> 
> Sure, no question about that.
> 
> > That pinned memory _can_ be passed to another process but those sub-process' can
> > _not_ use the original lease to pin _more_ of the file.  They would need to
> > take their own lease to do that.
> 
> Yes, they would need a new lease to extend it. But that ignores the
> fact they don't have a lease on the existing pins they are using and
> have no control over the lease those pins originated under.  e.g.
> the originating process dies (for whatever reason) and now we have
> pins without a valid lease holder.

Define "valid lease holder"?

> 
> If something else now takes an exclusive lease on the file (because
> the original exclusive lease no longer exists), it's not going to
> work correctly because of the zombied page pins caused by closing
> the exclusive lease they were gained under. IOWs, pages pinned under
> an exclusive lease are no longer "exclusive" the moment the original
> exclusive lease is dropped, and pins passed to another process are
> no longer covered by the original lease they were created under.

The page pins are not zombied; the lease is.  The lease still exists, it can't
be dropped while the pins are in place.  I need to double check the
implementation but that was the intent.

Yep just did a quick check, I have a test for that.  If the page pins exist
then the lease can _not_ be released.  Closing the FD will "zombie" the lease
but it and the struct file will still exist until the pins go away.

Furthermore, a "zombie" lease is _not_ sufficient to pin more pages.  (I have a
test for this too.)  I apologize that I don't have something to submit to
xfstests.  I'm new to that code base.
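
In the meantime, the semantics I am describing can be summarized as a tiny
state model.  This is not the code from the series, just a sketch of the
intended rules; every name in it (layout_lease_model, model_pin(), and so on)
is invented for illustration.

```c
#include <stdbool.h>

enum lease_state { LEASE_NONE, LEASE_ACTIVE, LEASE_ZOMBIE };

/* Hypothetical model of one exclusive layout lease and its page pins. */
struct layout_lease_model {
	enum lease_state state;
	unsigned int pins;
};

void model_take_lease(struct layout_lease_model *m)
{
	m->state = LEASE_ACTIVE;
	m->pins = 0;
}

/* Pinning needs an active lease; a zombie lease is not sufficient. */
bool model_pin(struct layout_lease_model *m)
{
	if (m->state != LEASE_ACTIVE)
		return false;
	m->pins++;
	return true;
}

/* Closing the FD with pins outstanding leaves a zombie lease behind. */
void model_close_fd(struct layout_lease_model *m)
{
	if (m->state == LEASE_ACTIVE)
		m->state = m->pins ? LEASE_ZOMBIE : LEASE_NONE;
}

/* A zombie lease goes away only when the last pin is dropped. */
void model_unpin(struct layout_lease_model *m)
{
	if (m->pins == 0)
		return;
	if (--m->pins == 0 && m->state == LEASE_ZOMBIE)
		m->state = LEASE_NONE;
}

/* Truncate (or hole punch) fails while any pin exists. */
bool model_truncate(const struct layout_lease_model *m)
{
	return m->pins == 0;
}
```

The sequence take lease, pin, close FD, pin again should fail at the last
step, and truncate should keep failing until the final unpin.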

I'm happy to share the code I have which I've been using to test...  But it is
pretty rough as it has undergone a number of changes.  I think it would be
better to convert my test series to xfstests.

However, I don't know if it is ok to require RDMA within those tests.  Right
now that is the only sub-system I have allowed to create these page pins.  So
I'm not sure what to do at this time.  I'm open to suggestions.

> 
> > Sorry for not being clear on that.
> 
> I know exactly what you are saying. What I'm failing to get across
> is that file layout leases don't actually allow the behaviour you
> want to have.

Not currently, no.  But we are discussing the semantics to allow them _to_ have
the behavior needed.

> 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> 
> Regardless, we still require an active lease for long term pins so
> that other lease holders fail operations appropriately. And that
> exclusive lease must follow the process that pins the pages so that
> the life cycle is the same...

I disagree.  See below.

> 
> > > Indeed, even closing the fd the lease was taken on without
> > > F_UNLCKing it first doesn't mean the lease has been torn down if
> > > there is some other reference to the struct file. That means the
> > > original lease owner will still get SIGIO delivered to that fd on a
> > > lease break regardless of whether it is open or not. ANd if we
> > > implement "layout lease not released within SIGIO response timeout"
> > > then that process will get killed, despite the fact it may not even
> > > have a reference to that file anymore.
> > 
> > I'm not seeing that as a problem.  This is all a result of the application
> > failing to do the right thing.
> 
> How is that not a problem?

The application has taken an exclusive lease and they don't have to let it go.

IOW, there is little difference between the application closing the FD and
creating a zombie lease vs keeping the FD open with a real lease, because no
SIGIO is sent and there is no need to react to it anyway: the intention is to
keep the lease active and the layout pinned "indefinitely".

Furthermore, in both cases the admin must kill the application to change the
layout forcibly.  Basically applications don't _have_ to do the right thing, but
the kernel and the filesystem are still protected, and the admin has a way to
correct the situation given a bad application.

Therefore, from the POV of the kernel and file system I don't see a problem.

Ira



* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-08-14 12:23                   ` Jason Gunthorpe
  2019-08-14 17:50                     ` Ira Weiny
@ 2019-09-04 22:25                     ` Ira Weiny
  2019-09-11  8:19                       ` Jason Gunthorpe
  1 sibling, 1 reply; 110+ messages in thread
From: Ira Weiny @ 2019-09-04 22:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > 
> > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > that some other thread is) destroying all the MR's we have associated with this
> > > > FD.
> > > 
> > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > deletes any underlying HW resources, but the FD persists.
> > 
> > I misspoke.  I should have said associated with this "context".  And of course
> > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > struct file which had file_pins hanging off of it would be getting its file
> > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > after uverbs_destroy_ufile_hw() is done.
> > 
> > But since it does not block it may be that the struct file is gone before the
> > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > that case...  :-(
> 
> Oh, yes, that is true, you also can't rely on the struct file living
> longer than the HW objects either, that isn't how the lifetime model
> works.

Reviewing all these old threads...  And this made me think.  While the HW
objects may outlive the struct file, they _are_ going away in a finite amount
of time, right?  It is not like they could be held forever, right?

Ira

> 
> If GUP consumes the struct file it must allow the struct file to be
> deleted before the GUP pin is released.
> 
> > The drivers could provide some generic object (in RDMA this could be the
> > uverbs_attr_bundle) which represents their "context".
> 
> For RDMA the obvious context is the struct ib_mr *
> 
> > But for the procfs interface, that context then needs to be associated with any
> > file which points to it...  For RDMA, or any other "FD based pin mechanism", it
> > would be up to the driver to "install" a procfs handler into any struct file
> > which _may_ point to this context.  (before _or_ after memory pins).
> 
> Is this all just for debugging? Seems like a lot of complication just
> to print a string
> 
> Generally, I think you'd be better to associate things with the
> mm_struct not some struct file... The whole design is simpler as GUP
> already has the mm_struct.
> 
> Jason


* Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease
  2019-08-09 22:58 ` [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease ira.weiny
  2019-08-14 14:15   ` Jeff Layton
@ 2019-09-04 23:12   ` John Hubbard
  1 sibling, 0 replies; 110+ messages in thread
From: John Hubbard @ 2019-09-04 23:12 UTC (permalink / raw)
  To: ira.weiny, Andrew Morton
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, Michal Hocko, Dave Chinner, linux-xfs,
	linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On 8/9/19 3:58 PM, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Add an exclusive lease flag which indicates that the layout mechanism
> can not be broken.

After studying the rest of these discussions extensively, I think in all
cases FL_EXCLUSIVE is better named "unbreakable", rather than exclusive.

If you read your sentence above, it basically reinforces that idea: "add an
exclusive flag to mean it is unbreakable" is a bit of a disconnect. It 
would be better to say,

Add an "unbreakable" lease flag which indicates that the layout lease
cannot be broken.

Furthermore, while this may or may not be a way forward on the "we cannot
have more than one process take a layout lease on a file/range", it at
least stops making it impossible. In other words, no one is going to
write a patch that allows sharing an exclusive layout lease--but someone
might well update some of these patches here to make it possible to
have multiple processes take unbreakable leases on the same file/range.

I haven't worked through everything there yet, but again:

* FL_UNBREAKABLE is the name you're looking for here, and

* I think we want to allow multiple processes to take FL_UNBREAKABLE
leases on the same file/range, so that we can make RDMA setups
reasonable. By "reasonable" I mean, "no need to have a lead process
that owns all the leases".



thanks,
-- 
John Hubbard
NVIDIA


* Re: [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object
  2019-09-04 22:25                     ` Ira Weiny
@ 2019-09-11  8:19                       ` Jason Gunthorpe
  0 siblings, 0 replies; 110+ messages in thread
From: Jason Gunthorpe @ 2019-09-11  8:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Andrew Morton, Dan Williams, Matthew Wilcox, Jan Kara,
	Theodore Ts'o, John Hubbard, Michal Hocko, Dave Chinner,
	linux-xfs, linux-rdma, linux-kernel, linux-fsdevel, linux-nvdimm,
	linux-ext4, linux-mm

On Wed, Sep 04, 2019 at 03:25:50PM -0700, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 09:23:08AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 13, 2019 at 01:38:59PM -0700, Ira Weiny wrote:
> > > On Tue, Aug 13, 2019 at 03:00:22PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 13, 2019 at 10:41:42AM -0700, Ira Weiny wrote:
> > > > 
> > > > > And I was pretty sure uverbs_destroy_ufile_hw() would take care of (or ensure
> > > > > that some other thread is) destroying all the MR's we have associated with this
> > > > > FD.
> > > > 
> > > > fd's can't be revoked, so destroy_ufile_hw() can't touch them. It
> > > > deletes any underlying HW resources, but the FD persists.
> > > 
> > > I misspoke.  I should have said associated with this "context".  And of course
> > > uverbs_destroy_ufile_hw() does not touch the FD.  What I mean is that the
> > > struct file which had file_pins hanging off of it would be getting its file
> > > pins destroyed by uverbs_destroy_ufile_hw().  Therefore we don't need the FD
> > > after uverbs_destroy_ufile_hw() is done.
> > > 
> > > But since it does not block it may be that the struct file is gone before the
> > > MR is actually destroyed.  Which means I think the GUP code would blow up in
> > > that case...  :-(
> > 
> > Oh, yes, that is true, you also can't rely on the struct file living
> > longer than the HW objects either, that isn't how the lifetime model
> > works.
> 
> Reviewing all these old threads...  And this made me think.  While the HW
> objects may out live the struct file.
> 
> They _are_ going away in a finite amount of time right?  It is not like they
> could be held forever right?

Yes, at least until they become shared between FDs

Jason

Thread overview: 110+ messages
2019-08-09 22:58 [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space ira.weiny
2019-08-09 23:52   ` Dave Chinner
2019-08-12 17:36     ` Ira Weiny
2019-08-14  8:05       ` Dave Chinner
2019-08-14 11:21         ` Jeff Layton
2019-08-14 11:38           ` Dave Chinner
2019-08-09 22:58 ` [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease ira.weiny
2019-08-14 14:15   ` Jeff Layton
2019-08-14 21:56     ` Dave Chinner
2019-08-26 10:41       ` Jeff Layton
2019-08-29 23:34         ` Ira Weiny
2019-09-04 12:52           ` Jeff Layton
2019-09-04 23:12   ` John Hubbard
2019-08-09 22:58 ` [RFC PATCH v2 03/19] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 04/19] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 05/19] fs/ext4: Teach ext4 to break layout leases ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 06/19] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
2019-08-23 15:18   ` Vivek Goyal
2019-08-29 18:52     ` Ira Weiny
2019-08-09 22:58 ` [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
2019-08-09 23:30   ` Dave Chinner
2019-08-12 18:05     ` Ira Weiny
2019-08-14  8:04       ` Dave Chinner
2019-08-09 22:58 ` [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken ira.weiny
2019-08-09 23:22   ` Dave Chinner
2019-08-12 18:08     ` Ira Weiny
2019-08-09 22:58 ` [RFC PATCH v2 09/19] mm/gup: Introduce vaddr_pin structure ira.weiny
2019-08-10  0:06   ` John Hubbard
2019-08-09 22:58 ` [RFC PATCH v2 10/19] mm/gup: Pass a NULL vaddr_pin through GUP fast ira.weiny
2019-08-10  0:06   ` John Hubbard
2019-08-09 22:58 ` [RFC PATCH v2 11/19] mm/gup: Pass follow_page_context further down the call stack ira.weiny
2019-08-10  0:18   ` John Hubbard
2019-08-12 19:01     ` Ira Weiny
2019-08-09 22:58 ` [RFC PATCH v2 12/19] mm/gup: Prep put_user_pages() to take an vaddr_pin struct ira.weiny
2019-08-10  0:30   ` John Hubbard
2019-08-12 20:46     ` Ira Weiny
2019-08-09 22:58 ` [RFC PATCH v2 13/19] {mm,file}: Add file_pins objects ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 14/19] fs/locks: Associate file pins while performing GUP ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 15/19] mm/gup: Introduce vaddr_pin_pages() ira.weiny
2019-08-10  0:09   ` John Hubbard
2019-08-12 21:00     ` Ira Weiny
2019-08-12 21:20       ` John Hubbard
2019-08-11 23:07   ` John Hubbard
2019-08-12 21:01     ` Ira Weiny
2019-08-12 12:28   ` Jason Gunthorpe
2019-08-12 21:48     ` Ira Weiny
2019-08-13 11:47       ` Jason Gunthorpe
2019-08-13 17:46         ` Ira Weiny
2019-08-13 17:56           ` John Hubbard
2019-08-09 22:58 ` [RFC PATCH v2 16/19] RDMA/uverbs: Add back pointer to system file object ira.weiny
2019-08-12 13:00   ` Jason Gunthorpe
2019-08-12 17:28     ` Ira Weiny
2019-08-12 17:56       ` Jason Gunthorpe
2019-08-12 21:15         ` Ira Weiny
2019-08-13 11:48           ` Jason Gunthorpe
2019-08-13 17:41             ` Ira Weiny
2019-08-13 18:00               ` Jason Gunthorpe
2019-08-13 20:38                 ` Ira Weiny
2019-08-14 12:23                   ` Jason Gunthorpe
2019-08-14 17:50                     ` Ira Weiny
2019-08-14 18:15                       ` Jason Gunthorpe
2019-09-04 22:25                     ` Ira Weiny
2019-09-11  8:19                       ` Jason Gunthorpe
2019-08-09 22:58 ` [RFC PATCH v2 17/19] RDMA/umem: Convert to vaddr_[pin|unpin]* operations ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 18/19] {mm,procfs}: Add display file_pins proc ira.weiny
2019-08-09 22:58 ` [RFC PATCH v2 19/19] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
2019-08-14 10:17 ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Jan Kara
2019-08-14 18:08   ` Ira Weiny
2019-08-15 13:05     ` Jan Kara
2019-08-16 19:05       ` Ira Weiny
2019-08-16 23:20         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -) Ira Weiny
2019-08-19  6:36           ` Jan Kara
2019-08-17  2:26         ` [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-) Dave Chinner
2019-08-19  6:34           ` Jan Kara
2019-08-19  9:24             ` Dave Chinner
2019-08-19 12:38               ` Jason Gunthorpe
2019-08-19 21:53                 ` Ira Weiny
2019-08-20  1:12                 ` Dave Chinner
2019-08-20 11:55                   ` Jason Gunthorpe
2019-08-21 18:02                     ` Ira Weiny
2019-08-21 18:13                       ` Jason Gunthorpe
2019-08-21 18:22                         ` John Hubbard
2019-08-21 18:57                         ` Ira Weiny
2019-08-21 19:06                           ` Ira Weiny
2019-08-21 19:48                           ` Jason Gunthorpe
2019-08-21 20:44                             ` Ira Weiny
2019-08-21 23:49                               ` Jason Gunthorpe
2019-08-23  3:23                               ` Dave Chinner
2019-08-23 12:04                                 ` Jason Gunthorpe
2019-08-24  0:11                                   ` Dave Chinner
2019-08-24  5:08                                     ` Ira Weiny
2019-08-26  5:55                                       ` Dave Chinner
2019-08-29  2:02                                         ` Ira Weiny
2019-08-29  3:27                                           ` John Hubbard
2019-08-29 16:16                                             ` Ira Weiny
2019-09-02 22:26                                           ` Dave Chinner
2019-09-04 16:54                                             ` Ira Weiny
2019-08-25 19:39                                     ` Jason Gunthorpe
2019-08-24  4:49                                 ` Ira Weiny
2019-08-25 19:40                                   ` Jason Gunthorpe
2019-08-23  0:59                       ` Dave Chinner
2019-08-23 17:15                         ` Ira Weiny
2019-08-24  0:18                           ` Dave Chinner
2019-08-20  0:05               ` John Hubbard
2019-08-20  1:20                 ` Dave Chinner
2019-08-20  3:09                   ` John Hubbard
2019-08-20  3:36                     ` Dave Chinner
2019-08-21 18:43                       ` John Hubbard
2019-08-21 19:09                         ` Ira Weiny
