nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@kernel.org>
To: ira.weiny@intel.com, Dan Williams <dan.j.williams@intel.com>,
	Jan Kara <jack@suse.cz>, Theodore Ts'o <tytso@mit.edu>,
	Dave Chinner <david@fromorbit.com>
Cc: linux-nvdimm@lists.01.org, "John Hubbard" <jhubbard@nvidia.com>,
	linux-kernel@vger.kernel.org,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space
Date: Sun, 09 Jun 2019 09:00:24 -0400	[thread overview]
Message-ID: <4e5eb31a41b91a28fbc83c65195a2c75a59cfa24.camel@kernel.org> (raw)
In-Reply-To: <20190606014544.8339-3-ira.weiny@intel.com>

On Wed, 2019-06-05 at 18:45 -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> GUP longterm pins of non-pagecache file system pages (eg FS DAX) are
> currently disallowed because they are unsafe.
> 
> The danger for pinning these pages comes from the fact that hole punch
> and/or truncate of those files results in the pages being mapped and
> pinned by a user space process while DAX has potentially allocated those
> pages to other processes.
> 
> Most (All) users who are mapping FS DAX pages for long term pin purposes
> (such as RDMA) are not going to want to deallocate these pages while
> those pages are in use.  To do so would mean the application would lose
> data.  So the use case for allowing truncate operations of such pages
> is limited.
> 
> However, the kernel must protect itself and users from potential
> mistakes and/or malicious user space code.  Rather than disabling long
> term pins as is done now.   Allow for users who know they are going to
> be pinning this memory to alert the file system of this intention.
> Furthermore, allow users to be alerted such that they can react if a
> truncate operation occurs for some reason.
> 
> Example user space pseudocode for a user using RDMA and wanting to allow
> a truncate would look like this:
> 
> lease_break_sigio_handler() {
> ...
> 	if (sigio.fd == rdma_fd) {
> 		complete_rdma_operations(...);
> 		ibv_dereg_mr(mr);
> 		close(rdma_fd);
> 		fcntl(rdma_fd, F_SETLEASE, F_UNLCK);
> 	}
> }
> 
> setup_rdma_to_dax_file() {
> ...
> 	rdma_fd = open(...)
> 	fcntl(rdma_fd, F_SETLEASE, F_LAYOUT);

I'm not crazy about this interface. F_LAYOUT doesn't seem to be in the
same category as F_RDLCK/F_WRLCK/F_UNLCK.

Maybe instead of F_SETLEASE, this should use new
F_SETLAYOUT/F_GETLAYOUT cmd values? There is nothing that would prevent
you from setting both a lease and a layout on a file, and indeed knfsd
can set both.

This interface seems to conflate the two.

> 	sigaction(SIGIO, ...  lease_break ...);
> 	ptr = mmap(rdma_fd, ...);
> 	mr = ibv_reg_mr(ptr, ...);
> 	do_rdma_stuff(...);
> }
> 
> Follow on patches implement the notification of the lease holder on
> truncate as well as failing the truncate if the GUP pin is not released.
> 
> This first patch exports the F_LAYOUT lease type and allows the user to set
> and get it.
> 
> After the complete series:
> 
> 1) Failure to obtain a F_LAYOUT lease on an open FS DAX file will result
>    in a failure to GUP pin any pages in that file.  An example of a call
>    which results in GUP pin is ibv_reg_mr().
> 2) While the GUP pin is in place (eg MR is in use) truncates of the
>    affected pages will fail.
> 3) If the user registers a sigaction they will be notified of the
>    truncate so they can react.  Failure to react will result in the
>    lease being revoked after <sysfs>/lease-break-time seconds.  After
>    this time new GUP pins will fail without a new lease being taken.
> 4) A truncate will work if the pages being truncated are not actively
>    pinned at the time of truncate.  Attempts to pin these pages after
>    will result in a failure.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/locks.c                       | 36 +++++++++++++++++++++++++++-----
>  include/linux/fs.h               |  2 +-
>  include/uapi/asm-generic/fcntl.h |  3 +++
>  3 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 0cc2b9f30e22..de9761c068de 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -191,6 +191,8 @@ static int target_leasetype(struct file_lock *fl)
>  		return F_UNLCK;
>  	if (fl->fl_flags & FL_DOWNGRADE_PENDING)
>  		return F_RDLCK;
> +	if (fl->fl_flags & FL_LAYOUT)
> +		return F_LAYOUT;
>  	return fl->fl_type;
>  }
>  
> @@ -611,7 +613,8 @@ static const struct lock_manager_operations lease_manager_ops = {
>  /*
>   * Initialize a lease, use the default lock manager operations
>   */
> -static int lease_init(struct file *filp, long type, struct file_lock *fl)
> +static int lease_init(struct file *filp, long type, unsigned int flags,
> +		      struct file_lock *fl)
>  {
>  	if (assign_type(fl, type) != 0)
>  		return -EINVAL;
> @@ -621,6 +624,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
>  
>  	fl->fl_file = filp;
>  	fl->fl_flags = FL_LEASE;
> +	if (flags & FL_LAYOUT)
> +		fl->fl_flags |= FL_LAYOUT;
>  	fl->fl_start = 0;
>  	fl->fl_end = OFFSET_MAX;
>  	fl->fl_ops = NULL;
> @@ -629,7 +634,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
>  }
>  
>  /* Allocate a file_lock initialised to this type of lease */
> -static struct file_lock *lease_alloc(struct file *filp, long type)
> +static struct file_lock *lease_alloc(struct file *filp, long type,
> +				     unsigned int flags)
>  {
>  	struct file_lock *fl = locks_alloc_lock();
>  	int error = -ENOMEM;
> @@ -637,7 +643,7 @@ static struct file_lock *lease_alloc(struct file *filp, long type)
>  	if (fl == NULL)
>  		return ERR_PTR(error);
>  
> -	error = lease_init(filp, type, fl);
> +	error = lease_init(filp, type, flags, fl);
>  	if (error) {
>  		locks_free_lock(fl);
>  		return ERR_PTR(error);
> @@ -1588,7 +1594,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>  	int want_write = (mode & O_ACCMODE) != O_RDONLY;
>  	LIST_HEAD(dispose);
>  
> -	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK);
> +	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK, 0);
>  	if (IS_ERR(new_fl))
>  		return PTR_ERR(new_fl);
>  	new_fl->fl_flags = type;
> @@ -1725,6 +1731,8 @@ EXPORT_SYMBOL(lease_get_mtime);
>   *
>   *	%F_UNLCK to indicate no lease is held.
>   *
> + *	%F_LAYOUT to indicate a layout lease is held.
> + *
>   *	(if a lease break is pending):
>   *
>   *	%F_RDLCK to indicate an exclusive lease needs to be
> @@ -2015,8 +2023,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
>  	struct file_lock *fl;
>  	struct fasync_struct *new;
>  	int error;
> +	unsigned int flags = 0;
> +
> +	/*
> +	 * NOTE on F_LAYOUT lease
> +	 *
> +	 * LAYOUT lease types are taken on files which the user knows that
> +	 * they will be pinning in memory for some indeterminate amount of
> +	 * time.  Such as for use with RDMA.  While we don't know what user
> +	 * space is going to do with the file we still use a F_RDLOCK level of
> +	 * lease.  This ensures that there are no conflicts between
> +	 * 2 users.  The conflict should only come from the File system wanting
> +	 * to revoke the lease in break_layout()  And this is done by using
> +	 * F_WRLCK in the break code.
> +	 */
> +	if (arg == F_LAYOUT) {
> +		arg = F_RDLCK;
> +		flags = FL_LAYOUT;
> +	}
>  
> -	fl = lease_alloc(filp, arg);
> +	fl = lease_alloc(filp, arg, flags);
>  	if (IS_ERR(fl))
>  		return PTR_ERR(fl);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d..9e9d8d35ee93 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -998,7 +998,7 @@ static inline struct file *get_file(struct file *f)
>  #define FL_DOWNGRADE_PENDING	256 /* Lease is being downgraded */
>  #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
>  #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
> -#define FL_LAYOUT	2048	/* outstanding pNFS layout */
> +#define FL_LAYOUT	2048	/* outstanding pNFS layout or user held pin */
>  
>  #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
>  
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..baddd54f3031 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -174,6 +174,9 @@ struct f_owner_ex {
>  #define F_SHLCK		8	/* or 4 */
>  #endif
>  
> +#define F_LAYOUT	16      /* layout lease to allow longterm pins such as
> +				   RDMA */
> +
>  /* operations for bsd flock(), also used by the kernel implementation */
>  #define LOCK_SH		1	/* shared lock */
>  #define LOCK_EX		2	/* exclusive lock */

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

  reply	other threads:[~2019-06-09 13:00 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-06  1:45 [PATCH RFC 00/10] RDMA/FS DAX truncate proposal ira.weiny
2019-06-06  1:45 ` [PATCH RFC 01/10] fs/locks: Add trace_leases_conflict ira.weiny
2019-06-09 12:52   ` Jeff Layton
2019-06-06  1:45 ` [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space ira.weiny
2019-06-09 13:00   ` Jeff Layton [this message]
2019-06-11 21:38     ` Ira Weiny
2019-06-12  9:46       ` Jan Kara
2019-06-06  1:45 ` [PATCH RFC 03/10] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
2019-06-06  6:18   ` Christoph Hellwig
2019-06-06 16:10     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 04/10] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
2019-06-06  1:45 ` [PATCH RFC 05/10] fs/ext4: Teach ext4 to break layout leases ira.weiny
2019-06-06  1:45 ` [PATCH RFC 06/10] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
2019-06-06  1:45 ` [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06 10:58   ` Jan Kara
2019-06-06 16:17     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 08/10] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
2019-06-06  1:45 ` [PATCH RFC 09/10] fs/xfs: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45 ` [PATCH RFC 10/10] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
2019-06-06  5:52 ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal John Hubbard
2019-06-06 17:11   ` Ira Weiny
2019-06-06 19:46     ` Jason Gunthorpe
2019-06-06 10:42 ` Jan Kara
2019-06-06 15:35   ` Dan Williams
2019-06-06 19:51   ` Jason Gunthorpe
2019-06-06 22:22     ` Ira Weiny
2019-06-07 10:36       ` Jan Kara
2019-06-07 12:17         ` Jason Gunthorpe
2019-06-07 14:52           ` Ira Weiny
2019-06-07 15:10             ` Jason Gunthorpe
2019-06-12 10:29             ` Jan Kara
2019-06-12 11:47               ` Jason Gunthorpe
2019-06-12 12:09                 ` Jan Kara
2019-06-12 18:41                   ` Dan Williams
2019-06-13  7:17                     ` Jan Kara
2019-06-12 19:14                   ` Jason Gunthorpe
2019-06-12 22:13                     ` Ira Weiny
2019-06-12 22:54                       ` Dan Williams
2019-06-12 23:33                         ` Ira Weiny
2019-06-13  1:14                           ` Dan Williams
2019-06-13 15:13                             ` Jason Gunthorpe
2019-06-13 16:25                               ` Dan Williams
2019-06-13 17:18                                 ` Jason Gunthorpe
2019-06-13 16:53                           ` Dan Williams
2019-06-13 15:12                         ` Jason Gunthorpe
2019-06-13  7:53                       ` Jan Kara
2019-06-12 18:49               ` Dan Williams
2019-06-13  7:43                 ` Jan Kara
2019-06-06 22:03   ` Ira Weiny
2019-06-06 22:26     ` Ira Weiny
2019-06-06 22:28     ` Dave Chinner
2019-06-07 11:04     ` Jan Kara
2019-06-07 18:25       ` Ira Weiny
2019-06-07 18:50         ` Jason Gunthorpe
2019-06-08  0:10         ` Dave Chinner
2019-06-09  1:29           ` Ira Weiny
2019-06-12 12:37           ` Matthew Wilcox
2019-06-12 23:30             ` Ira Weiny
2019-06-13  0:55               ` Dave Chinner
2019-06-13 20:34                 ` Ira Weiny
2019-06-14  3:42                   ` Dave Chinner
2019-06-13  0:25             ` Dave Chinner
2019-06-13  3:23               ` Matthew Wilcox
2019-06-13  4:36                 ` Dave Chinner
2019-06-13 10:47                   ` Matthew Wilcox
2019-06-13 15:29                 ` Jason Gunthorpe
2019-06-13 15:27               ` Matthew Wilcox
2019-06-13 21:13                 ` Ira Weiny
2019-06-13 23:45                   ` Jason Gunthorpe
2019-06-14  0:00                     ` Ira Weiny
2019-06-14  2:09                     ` Dave Chinner
2019-06-14  2:31                       ` Matthew Wilcox
2019-06-14  3:07                         ` Dave Chinner
2019-06-20 14:52                 ` Jan Kara
2019-06-13 20:34               ` Ira Weiny
2019-06-14  2:58                 ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4e5eb31a41b91a28fbc83c65195a2c75a59cfa24.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).