All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ira Weiny <ira.weiny@intel.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Jason Gunthorpe" <jgg@ziepe.ca>, "Theodore Ts'o" <tytso@mit.edu>,
	linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org,
	"Dave Chinner" <david@fromorbit.com>,
	"Jeff Layton" <jlayton@kernel.org>,
	linux-kernel@vger.kernel.org,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel@vger.kernel.org, "Jan Kara" <jack@suse.cz>,
	linux-ext4@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
Date: Thu, 6 Jun 2019 10:11:58 -0700	[thread overview]
Message-ID: <20190606171158.GB11374@iweiny-DESK2.sc.intel.com> (raw)
In-Reply-To: <c559c2ce-50dc-d143-5741-fe3d21d0305c@nvidia.com>

On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote:
> On 6/5/19 6:45 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > 	John Hubbard's put_user_pages() patch series.[1]
> > 	Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >    (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> >    notification to the original thread that another thread has tried to delete
> >    their data.  Furthermore this indicates that if the user needs to GUP the
> >    file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by another
> > operation further GUP operations on the file will fail without re-taking the
> > lease.  This means that if a user would like to register pieces of a file and
> > continue to register other pieces later they would be advised to keep the
> > layout lease, get a SIGIO notification, and retake the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.  Similar
> > to accessing an mmap to this area GUP pins of that memory may fail.
> > 
> 
> Hi Ira,
> 
> Wow, great to see this. This looks like basically the right behavior, IMHO.
> 
> 1. We'll need man page additions, to explain it. In fact, even after a quick first
> pass through, I'm vague on two points:

Of course.  But I was not going to go through and attempt to write man pages
and other docs without some agreement on the final mechanisms.  This works
which was the basic requirement I had to send an RFC.  :-D  But yes man pages
and updates to headers etc all have to be done.

> 
> a) I'm not sure how this actually provides "opt-in to new behavior", because I 
> don't see any CONFIG_* or boot time choices, and it looks like the new behavior 
> just is there. That is, if user space doesn't set F_LAYOUT on a range, 
> GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?)

The opt in is at run time.  Currently GUP FOLL_LONGTERM is _not_ _allowed_ on
the FS DAX pages at all.  So the default behavior is the same, GUP fails.  (Or
specifically ibv_reg_mr() fails.  This fails as before, not change there.

The Opt in is that if a user knows what is involved they can take the lease and
the GUP will not fail.  This comes with the price of knowing that other
processes can't truncate those pages in use.

> 
> b) Truncate and hole punch behavior, with and without user space having a SIGIO
> handler. (I'm sure this is obvious after another look through, but it might go
> nicely in a man page.)

Sorry this was not clear.  There are 2 points for this patch set which requires
the use of catching SIGIO.

1) If an application _actually_ does (somehow, somewhere, in some unforseen use
   case) want to allow a truncate to happen.  They can catch the SIGIO, finish
   their use of the pages, and release them.  As long as they can do this within
   the <sysfs>/lease-time-break time they are ok and the truncate can proceed.

2) This is a bit more subtle and something I almost delayed sending these out
   for.  Currently the implementation of a lease break actually removes the
   lease from the file.  I did not want this to happen and I was thinking of
   delaying this patch set to implement something which keeps the lease around
   but I figured I should get something out for comments.  Jan has proposed
   something along these lines and I agree with him so I'm going to ask you to
   read my response to him about the details.

   Anyway so the key here is that currently an app needs the SIGIO to retake
   the lease if they want to map the file again or in parts based on usage.
   For example, they may only want to map some of the file for when they are
   using it and then map another part later.  Without the SIGIO they would lose
   their lease or would have to just take the lease for each GUP pin (which
   adds overhead).  Like I said I did not like this but I left it to get
   something which works out.

> 
> 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case,
> but for general RDMA on them? Or is there more that must be done?

This is limited to DAX.  All the functionality is limited to *_devmap or "is
DAX" cases.  I'm still thinking that page cache backed files can have a better
solution for the user.

> 
> 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will
> conflict violently, as I'm sure you noticed. :)

Yep...  But I needed to get the conversation started on this idea.

Thanks for the feedback!
Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: Ira Weiny <ira.weiny@intel.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Dan Williams" <dan.j.williams@intel.com>,
	"Jan Kara" <jack@suse.cz>, "Theodore Ts'o" <tytso@mit.edu>,
	"Jeff Layton" <jlayton@kernel.org>,
	"Dave Chinner" <david@fromorbit.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvdimm@lists.01.org, linux-ext4@vger.kernel.org,
	linux-mm@kvack.org, "Jason Gunthorpe" <jgg@ziepe.ca>,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
Date: Thu, 6 Jun 2019 10:11:58 -0700	[thread overview]
Message-ID: <20190606171158.GB11374@iweiny-DESK2.sc.intel.com> (raw)
In-Reply-To: <c559c2ce-50dc-d143-5741-fe3d21d0305c@nvidia.com>

On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote:
> On 6/5/19 6:45 PM, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > ... V1,000,000   ;-)
> > 
> > Pre-requisites:
> > 	John Hubbard's put_user_pages() patch series.[1]
> > 	Jan Kara's ext4_break_layouts() fixes[2]
> > 
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >    (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> >    notification to the original thread that another thread has tried to delete
> >    their data.  Furthermore this indicates that if the user needs to GUP the
> >    file again they will need to retake the Layout lease before doing so.
> > 
> > 
> > NOTE: If the user releases the layout lease or if it has been broken by another
> > operation further GUP operations on the file will fail without re-taking the
> > lease.  This means that if a user would like to register pieces of a file and
> > continue to register other pieces later they would be advised to keep the
> > layout lease, get a SIGIO notification, and retake the lease.
> > 
> > NOTE2: Truncation of pages which are not actively pinned will succeed.  Similar
> > to accessing an mmap to this area GUP pins of that memory may fail.
> > 
> 
> Hi Ira,
> 
> Wow, great to see this. This looks like basically the right behavior, IMHO.
> 
> 1. We'll need man page additions, to explain it. In fact, even after a quick first
> pass through, I'm vague on two points:

Of course.  But I was not going to go through and attempt to write man pages
and other docs without some agreement on the final mechanisms.  This works
which was the basic requirement I had to send an RFC.  :-D  But yes man pages
and updates to headers etc all have to be done.

> 
> a) I'm not sure how this actually provides "opt-in to new behavior", because I 
> don't see any CONFIG_* or boot time choices, and it looks like the new behavior 
> just is there. That is, if user space doesn't set F_LAYOUT on a range, 
> GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?)

The opt in is at run time.  Currently GUP FOLL_LONGTERM is _not_ _allowed_ on
the FS DAX pages at all.  So the default behavior is the same, GUP fails.  (Or
specifically ibv_reg_mr() fails.  This fails as before, not change there.

The Opt in is that if a user knows what is involved they can take the lease and
the GUP will not fail.  This comes with the price of knowing that other
processes can't truncate those pages in use.

> 
> b) Truncate and hole punch behavior, with and without user space having a SIGIO
> handler. (I'm sure this is obvious after another look through, but it might go
> nicely in a man page.)

Sorry this was not clear.  There are 2 points for this patch set which requires
the use of catching SIGIO.

1) If an application _actually_ does (somehow, somewhere, in some unforseen use
   case) want to allow a truncate to happen.  They can catch the SIGIO, finish
   their use of the pages, and release them.  As long as they can do this within
   the <sysfs>/lease-time-break time they are ok and the truncate can proceed.

2) This is a bit more subtle and something I almost delayed sending these out
   for.  Currently the implementation of a lease break actually removes the
   lease from the file.  I did not want this to happen and I was thinking of
   delaying this patch set to implement something which keeps the lease around
   but I figured I should get something out for comments.  Jan has proposed
   something along these lines and I agree with him so I'm going to ask you to
   read my response to him about the details.

   Anyway so the key here is that currently an app needs the SIGIO to retake
   the lease if they want to map the file again or in parts based on usage.
   For example, they may only want to map some of the file for when they are
   using it and then map another part later.  Without the SIGIO they would lose
   their lease or would have to just take the lease for each GUP pin (which
   adds overhead).  Like I said I did not like this but I left it to get
   something which works out.

> 
> 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case,
> but for general RDMA on them? Or is there more that must be done?

This is limited to DAX.  All the functionality is limited to *_devmap or "is
DAX" cases.  I'm still thinking that page cache backed files can have a better
solution for the user.

> 
> 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will
> conflict violently, as I'm sure you noticed. :)

Yep...  But I needed to get the conversation started on this idea.

Thanks for the feedback!
Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 

  reply	other threads:[~2019-06-06 17:10 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-06  1:45 [PATCH RFC 00/10] RDMA/FS DAX truncate proposal ira.weiny
2019-06-06  1:45 ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 01/10] fs/locks: Add trace_leases_conflict ira.weiny
2019-06-09 12:52   ` Jeff Layton
2019-06-06  1:45 ` [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-09 13:00   ` Jeff Layton
2019-06-09 13:00     ` Jeff Layton
2019-06-11 21:38     ` Ira Weiny
2019-06-11 21:38       ` Ira Weiny
2019-06-12  9:46       ` Jan Kara
2019-06-06  1:45 ` [PATCH RFC 03/10] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  6:18   ` Christoph Hellwig
2019-06-06 16:10     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 04/10] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 05/10] fs/ext4: Teach ext4 to break layout leases ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 06/10] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06 10:58   ` Jan Kara
2019-06-06 10:58     ` Jan Kara
2019-06-06 16:17     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 08/10] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 09/10] fs/xfs: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 10/10] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  5:52 ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal John Hubbard
2019-06-06  5:52   ` John Hubbard
2019-06-06 17:11   ` Ira Weiny [this message]
2019-06-06 17:11     ` Ira Weiny
2019-06-06 19:46     ` Jason Gunthorpe
2019-06-06 10:42 ` Jan Kara
2019-06-06 15:35   ` Dan Williams
2019-06-06 19:51   ` Jason Gunthorpe
2019-06-06 22:22     ` Ira Weiny
2019-06-07 10:36       ` Jan Kara
2019-06-07 12:17         ` Jason Gunthorpe
2019-06-07 14:52           ` Ira Weiny
2019-06-07 14:52             ` Ira Weiny
2019-06-07 15:10             ` Jason Gunthorpe
2019-06-12 10:29             ` Jan Kara
2019-06-12 10:29               ` Jan Kara
2019-06-12 11:47               ` Jason Gunthorpe
2019-06-12 12:09                 ` Jan Kara
2019-06-12 12:09                   ` Jan Kara
2019-06-12 18:41                   ` Dan Williams
2019-06-13  7:17                     ` Jan Kara
2019-06-13  7:17                       ` Jan Kara
2019-06-12 19:14                   ` Jason Gunthorpe
2019-06-12 22:13                     ` Ira Weiny
2019-06-12 22:54                       ` Dan Williams
2019-06-12 22:54                         ` Dan Williams
2019-06-12 23:33                         ` Ira Weiny
2019-06-12 23:33                           ` Ira Weiny
2019-06-13  1:14                           ` Dan Williams
2019-06-13  1:14                             ` Dan Williams
2019-06-13 15:13                             ` Jason Gunthorpe
2019-06-13 16:25                               ` Dan Williams
2019-06-13 16:25                                 ` Dan Williams
2019-06-13 17:18                                 ` Jason Gunthorpe
2019-06-13 16:53                           ` Dan Williams
2019-06-13 16:53                             ` Dan Williams
2019-06-13 15:12                         ` Jason Gunthorpe
2019-06-13  7:53                       ` Jan Kara
2019-06-13  7:53                         ` Jan Kara
2019-06-12 18:49               ` Dan Williams
2019-06-12 18:49                 ` Dan Williams
2019-06-13  7:43                 ` Jan Kara
2019-06-06 22:03   ` Ira Weiny
2019-06-06 22:03     ` Ira Weiny
2019-06-06 22:26     ` Ira Weiny
2019-06-06 22:28     ` Dave Chinner
2019-06-07 11:04     ` Jan Kara
2019-06-07 18:25       ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:50         ` Jason Gunthorpe
2019-06-08  0:10         ` Dave Chinner
2019-06-08  0:10           ` Dave Chinner
2019-06-09  1:29           ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-12 12:37           ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 23:30             ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-13  0:55               ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-14  3:42                   ` Dave Chinner
2019-06-13  0:25             ` Dave Chinner
2019-06-13  0:25               ` Dave Chinner
2019-06-13  3:23               ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  4:36                 ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13 10:47                   ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 15:29                 ` Jason Gunthorpe
2019-06-13 15:27               ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 21:13                 ` Ira Weiny
2019-06-13 21:13                   ` Ira Weiny
2019-06-13 23:45                   ` Jason Gunthorpe
2019-06-14  0:00                     ` Ira Weiny
2019-06-14  0:00                       ` Ira Weiny
2019-06-14  2:09                     ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:31                       ` Matthew Wilcox
2019-06-14  2:31                         ` Matthew Wilcox
2019-06-14  3:07                         ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-20 14:52                 ` Jan Kara
2019-06-20 14:52                   ` Jan Kara
2019-06-13 20:34               ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-14  2:58                 ` Dave Chinner
2019-06-14  2:58                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190606171158.GB11374@iweiny-DESK2.sc.intel.com \
    --to=ira.weiny@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=jlayton@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.