All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "Weiny, Ira" <ira.weiny@intel.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	"Jeff Layton" <jlayton@kernel.org>,
	"Dave Chinner" <david@fromorbit.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	"Linux MM" <linux-mm@kvack.org>
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal
Date: Thu, 6 Jun 2019 08:35:42 -0700	[thread overview]
Message-ID: <CAPcyv4h-k_5T39fDY+SVrLXG_XETmgz-6N3NjQUteYG7g9NdDQ@mail.gmail.com> (raw)
In-Reply-To: <20190606104203.GF7433@quack2.suse.cz>

On Thu, Jun 6, 2019 at 3:42 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 05-06-19 18:45:33, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > ... V1,000,000   ;-)
> >
> > Pre-requisites:
> >       John Hubbard's put_user_pages() patch series.[1]
> >       Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted.  I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior.  And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> >    (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> >    notification to the original thread that another thread has tried to delete
> >    their data.  Furthermore this indicates that if the user needs to GUP the
> >    file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease.  This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
>
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated".  The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.
>
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Yes, this satisfies the primary concern that made me oppose failing
truncate. If the administrator determines that reclaiming capacity is
more important than maintaining active RDMA mappings "lsof + kill" is
a reasonable way to recover. I'd go so far as to say that anything
less is an abdication of the kernel's responsibility as an arbiter of
platform resources.

> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking. Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...
>
> What do people think about this?

SIGKILL on close() without explicit unlock and wait-on-last-pin with
explicit unlock sounds reasonable to me.

  reply	other threads:[~2019-06-06 15:35 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-06  1:45 [PATCH RFC 00/10] RDMA/FS DAX truncate proposal ira.weiny
2019-06-06  1:45 ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 01/10] fs/locks: Add trace_leases_conflict ira.weiny
2019-06-09 12:52   ` Jeff Layton
2019-06-06  1:45 ` [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-09 13:00   ` Jeff Layton
2019-06-09 13:00     ` Jeff Layton
2019-06-11 21:38     ` Ira Weiny
2019-06-11 21:38       ` Ira Weiny
2019-06-12  9:46       ` Jan Kara
2019-06-06  1:45 ` [PATCH RFC 03/10] mm/gup: Pass flags down to __gup_device_huge* calls ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  6:18   ` Christoph Hellwig
2019-06-06 16:10     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 04/10] mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 05/10] fs/ext4: Teach ext4 to break layout leases ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 06/10] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06 10:58   ` Jan Kara
2019-06-06 10:58     ` Jan Kara
2019-06-06 16:17     ` Ira Weiny
2019-06-06  1:45 ` [PATCH RFC 08/10] fs/xfs: Teach xfs to use new dax_layout_busy_page() ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 09/10] fs/xfs: Fail truncate if pages are GUP pinned ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  1:45 ` [PATCH RFC 10/10] mm/gup: Remove FOLL_LONGTERM DAX exclusion ira.weiny
2019-06-06  1:45   ` ira.weiny
2019-06-06  5:52 ` [PATCH RFC 00/10] RDMA/FS DAX truncate proposal John Hubbard
2019-06-06  5:52   ` John Hubbard
2019-06-06 17:11   ` Ira Weiny
2019-06-06 17:11     ` Ira Weiny
2019-06-06 19:46     ` Jason Gunthorpe
2019-06-06 10:42 ` Jan Kara
2019-06-06 15:35   ` Dan Williams [this message]
2019-06-06 19:51   ` Jason Gunthorpe
2019-06-06 22:22     ` Ira Weiny
2019-06-07 10:36       ` Jan Kara
2019-06-07 12:17         ` Jason Gunthorpe
2019-06-07 14:52           ` Ira Weiny
2019-06-07 14:52             ` Ira Weiny
2019-06-07 15:10             ` Jason Gunthorpe
2019-06-12 10:29             ` Jan Kara
2019-06-12 10:29               ` Jan Kara
2019-06-12 11:47               ` Jason Gunthorpe
2019-06-12 12:09                 ` Jan Kara
2019-06-12 12:09                   ` Jan Kara
2019-06-12 18:41                   ` Dan Williams
2019-06-13  7:17                     ` Jan Kara
2019-06-13  7:17                       ` Jan Kara
2019-06-12 19:14                   ` Jason Gunthorpe
2019-06-12 22:13                     ` Ira Weiny
2019-06-12 22:54                       ` Dan Williams
2019-06-12 22:54                         ` Dan Williams
2019-06-12 23:33                         ` Ira Weiny
2019-06-12 23:33                           ` Ira Weiny
2019-06-13  1:14                           ` Dan Williams
2019-06-13  1:14                             ` Dan Williams
2019-06-13 15:13                             ` Jason Gunthorpe
2019-06-13 16:25                               ` Dan Williams
2019-06-13 16:25                                 ` Dan Williams
2019-06-13 17:18                                 ` Jason Gunthorpe
2019-06-13 16:53                           ` Dan Williams
2019-06-13 16:53                             ` Dan Williams
2019-06-13 15:12                         ` Jason Gunthorpe
2019-06-13  7:53                       ` Jan Kara
2019-06-13  7:53                         ` Jan Kara
2019-06-12 18:49               ` Dan Williams
2019-06-12 18:49                 ` Dan Williams
2019-06-13  7:43                 ` Jan Kara
2019-06-06 22:03   ` Ira Weiny
2019-06-06 22:03     ` Ira Weiny
2019-06-06 22:26     ` Ira Weiny
2019-06-06 22:28     ` Dave Chinner
2019-06-07 11:04     ` Jan Kara
2019-06-07 18:25       ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:25         ` Ira Weiny
2019-06-07 18:50         ` Jason Gunthorpe
2019-06-08  0:10         ` Dave Chinner
2019-06-08  0:10           ` Dave Chinner
2019-06-09  1:29           ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-09  1:29             ` Ira Weiny
2019-06-12 12:37           ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 12:37             ` Matthew Wilcox
2019-06-12 23:30             ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-12 23:30               ` Ira Weiny
2019-06-13  0:55               ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13  0:55                 ` Dave Chinner
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-13 20:34                   ` Ira Weiny
2019-06-14  3:42                   ` Dave Chinner
2019-06-13  0:25             ` Dave Chinner
2019-06-13  0:25               ` Dave Chinner
2019-06-13  3:23               ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  3:23                 ` Matthew Wilcox
2019-06-13  4:36                 ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13  4:36                   ` Dave Chinner
2019-06-13 10:47                   ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 10:47                     ` Matthew Wilcox
2019-06-13 15:29                 ` Jason Gunthorpe
2019-06-13 15:27               ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 15:27                 ` Matthew Wilcox
2019-06-13 21:13                 ` Ira Weiny
2019-06-13 21:13                   ` Ira Weiny
2019-06-13 23:45                   ` Jason Gunthorpe
2019-06-14  0:00                     ` Ira Weiny
2019-06-14  0:00                       ` Ira Weiny
2019-06-14  2:09                     ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:09                       ` Dave Chinner
2019-06-14  2:31                       ` Matthew Wilcox
2019-06-14  2:31                         ` Matthew Wilcox
2019-06-14  3:07                         ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-14  3:07                           ` Dave Chinner
2019-06-20 14:52                 ` Jan Kara
2019-06-20 14:52                   ` Jan Kara
2019-06-13 20:34               ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-13 20:34                 ` Ira Weiny
2019-06-14  2:58                 ` Dave Chinner
2019-06-14  2:58                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4h-k_5T39fDY+SVrLXG_XETmgz-6N3NjQUteYG7g9NdDQ@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=jlayton@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.