From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
	Christopher Lameter <cl@linux.com>,
	Doug Ledford <dledford@redhat.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Matthew Wilcox <willy@infradead.org>,
	Ira Weiny <ira.weiny@intel.com>,
	lsf-pc@lists.linux-foundation.org,
	linux-rdma <linux-rdma@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Michal Hocko <mhocko@kernel.org>
Subject: Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
Date: Tue, 12 Feb 2019 17:07:07 +0100
Message-ID: <20190212160707.GA19076@quack2.suse.cz>
In-Reply-To: <CAPcyv4iHso+PqAm-4NfF0svoK4mELJMSWNp+vsG43UaW1S2eew@mail.gmail.com>

On Mon 11-02-19 09:22:58, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > One approach that may be a clean way to solve this:
> > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > >    on the longterm pinned range until the long term pin is removed.
> > > > >
> > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > demand during writes?
> > > > >
> > > > > IOWs, this requires the application to set up the file in the
> > > > > correct state for the filesystem to lock it down so somebody else
> > > > > can write to it.  That means the file can't be sparse, it can't be
> > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > > written to its full size before being shared because otherwise it
> > > > > exposes stale data to the remote client (secure sites are going to
> > > > > love that!), they can't be extended, etc.
> > > > >
> > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > an immutable for the purposes of local access.
> > > > >
> > > > > Which, essentially we can already do. Prep the file, map it
> > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > interface which can do the necessary checks.
> > > >
> > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > will be a source of reflink? That seems to be currently allowed for
> > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > similarity seems to be quite large there. What do you think?
> > >
> > > This sounds so familiar...
> > >
> > >     https://lwn.net/Articles/726481/
> > >
> > > I'm not opposed to trying again, but leases was what crawled out
> > > smoking crater when this last proposal was nuked.
> >
> > Umm, don't think this is that similar to daxctl() discussion. We are not
> > speaking about providing any new userspace API for this.
> 
> I thought explicit userspace API was one of the outcomes, i.e. that we
> can't depend on this behavior being an implicit side effect of a page
> pin?

I was thinking of an implicit side effect of the gup_longterm() call, similar
to how swapon(2) does not require the file to be marked in any special way.
But OTOH I agree that RDMA is a less controlled usage than swapon, so that is
questionable. I'd still require something like CAP_LINUX_IMMUTABLE at least
for gup_longterm() calls that end up pinning the file.

Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
should succeed only if there is an FL_LAYOUT lease for the range being pinned,
and we don't allow the lease to be released as long as there's a pinned page
in the range. And we make the file protected (i.e. treat it like a swapfile)
if there's any such lease on it. But this is just a rough sketch and needs
more thinking.
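
To make the rule above a bit more concrete, here is a minimal userspace model
in C. It is only a sketch under stated assumptions: struct file_model and all
model_*() functions are hypothetical names invented for illustration, not
kernel APIs, and the error codes are picked arbitrarily. The
CAP_LINUX_IMMUTABLE check is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one file; nothing here is a real kernel structure. */
struct file_model {
	bool has_layout_lease;   /* stands in for an FL_LAYOUT lease */
	size_t lease_start;      /* first byte covered by the lease */
	size_t lease_len;        /* length of the leased range */
	unsigned int pin_count;  /* longterm pins taken inside the lease */
};

/* True if [start, start + len) lies fully inside the leased range. */
static bool range_in_lease(const struct file_model *f, size_t start, size_t len)
{
	return f->has_layout_lease &&
	       start >= f->lease_start &&
	       len <= f->lease_len &&
	       start - f->lease_start <= f->lease_len - len;
}

/* Take a layout lease on a range; one lease per file in this model. */
int model_take_lease(struct file_model *f, size_t start, size_t len)
{
	if (len == 0)
		return -EINVAL;
	if (f->has_layout_lease)
		return -EBUSY;
	f->has_layout_lease = true;
	f->lease_start = start;
	f->lease_len = len;
	return 0;
}

/* The longterm pin succeeds only if a lease covers the whole range. */
int model_gup_longterm(struct file_model *f, size_t start, size_t len)
{
	if (len == 0)
		return -EINVAL;
	if (!range_in_lease(f, start, len))
		return -EPERM;
	f->pin_count++;
	return 0;
}

/* Drop one longterm pin. */
void model_unpin(struct file_model *f)
{
	assert(f->pin_count > 0);
	f->pin_count--;
}

/* The lease cannot be released while any pinned page remains. */
int model_release_lease(struct file_model *f)
{
	if (!f->has_layout_lease)
		return -EINVAL;
	if (f->pin_count > 0)
		return -EBUSY;
	f->has_layout_lease = false;
	return 0;
}

/* While a lease exists the file is protected, like a swapfile. */
int model_punch_hole(const struct file_model *f)
{
	return f->has_layout_lease ? -ETXTBSY : 0;
}
```

In this model the pin is refused without a covering lease, the lease release
is refused while a pin is outstanding, and hole punching is refused for as
long as the lease exists, whether or not anything is pinned at that moment.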

> > Also I think the
> > situation about leases has somewhat cleared up with this discussion - ODP
> > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > hardware it is difficult to handle leases as such hardware has only one big
> > kill-everything call and using that would effectively mean lot of work on
> > the userspace side to resetup everything to make things useful if workable
> > at all.
> >
> > So my proposal would be:
> >
> > 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> > its teardown when fs needs it.
> >
> > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > to use gup_longterm() (we may actually rename it to a more suitable name).
> > FS may just refuse such calls (for normal page cache backed file, it will
> > just return success but for DAX file it will do sanity checks whether the
> > file is fully allocated etc. like we currently do for swapfiles) but if
> > gup_longterm() returns success, it will provide the same guarantees as for
> > swapfiles. So the only thing that we need is some call from gup_longterm()
> > to a filesystem callback to tell it - this file is going to be used by a
> > third party as an IO buffer, don't touch it. And we can (and should)
> > probably refactor the handling to be shared between swapfiles and
> > gup_longterm().
> 
> Yes, let's pursue this. At the risk of "arguing past 'yes'" this is a
> solution I thought we dax folks walked away from in the original
> MAP_DIRECT discussion [1]. Here is where leases were the response to
> MAP_DIRECT [2]. ...and here is where we had tame discussions about
> implications of notifying memory-registrations of lease break events
> [3].

Yeah, thanks for the references.

> I honestly don't like the idea that random subsystems can pin down
> file blocks as a side effect of gup on the result of mmap. Recall that
> it's not just RDMA that wants this guarantee. It seems safer to have
> the file be in an explicit block-allocation-immutable-mode so that the
> fallocate man page can describe this error case. Otherwise how would
> you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?

So if we require a lease for gup_longterm() to succeed (with the
FALLOC_FL_PUNCH_HOLE failure keyed off the existence of such a lease),
does it look more reasonable to you?

> [1]: https://lwn.net/Articles/736333/
> [2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
> [3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
