linux-kernel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-05 17:50 Ira Weiny
  2019-02-05 18:01 ` Ira Weiny
  2019-02-06  9:50 ` Jan Kara
  0 siblings, 2 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-05 17:50 UTC (permalink / raw)
  To: lsf-pc, linux-rdma, linux-mm, linux-kernel
  Cc: John Hubbard, Jan Kara, Jerome Glisse, Dan Williams,
	Matthew Wilcox, Jason Gunthorpe, Dave Chinner, Doug Ledford,
	Michal Hocko


The problem: once we have pages marked as GUP-pinned, how should the
various subsystems work with those markings?

The current work on John Hubbard's proposed solutions (parts 1 and 2) is
progressing.[1]  But the final part (3) of his solution is also going to take
some work.

In John's presentation he lists these alternatives for GUP-pinned pages:

1) Hold off try_to_unmap
2) Allow writeback while pinned (via bounce buffers)
	[Note this will not work for DAX]
3) Use a "revocable reservation" (or lease) on those pages
4) Pin the blocks as busy in the FS allocator

The problem with leases on pages used by RDMA is that the references to
these pages are not local to the machine.  Once the user has been given access
to a page they can, through the use of remote tokens, hand a reference to that
page to remote nodes.  This is the core essence of RDMA and, like it or not,
something which is increasingly used by major Linux users.

Therefore we need to discuss the extent to which leases are appropriate and
what happens should a lease revocation go unanswered by the user.

As John Hubbard put it:

"Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be to
require GUP users monitor and respond to CPU page table updates. Subsystems
such as ODP and HMM do this, for example. This aspect of the problem is
still under discussion."

	-- John Hubbard[2]

The following people have been involved in previous conversations and would be
key to the face-to-face discussion.

John Hubbard
Jan Kara
Dave Chinner
Michal Hocko
Dan Williams
Matthew Wilcox
Jason Gunthorpe

Thank you,
Ira Weiny

[1] https://linuxplumbersconf.org/event/2/contributions/126/attachments/136/168/LPC_2018_gup_dma.pdf
[2] https://lkml.org/lkml/2019/2/4/7


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-05 17:50 [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA Ira Weiny
@ 2019-02-05 18:01 ` Ira Weiny
  2019-02-06 21:31   ` Dave Chinner
  2019-02-06  9:50 ` Jan Kara
  1 sibling, 1 reply; 106+ messages in thread
From: Ira Weiny @ 2019-02-05 18:01 UTC (permalink / raw)
  To: lsf-pc, linux-rdma, linux-mm, linux-kernel
  Cc: John Hubbard, Jan Kara, Jerome Glisse, Dan Williams,
	Matthew Wilcox, Dave Chinner, Doug Ledford, Michal Hocko,
	Jason Gunthorpe

I had an old invalid address for Jason Gunthorpe in my address book...  

Correcting his email in the thread.

On Tue, Feb 05, 2019 at 09:50:59AM -0800, 'Ira Weiny' wrote:
> 
> The problem: once we have pages marked as GUP-pinned, how should the
> various subsystems work with those markings?
> 
> The current work on John Hubbard's proposed solutions (parts 1 and 2) is
> progressing.[1]  But the final part (3) of his solution is also going to take
> some work.
> 
> In John's presentation he lists these alternatives for GUP-pinned pages:
> 
> 1) Hold off try_to_unmap
> 2) Allow writeback while pinned (via bounce buffers)
> 	[Note this will not work for DAX]
> 3) Use a "revocable reservation" (or lease) on those pages
> 4) Pin the blocks as busy in the FS allocator
> 
> The problem with leases on pages used by RDMA is that the references to
> these pages are not local to the machine.  Once the user has been given access
> to a page they can, through the use of remote tokens, hand a reference to that
> page to remote nodes.  This is the core essence of RDMA and, like it or not,
> something which is increasingly used by major Linux users.
> 
> Therefore we need to discuss the extent to which leases are appropriate and
> what happens should a lease revocation go unanswered by the user.
> 
> As John Hubbard put it:
> 
> "Other filesystem features that need to replace the page with a new one can
> be inhibited for pages that are GUP-pinned. This will, however, alter and
> limit some of those filesystem features. The only fix for that would be to
> require GUP users monitor and respond to CPU page table updates. Subsystems
> such as ODP and HMM do this, for example. This aspect of the problem is
> still under discussion."
> 
> 	-- John Hubbard[2]
> 
> The following people have been involved in previous conversations and would be
> key to the face-to-face discussion.
> 
> John Hubbard
> Jan Kara
> Dave Chinner
> Michal Hocko
> Dan Williams
> Matthew Wilcox
> Jason Gunthorpe
> 
> Thank you,
> Ira Weiny
> 
> [1] https://linuxplumbersconf.org/event/2/contributions/126/attachments/136/168/LPC_2018_gup_dma.pdf
> [2] https://lkml.org/lkml/2019/2/4/7
> 


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-05 17:50 [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA Ira Weiny
  2019-02-05 18:01 ` Ira Weiny
@ 2019-02-06  9:50 ` Jan Kara
  2019-02-06 17:31   ` Jason Gunthorpe
  1 sibling, 1 reply; 106+ messages in thread
From: Jan Kara @ 2019-02-06  9:50 UTC (permalink / raw)
  To: Ira Weiny
  Cc: lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jan Kara, Jerome Glisse, Dan Williams, Matthew Wilcox,
	Jason Gunthorpe, Dave Chinner, Doug Ledford, Michal Hocko

On Tue 05-02-19 09:50:59, Ira Weiny wrote:
> The problem: once we have pages marked as GUP-pinned, how should the
> various subsystems work with those markings?
> 
> The current work on John Hubbard's proposed solutions (parts 1 and 2) is
> progressing.[1]  But the final part (3) of his solution is also going to take
> some work.
> 
> In John's presentation he lists these alternatives for GUP-pinned pages:
> 
> 1) Hold off try_to_unmap
> 2) Allow writeback while pinned (via bounce buffers)
> 	[Note this will not work for DAX]

Well, but DAX does not need it because by definition there's nothing to
writeback :)

> 3) Use a "revocable reservation" (or lease) on those pages
> 4) Pin the blocks as busy in the FS allocator
> 
> The problem with leases on pages used by RDMA is that the references to
> these pages are not local to the machine.  Once the user has been given
> access to a page they can, through the use of remote tokens, hand a
> reference to that page to remote nodes.  This is the core essence of
> RDMA and, like it or not, something which is increasingly used by major
> Linux users.
> 
> Therefore we need to discuss the extent to which leases are appropriate and
> what happens should a lease revocation go unanswered by the user.

I don't know the RDMA hardware, so this is just the opinion of a filesystem /
mm guy, but my idea of how this should work would be:

MM/FS asks for lease to be revoked. The revoke handler agrees with the
other side on cancelling RDMA or whatever and drops the page pins. Now I
understand there can be HW / communication failures etc. in which case the
driver could either block waiting or make sure future IO will fail and drop
the pins. But under normal conditions there should be a way to revoke the
access. And if the HW/driver cannot support this, then don't let it anywhere
near DAX filesystem.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06  9:50 ` Jan Kara
@ 2019-02-06 17:31   ` Jason Gunthorpe
  2019-02-06 17:52     ` Matthew Wilcox
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 17:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Matthew Wilcox,
	Dave Chinner, Doug Ledford, Michal Hocko

On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:

> MM/FS asks for lease to be revoked. The revoke handler agrees with the
> other side on cancelling RDMA or whatever and drops the page pins. 

This takes a trip through userspace since the communication protocol
is entirely managed in userspace.

Most existing communication protocols don't have a 'cancel operation'.

> Now I understand there can be HW / communication failures etc. in
> which case the driver could either block waiting or make sure future
> IO will fail and drop the pins. 

We can always rip things away from the userspace.. However..

> But under normal conditions there should be a way to revoke the
> access. And if the HW/driver cannot support this, then don't let it
> anywhere near DAX filesystem.

I think the general observation is that people who want to do DAX &
RDMA want it to actually work, without data corruption, random process
kills or random communication failures.

Really, few users would actually want to run in a system where revoke
can be triggered.

So.. how can the FS/MM side provide a guarantee to the user that
revoke won't happen under a certain system design?

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 17:31   ` Jason Gunthorpe
@ 2019-02-06 17:52     ` Matthew Wilcox
  2019-02-06 18:32       ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 17:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Dave Chinner,
	Doug Ledford, Michal Hocko

On Wed, Feb 06, 2019 at 10:31:14AM -0700, Jason Gunthorpe wrote:
> On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:
> 
> > MM/FS asks for lease to be revoked. The revoke handler agrees with the
> > other side on cancelling RDMA or whatever and drops the page pins. 
> 
> This takes a trip through userspace since the communication protocol
> is entirely managed in userspace.
> 
> Most existing communication protocols don't have a 'cancel operation'.
> 
> > Now I understand there can be HW / communication failures etc. in
> > which case the driver could either block waiting or make sure future
> > IO will fail and drop the pins. 
> 
> We can always rip things away from the userspace.. However..
> 
> > But under normal conditions there should be a way to revoke the
> > access. And if the HW/driver cannot support this, then don't let it
> > anywhere near DAX filesystem.
> 
> I think the general observation is that people who want to do DAX &
> RDMA want it to actually work, without data corruption, random process
> kills or random communication failures.
> 
> Really, few users would actually want to run in a system where revoke
> can be triggered.
> 
> So.. how can the FS/MM side provide a guarantee to the user that
> revoke won't happen under a certain system design?

Most of the cases we want revoke for are things like truncate().
Shouldn't happen with a sane system, but we're trying to avoid users
doing awful things like being able to DMA to pages that are now part of
a different file.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 17:52     ` Matthew Wilcox
@ 2019-02-06 18:32       ` Doug Ledford
  2019-02-06 18:35         ` Matthew Wilcox
  2019-02-06 19:16         ` Christopher Lameter
  0 siblings, 2 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 18:32 UTC (permalink / raw)
  To: Matthew Wilcox, Jason Gunthorpe
  Cc: Jan Kara, Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Dave Chinner,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2287 bytes --]

On Wed, 2019-02-06 at 09:52 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 10:31:14AM -0700, Jason Gunthorpe wrote:
> > On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:
> > 
> > > MM/FS asks for lease to be revoked. The revoke handler agrees with the
> > > other side on cancelling RDMA or whatever and drops the page pins. 
> > 
> > This takes a trip through userspace since the communication protocol
> > is entirely managed in userspace.
> > 
> > Most existing communication protocols don't have a 'cancel operation'.
> > 
> > > Now I understand there can be HW / communication failures etc. in
> > > which case the driver could either block waiting or make sure future
> > > IO will fail and drop the pins. 
> > 
> > We can always rip things away from the userspace.. However..
> > 
> > > But under normal conditions there should be a way to revoke the
> > > access. And if the HW/driver cannot support this, then don't let it
> > > anywhere near DAX filesystem.
> > 
> > I think the general observation is that people who want to do DAX &
> > RDMA want it to actually work, without data corruption, random process
> > kills or random communication failures.
> > 
> > Really, few users would actually want to run in a system where revoke
> > can be triggered.
> > 
> > So.. how can the FS/MM side provide a guarantee to the user that
> > revoke won't happen under a certain system design?
> 
> Most of the cases we want revoke for are things like truncate().
> Shouldn't happen with a sane system, but we're trying to avoid users
> doing awful things like being able to DMA to pages that are now part of
> a different file.

Why is the solution revoke then?  Is there something besides truncate
that we have to worry about?  I ask because EBUSY is not currently
listed as a return value of truncate, so extending the API to include
EBUSY to mean "this file has pinned pages that can not be freed" is not
(or should not be) totally out of the question.

Admittedly, I'm coming in late to this conversation, but did I miss the
portion where that alternative was ruled out?

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 18:32       ` Doug Ledford
@ 2019-02-06 18:35         ` Matthew Wilcox
  2019-02-06 18:44           ` Doug Ledford
  2019-02-06 18:52           ` Jason Gunthorpe
  2019-02-06 19:16         ` Christopher Lameter
  1 sibling, 2 replies; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 18:35 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 01:32:04PM -0500, Doug Ledford wrote:
> On Wed, 2019-02-06 at 09:52 -0800, Matthew Wilcox wrote:
> > On Wed, Feb 06, 2019 at 10:31:14AM -0700, Jason Gunthorpe wrote:
> > > On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:
> > > 
> > > > MM/FS asks for lease to be revoked. The revoke handler agrees with the
> > > > other side on cancelling RDMA or whatever and drops the page pins. 
> > > 
> > > This takes a trip through userspace since the communication protocol
> > > is entirely managed in userspace.
> > > 
> > > Most existing communication protocols don't have a 'cancel operation'.
> > > 
> > > > Now I understand there can be HW / communication failures etc. in
> > > > which case the driver could either block waiting or make sure future
> > > > IO will fail and drop the pins. 
> > > 
> > > We can always rip things away from the userspace.. However..
> > > 
> > > > But under normal conditions there should be a way to revoke the
> > > > access. And if the HW/driver cannot support this, then don't let it
> > > > anywhere near DAX filesystem.
> > > 
> > > I think the general observation is that people who want to do DAX &
> > > RDMA want it to actually work, without data corruption, random process
> > > kills or random communication failures.
> > > 
> > > Really, few users would actually want to run in a system where revoke
> > > can be triggered.
> > > 
> > > So.. how can the FS/MM side provide a guarantee to the user that
> > > revoke won't happen under a certain system design?
> > 
> > Most of the cases we want revoke for are things like truncate().
> > Shouldn't happen with a sane system, but we're trying to avoid users
> > doing awful things like being able to DMA to pages that are now part of
> > a different file.
> 
> Why is the solution revoke then?  Is there something besides truncate
> that we have to worry about?  I ask because EBUSY is not currently
> listed as a return value of truncate, so extending the API to include
> EBUSY to mean "this file has pinned pages that can not be freed" is not
> (or should not be) totally out of the question.
> 
> Admittedly, I'm coming in late to this conversation, but did I miss the
> portion where that alternative was ruled out?

That's my preferred option too, but the preponderance of opinion leans
towards "We can't give people a way to make files un-truncatable".



* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 18:35         ` Matthew Wilcox
@ 2019-02-06 18:44           ` Doug Ledford
  2019-02-06 18:52           ` Jason Gunthorpe
  1 sibling, 0 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 18:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko


On Wed, 2019-02-06 at 10:35 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 01:32:04PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 09:52 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 10:31:14AM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Feb 06, 2019 at 10:50:00AM +0100, Jan Kara wrote:
> > > > 
> > > > > MM/FS asks for lease to be revoked. The revoke handler agrees with the
> > > > > other side on cancelling RDMA or whatever and drops the page pins. 
> > > > 
> > > > This takes a trip through userspace since the communication protocol
> > > > is entirely managed in userspace.
> > > > 
> > > > Most existing communication protocols don't have a 'cancel operation'.
> > > > 
> > > > > Now I understand there can be HW / communication failures etc. in
> > > > > which case the driver could either block waiting or make sure future
> > > > > IO will fail and drop the pins. 
> > > > 
> > > > We can always rip things away from the userspace.. However..
> > > > 
> > > > > But under normal conditions there should be a way to revoke the
> > > > > access. And if the HW/driver cannot support this, then don't let it
> > > > > anywhere near DAX filesystem.
> > > > 
> > > > I think the general observation is that people who want to do DAX &
> > > > RDMA want it to actually work, without data corruption, random process
> > > > kills or random communication failures.
> > > > 
> > > > Really, few users would actually want to run in a system where revoke
> > > > can be triggered.
> > > > 
> > > > So.. how can the FS/MM side provide a guarantee to the user that
> > > > revoke won't happen under a certain system design?
> > > 
> > > Most of the cases we want revoke for are things like truncate().
> > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > doing awful things like being able to DMA to pages that are now part of
> > > a different file.
> > 
> > Why is the solution revoke then?  Is there something besides truncate
> > that we have to worry about?  I ask because EBUSY is not currently
> > listed as a return value of truncate, so extending the API to include
> > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > (or should not be) totally out of the question.
> > 
> > Admittedly, I'm coming in late to this conversation, but did I miss the
> > portion where that alternative was ruled out?
> 
> That's my preferred option too, but the preponderance of opinion leans
> towards "We can't give people a way to make files un-truncatable".

Has anyone looked at the laundry list of possible failures truncate
already has?  Among others, ETXTBSY is already in the list, and it
allows someone to make a file un-truncatable by running it.  There's
EPERM for multiple failures.  In order for someone to make a file
untruncatable using this, they would have to have perms to the file
already anyway as well as perms to get the direct I/O pin.  I see no
reason why, if they have the perms to do it, you shouldn't allow them
to.  If you don't want someone else to make a file untruncatable that
you want to truncate, then don't share file perms with them.  What's the
difficulty here?  Really, creating this complex revoke thing to tear
down I/O when people really *don't* want that I/O getting torn down
seems like forcing a bad API on I/O to satisfy not doing what is an
entirely natural extension to an existing API.  You *shouldn't* have the
right to truncate a file that is busy, and ETXTBSY is a perfect example
of that, and an example of the API done right.  This other....

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 18:35         ` Matthew Wilcox
  2019-02-06 18:44           ` Doug Ledford
@ 2019-02-06 18:52           ` Jason Gunthorpe
  2019-02-06 19:45             ` Dan Williams
  1 sibling, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 18:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Doug Ledford, Jan Kara, Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	linux-kernel, John Hubbard, Jerome Glisse, Dan Williams,
	Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:

> > Admittedly, I'm coming in late to this conversation, but did I miss the
> > portion where that alternative was ruled out?
> 
> That's my preferred option too, but the preponderance of opinion leans
> towards "We can't give people a way to make files un-truncatable".

I haven't heard an explanation why blocking ftruncate is worse than
giving people a way to break an RDMA-using process by calling ftruncate??

Isn't it exactly the same argument the other way?

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 18:32       ` Doug Ledford
  2019-02-06 18:35         ` Matthew Wilcox
@ 2019-02-06 19:16         ` Christopher Lameter
  2019-02-06 19:40           ` Matthew Wilcox
  2019-02-06 21:03           ` Dave Chinner
  1 sibling, 2 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-06 19:16 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matthew Wilcox, Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, 6 Feb 2019, Doug Ledford wrote:

> > Most of the cases we want revoke for are things like truncate().
> > Shouldn't happen with a sane system, but we're trying to avoid users
> > doing awful things like being able to DMA to pages that are now part of
> > a different file.
>
> Why is the solution revoke then?  Is there something besides truncate
> that we have to worry about?  I ask because EBUSY is not currently
> listed as a return value of truncate, so extending the API to include
> EBUSY to mean "this file has pinned pages that can not be freed" is not
> (or should not be) totally out of the question.
>
> Admittedly, I'm coming in late to this conversation, but did I miss the
> portion where that alternative was ruled out?

Coming in late here too, but isn't the only DAX case we are concerned
about the one where there was an mmap with the O_DAX option to do direct
write, though?  If we only allow this use case then we may not have to
worry about long-term GUP, because DAX-mapped files will stay in the same
physical location regardless.

Maybe we can solve the long term GUP problem through the requirement that
user space acquires some sort of means to pin the pages? In the DAX case
this is given by the filesystem and the hardware will basically take care
of writeback.

In case of anonymous memory this can be guaranteed otherwise and is less
critical since these pages are not part of the pagecache and are not
subject to writeback.




* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 19:16         ` Christopher Lameter
@ 2019-02-06 19:40           ` Matthew Wilcox
  2019-02-06 20:16             ` Doug Ledford
  2019-02-06 20:24             ` Christopher Lameter
  2019-02-06 21:03           ` Dave Chinner
  1 sibling, 2 replies; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 19:40 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Doug Ledford, Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > Most of the cases we want revoke for are things like truncate().
> > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > doing awful things like being able to DMA to pages that are now part of
> > > a different file.
> >
> > Why is the solution revoke then?  Is there something besides truncate
> > that we have to worry about?  I ask because EBUSY is not currently
> > listed as a return value of truncate, so extending the API to include
> > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > (or should not be) totally out of the question.
> >
> > Admittedly, I'm coming in late to this conversation, but did I miss the
> > portion where that alternative was ruled out?
> 
> Coming in late here too, but isn't the only DAX case we are concerned
> about the one where there was an mmap with the O_DAX option to do direct write,

There is no O_DAX option.  There's mount -o dax, but there's nothing that
a program does to say "Use DAX".

> though?  If we only allow this use case then we may not have to worry about
> long-term GUP, because DAX-mapped files will stay in the same physical
> location regardless.

... except for truncate.  And now that I think about it, there was a
desire to support hot-unplug which also needed revoke.

> Maybe we can solve the long term GUP problem through the requirement that
> user space acquires some sort of means to pin the pages? In the DAX case
> this is given by the filesystem and the hardware will basically take care
> of writeback.

It's not given by the filesystem.

> In case of anonymous memory this can be guaranteed otherwise and is less
> critical since these pages are not part of the pagecache and are not
> subject to writeback.

but are subject to being swapped out?


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 18:52           ` Jason Gunthorpe
@ 2019-02-06 19:45             ` Dan Williams
  2019-02-06 20:14               ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-06 19:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Doug Ledford, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
	Jerome Glisse, Dave Chinner, Michal Hocko

On Wed, Feb 6, 2019 at 10:52 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:
>
> > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > portion where that alternative was ruled out?
> >
> > That's my preferred option too, but the preponderance of opinion leans
> > towards "We can't give people a way to make files un-truncatable".
>
> I haven't heard an explanation why blocking ftruncate is worse than
> giving people a way to break an RDMA-using process by calling ftruncate??
>
> Isn't it exactly the same argument the other way?

No, I don't think it is. The lease is there to set the expectation of
getting out of the way; it's not a silent, uncoordinated failure. The
user asked for it, and the kernel is just honoring a valid request. If the
RDMA application doesn't want it to happen, arrange for it by
permissions or other coordination to prevent truncation, but once the
two conflicting / valid requests have arrived at the filesystem, try to
move the result forward to the user-requested state rather than block
and fail indefinitely.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 19:45             ` Dan Williams
@ 2019-02-06 20:14               ` Doug Ledford
  2019-02-06 21:04                 ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:14 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dave Chinner, Michal Hocko


On Wed, 2019-02-06 at 11:45 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 10:52 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:
> > 
> > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > portion where that alternative was ruled out?
> > > 
> > > That's my preferred option too, but the preponderance of opinion leans
> > > towards "We can't give people a way to make files un-truncatable".
> > 
> > I haven't heard an explanation why blocking ftruncate is worse than
> > giving people a way to break an RDMA-using process by calling ftruncate??
> > 
> > Isn't it exactly the same argument the other way?
> 
> 
> If the
> RDMA application doesn't want it to happen, arrange for it by
> permissions or other coordination to prevent truncation,

I just argued the *exact* same thing, except from the other side: if you
want a guaranteed ability to truncate, then arrange the perms so the
RDMA or DAX capable things can't use the file.

>  but once the
> two conflicting / valid requests have arrived at the filesystem, try to
> move the result forward to the user-requested state rather than block
> and fail indefinitely.

Except this is wrong.  We already have ETXTBSY, and arguably it is much
easier for ETXTBSY to simply kill all of the running processes with
extreme prejudice.  But we don't do that.  We block indefinitely.  So,
no, there is no expectation that things will "move forward to the user
requested state".  When pages are in use by the kernel, and pages being
used for direct I/O very arguably are, truncate blocks.

There is a major case of cognitive dissonance here if the syscall
supports ETXTBSY, even though killing the apps using the text pages
would be trivial, yet supporting EBUSY is considered out of the
question.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 19:40           ` Matthew Wilcox
@ 2019-02-06 20:16             ` Doug Ledford
  2019-02-06 20:20               ` Matthew Wilcox
  2019-02-06 20:24             ` Christopher Lameter
  1 sibling, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:16 UTC (permalink / raw)
  To: Matthew Wilcox, Christopher Lameter
  Cc: Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > 
> > though? If we only allow this use case then we may not have to worry about
> > long term GUP because DAX mapped files will stay in the physical location
> > regardless.
> 
> ... except for truncate.  And now that I think about it, there was a
> desire to support hot-unplug which also needed revoke.

We already support hot unplug of RDMA devices.  But it is extreme.  How
does hot unplug deal with a program running from the device (something
that would have returned ETXTBSY)?

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:16             ` Doug Ledford
@ 2019-02-06 20:20               ` Matthew Wilcox
  2019-02-06 20:28                 ` Doug Ledford
                                   ` (3 more replies)
  0 siblings, 4 replies; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 20:20 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > 
> > > though? If we only allow this use case then we may not have to worry about
> > > long term GUP because DAX mapped files will stay in the physical location
> > > regardless.
> > 
> > ... except for truncate.  And now that I think about it, there was a
> > desire to support hot-unplug which also needed revoke.
> 
> We already support hot unplug of RDMA devices.  But it is extreme.  How
> does hot unplug deal with a program running from the device (something
> that would have returned ETXTBSY)?

Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.

It's straightforward to migrate text pages from one DIMM to another;
you remove the PTEs from the CPU's page tables, copy the data over and
pagefaults put the new PTEs in place.  We don't have a way to do similar
things to an RDMA device, do we?



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 19:40           ` Matthew Wilcox
  2019-02-06 20:16             ` Doug Ledford
@ 2019-02-06 20:24             ` Christopher Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-06 20:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Doug Ledford, Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, 6 Feb 2019, Matthew Wilcox wrote:

> >
> > Coming in late here too but isn't the only DAX case that we are concerned
> > about where there was an mmap with the O_DAX option to do direct write
>
> There is no O_DAX option.  There's mount -o dax, but there's nothing that
> a program does to say "Use DAX".

Hmmm... I thought that a file handle must have a special open mode to
actually do a DAX map. Looks like that is not the case.

> > though? If we only allow this use case then we may not have to worry about
> > long term GUP because DAX mapped files will stay in the physical location
> > regardless.
>
> ... except for truncate.  And now that I think about it, there was a
> desire to support hot-unplug which also needed revoke.

Well but that requires that the application unmaps the file.

> > Maybe we can solve the long term GUP problem through the requirement that
> > user space acquires some sort of means to pin the pages? In the DAX case
> > this is given by the filesystem and the hardware will basically take care
> > of writeback.
>
> It's not given by the filesystem.

DAX provides a mapping to physical persistent memory that
does not go away. Or it's a block device.

>
> > In case of anonymous memory this can be guaranteed otherwise and is less
> > critical since these pages are not part of the pagecache and are not
> > subject to writeback.
>
> but are subject to being swapped out?

Well, that is controlled by mlock and could also involve other means, like
disabling swap.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:20               ` Matthew Wilcox
@ 2019-02-06 20:28                 ` Doug Ledford
  2019-02-06 20:41                   ` Matthew Wilcox
  2019-02-06 20:31                 ` Jason Gunthorpe
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1754 bytes --]

On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > though? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > > 
> > > ... except for truncate.  And now that I think about it, there was a
> > > desire to support hot-unplug which also needed revoke.
> > 
> > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > does hot unplug deal with a program running from the device (something
> > that would have returned ETXTBSY)?
> 
> Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> 
> It's straightforward to migrate text pages from one DIMM to another;
> you remove the PTEs from the CPU's page tables, copy the data over and
> pagefaults put the new PTEs in place.  We don't have a way to do similar
> things to an RDMA device, do we?

We don't have a means of migration except in the narrowly scoped sense
of queue pair migration as defined by the IBTA and implemented on some
dual-port IB cards.  Even this narrowly scoped migration still involves
notifying the app.

Since there's no guarantee that any other port can connect to the same
machine as any port that's going away, it would always be a
disconnect/reconnect sequence in the app to support this, not an under
the covers migration.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:20               ` Matthew Wilcox
  2019-02-06 20:28                 ` Doug Ledford
@ 2019-02-06 20:31                 ` Jason Gunthorpe
  2019-02-06 20:39                 ` Christopher Lameter
  2019-02-06 20:54                 ` Doug Ledford
  3 siblings, 0 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 20:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Doug Ledford, Christopher Lameter, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 12:20:21PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > 
> > > > though? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > > 
> > > ... except for truncate.  And now that I think about it, there was a
> > > desire to support hot-unplug which also needed revoke.
> > 
> > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > does hot unplug deal with a program running from the device (something
> > that would have returned ETXTBSY)?
> 
> Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> 
> It's straightforward to migrate text pages from one DIMM to another;
> you remove the PTEs from the CPU's page tables, copy the data over and
> pagefaults put the new PTEs in place.  We don't have a way to do similar
> things to an RDMA device, do we?

I've long said it is reasonable to have an emergency hard revoke for
exceptional error cases - like dis-orderly hot unplug and so forth.

However, IMHO an orderly migration should rely on user space to
coordinate the migration and the application usage, including some
user-space-driven scheme to assure forward progress.

.. and you are kind of touching on my fear here. Revoke started out as
being only for ftruncate. Now we need it for data migration - how soon
before someone wants to do revoke just to re-balance
usage/bandwidth/etc. between NVDIMMs?

That is now way outside of what an RDMA-using system can reasonably
tolerate. How would a system designer prevent this?

Again, nobody is going to want a design where RDMA applications are
under some undefined threat of SIGKILL - which is where any lease
revoke idea is going. :( 

The priority of systems using RDMA is almost always to keep the RDMA
working right, as it is often the key service the box is providing.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:20               ` Matthew Wilcox
  2019-02-06 20:28                 ` Doug Ledford
  2019-02-06 20:31                 ` Jason Gunthorpe
@ 2019-02-06 20:39                 ` Christopher Lameter
  2019-02-06 20:54                 ` Doug Ledford
  3 siblings, 0 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-06 20:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Doug Ledford, Jason Gunthorpe, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Dave Chinner, Michal Hocko

On Wed, 6 Feb 2019, Matthew Wilcox wrote:

> It's straightforward to migrate text pages from one DIMM to another;
> you remove the PTEs from the CPU's page tables, copy the data over and
> pagefaults put the new PTEs in place.  We don't have a way to do similar
> things to an RDMA device, do we?

We have MMU notifier callbacks that can tell the device to release the
mappings. And an RDMA device may operate in ODP mode which is on demand
paging. With that data may be migrated as usual.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:28                 ` Doug Ledford
@ 2019-02-06 20:41                   ` Matthew Wilcox
  2019-02-06 20:47                     ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 20:41 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 03:28:35PM -0500, Doug Ledford wrote:
> On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> > On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > > 
> > > > ... except for truncate.  And now that I think about it, there was a
> > > > desire to support hot-unplug which also needed revoke.
> > > 
> > > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > > does hot unplug deal with a program running from the device (something
> > > that would have returned ETXTBSY)?
> > 
> > Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> > 
> > It's straightforward to migrate text pages from one DIMM to another;
> > you remove the PTEs from the CPU's page tables, copy the data over and
> > pagefaults put the new PTEs in place.  We don't have a way to do similar
> > things to an RDMA device, do we?
> 
> We don't have a means of migration except in the narrowly scoped sense
> of queue pair migration as defined by the IBTA and implemented on some
> dual port IB cards.  This narrowly scoped migration even still involves
> notification of the app.
> 
> Since there's no guarantee that any other port can connect to the same
> machine as any port that's going away, it would always be a
> disconnect/reconnect sequence in the app to support this, not an under
> the covers migration.

I don't understand you.  We're not talking about migrating from one IB
card to another, we're talking about changing the addresses that an STag
refers to.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:41                   ` Matthew Wilcox
@ 2019-02-06 20:47                     ` Doug Ledford
  2019-02-06 20:49                       ` Matthew Wilcox
  0 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2788 bytes --]

On Wed, 2019-02-06 at 12:41 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:28:35PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > > > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > > though? If we only allow this use case then we may not have to worry about
> > > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > > regardless.
> > > > > 
> > > > > ... except for truncate.  And now that I think about it, there was a
> > > > > desire to support hot-unplug which also needed revoke.
> > > > 
> > > > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > > > does hot unplug deal with a program running from the device (something
> > > > that would have returned ETXTBSY)?
> > > 
> > > Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> > > 
> > > It's straightforward to migrate text pages from one DIMM to another;
> > > you remove the PTEs from the CPU's page tables, copy the data over and
> > > pagefaults put the new PTEs in place.  We don't have a way to do similar
> > > things to an RDMA device, do we?
> > 
> > We don't have a means of migration except in the narrowly scoped sense
> > of queue pair migration as defined by the IBTA and implemented on some
> > dual port IB cards.  This narrowly scoped migration even still involves
> > notification of the app.
> > 
> > Since there's no guarantee that any other port can connect to the same
> > machine as any port that's going away, it would always be a
> > disconnect/reconnect sequence in the app to support this, not an under
> > the covers migration.
> 
> I don't understand you.  We're not talking about migrating from one IB
> card to another, we're talking about changing the addresses that an STag
> refers to.

You said "now that I think about it, there was a desire to support hot-
unplug which also needed revoke".  For us, hot unplug is done at the
device level and means all connections must be torn down.  So in the
context of this argument, if people want revoke so DAX can migrate from
one NV-DIMM to another, ok.  But revoke does not help RDMA migrate.

If, instead, you mean that you want to support hot unplug of an NV-DIMM
that is currently the target of RDMA transfers, then I believe
Christoph's answer on this is correct.  It all boils down to which
device you are talking about doing the hot unplug on.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:47                     ` Doug Ledford
@ 2019-02-06 20:49                       ` Matthew Wilcox
  2019-02-06 20:50                         ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-06 20:49 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

On Wed, Feb 06, 2019 at 03:47:53PM -0500, Doug Ledford wrote:
> On Wed, 2019-02-06 at 12:41 -0800, Matthew Wilcox wrote:
> > On Wed, Feb 06, 2019 at 03:28:35PM -0500, Doug Ledford wrote:
> > > On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> > > > Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.

^^^ I think you missed this line ^^^

> You said "now that I think about it, there was a desire to support hot-
> unplug which also needed revoke".  For us, hot unplug is done at the
> device level and means all connections must be torn down.  So in the
> context of this argument, if people want revoke so DAX can migrate from
> one NV-DIMM to another, ok.  But revoke does not help RDMA migrate.
> 
> If, instead, you mean that you want to support hot unplug of an NV-DIMM
> that is currently the target of RDMA transfers, then I believe
> Christoph's answer on this is correct.  It all boils down to which
> device you are talking about doing the hot unplug on.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:49                       ` Matthew Wilcox
@ 2019-02-06 20:50                         ` Doug Ledford
  0 siblings, 0 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1275 bytes --]

On Wed, 2019-02-06 at 12:49 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:47:53PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 12:41 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 03:28:35PM -0500, Doug Ledford wrote:
> > > > On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> > > > > Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> 
> ^^^ I think you missed this line ^^^

Indeed, I did ;-)

> 
> > You said "now that I think about it, there was a desire to support hot-
> > unplug which also needed revoke".  For us, hot unplug is done at the
> > device level and means all connections must be torn down.  So in the
> > context of this argument, if people want revoke so DAX can migrate from
> > one NV-DIMM to another, ok.  But revoke does not help RDMA migrate.
> > 
> > If, instead, you mean that you want to support hot unplug of an NV-DIMM
> > that is currently the target of RDMA transfers, then I believe
> > Christoph's answer on this is correct.  It all boils down to which
> > device you are talking about doing the hot unplug on.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:20               ` Matthew Wilcox
                                   ` (2 preceding siblings ...)
  2019-02-06 20:39                 ` Christopher Lameter
@ 2019-02-06 20:54                 ` Doug Ledford
  2019-02-07 16:48                   ` Jan Kara
  3 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 20:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christopher Lameter, Jason Gunthorpe, Jan Kara, Ira Weiny,
	lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jerome Glisse, Dan Williams, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1066 bytes --]

On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > though? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > > 
> > > ... except for truncate.  And now that I think about it, there was a
> > > desire to support hot-unplug which also needed revoke.
> > 
> > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > does hot unplug deal with a program running from the device (something
> > that would have returned ETXTBSY)?
> 
> Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.

Is an NV-DIMM the only thing we use DAX on?


-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 19:16         ` Christopher Lameter
  2019-02-06 19:40           ` Matthew Wilcox
@ 2019-02-06 21:03           ` Dave Chinner
  2019-02-06 22:08             ` Jason Gunthorpe
  1 sibling, 1 reply; 106+ messages in thread
From: Dave Chinner @ 2019-02-06 21:03 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Doug Ledford, Matthew Wilcox, Jason Gunthorpe, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Michal Hocko

On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> On Wed, 6 Feb 2019, Doug Ledford wrote:
> 
> > > Most of the cases we want revoke for are things like truncate().
> > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > doing awful things like being able to DMA to pages that are now part of
> > > a different file.
> >
> > Why is the solution revoke then?  Is there something besides truncate
> > that we have to worry about?  I ask because EBUSY is not currently
> > listed as a return value of truncate, so extending the API to include
> > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > (or should not be) totally out of the question.
> >
> > Admittedly, I'm coming in late to this conversation, but did I miss the
> > portion where that alternative was ruled out?
> 
Coming in late here too but isn't the only DAX case that we are concerned
about where there was an mmap with the O_DAX option to do direct write
though? If we only allow this use case then we may not have to worry about
> long term GUP because DAX mapped files will stay in the physical location
> regardless.

No, that is not guaranteed. As soon as we have reflink support on XFS,
writes will physically move the data to a new physical location.
This is non-negotiable, and cannot be blocked forever by a gup
pin.

IOWs, DAX on RDMA requires a) page fault capable hardware so that
the filesystem can move data physically on write access, and b)
revokable file leases so that the filesystem can kick userspace out
of the way when it needs to.

Truncate is a red herring. It's definitely a case for revokable
leases, but it's the rare case rather than the one we actually care
about. We really care about making copy-on-write capable filesystems like
XFS work with DAX (we've got people asking for it to be supported
yesterday!), and that means DAX+RDMA needs to work with storage that
can change physical location at any time.

> Maybe we can solve the long term GUP problem through the requirement that
> user space acquires some sort of means to pin the pages? In the DAX case
> this is given by the filesystem and the hardware will basically take care
> of writeback.

That's what the revokable file leases provide (it's basically the
same thing as a NFSv4 delegation). We already have all the
infrastructure in the filesystems for triggering revokes when
needed (implemented for pNFS a few years ago), and DAX already
piggy-backs on that infrastructure to wait
on gup pinned pages. See dax_layout_busy_page() and BREAK_UNMAP.

The problem is that dax_layout_busy_page can block forever when
userspace pins the file for RDMA. It's not just truncate - it's any
filesystem operation that needs to manipulate the underlying file
layout without doing data IO. i.e. any fallocate() operation, and
when we add reflink support it will include anything that
shares or de-shares extents between files, too.

The revokable file leases are necessary because access to file data,
internal metadata and the storage is arbitrated by the filesystem,
not the mm/ subsystem and physical pages. i.e. FS-DAX means that the
*filesystem* is managing access to physical pages, not the mm/
subsystem. And we can't just ignore the filesystem in this case
because allowing access to the physical storage outside of the
filesystem's visibility and/or direct control is a potential
security vulnerability, data corruption or filesystem corruption
vector.

And that's the real problem we need to solve here. RDMA has no trust
model other than "I'm userspace, I pinned you, trust me!". That's
not good enough for FS-DAX+RDMA....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:14               ` Doug Ledford
@ 2019-02-06 21:04                 ` Dan Williams
  2019-02-06 21:12                   ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-06 21:04 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
	Jerome Glisse, Dave Chinner, Michal Hocko

On Wed, Feb 6, 2019 at 12:14 PM Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2019-02-06 at 11:45 -0800, Dan Williams wrote:
> > On Wed, Feb 6, 2019 at 10:52 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:
> > >
> > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > portion where that alternative was ruled out?
> > > >
> > > > That's my preferred option too, but the preponderance of opinion leans
> > > > towards "We can't give people a way to make files un-truncatable".
> > >
> > > I haven't heard an explanation why blocking ftruncate is worse than
> > > giving people a way to break an RDMA-using process by calling ftruncate??
> > >
> > > Isn't it exactly the same argument the other way?
> >
> >
> > If the
> > RDMA application doesn't want it to happen, arrange for it by
> > permissions or other coordination to prevent truncation,
>
> I just argued the *exact* same thing, except from the other side: if you
> want a guaranteed ability to truncate, then arrange the perms so the
> RDMA or DAX capable things can't use the file.

That doesn't make sense. All we have to work with is rwx bits. It's
possible to prevent writes / truncates. There's no permission bit for
mmap, O_DIRECT and RDMA mappings, hence leases.

> >  but once the
> > two conflicting / valid requests have arrived at the filesystem try to
> > move the result forward to the user requested state not block and fail
> > indefinitely.
>
> Except this is wrong.  We already have ETXTBSY, and arguably it is much
> easier for ETXTBSY to simply kill all of the running processes with
> extreme prejudice.  But we don't do that.  We block indefinitely.  So,
> no, there is no expectation that things will "move forward to the user
> requested state".  Not when pages are in use by the kernel, and very
> arguably pages being used for direct I/O are absolutely in use by the
> kernel, then truncate blocks.
>
> There is a major case of dissonant cognitive behavior here if the
> syscall supports ETXTBSY, even though the ability to kill apps using the
> text pages is trivial, but thinks supporting EBUSY is out of the
> question.

It's introducing a new failure mode where one did not exist before.
It's especially problematic when the only difference between the case
when it fails and one where it doesn't comes down to the
idiosyncrasies of DAX mappings and whether or not the RDMA device has
capabilities like ODP.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 21:04                 ` Dan Williams
@ 2019-02-06 21:12                   ` Doug Ledford
  0 siblings, 0 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 21:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
	Jerome Glisse, Dave Chinner, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1865 bytes --]

On Wed, 2019-02-06 at 13:04 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 12:14 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 11:45 -0800, Dan Williams wrote:
> > > On Wed, Feb 6, 2019 at 10:52 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Wed, Feb 06, 2019 at 10:35:04AM -0800, Matthew Wilcox wrote:
> > > > 
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > > 
> > > > > That's my preferred option too, but the preponderance of opinion leans
> > > > > towards "We can't give people a way to make files un-truncatable".
> > > > 
> > > > I haven't heard an explanation why blocking ftruncate is worse than
> giving people a way to break an RDMA-using process by calling ftruncate?
> > > > 
> > > > Isn't it exactly the same argument the other way?
> > > 
> > > If the
> > > RDMA application doesn't want it to happen, arrange for it by
> > > permissions or other coordination to prevent truncation,
> > 
> > I just argued the *exact* same thing, except from the other side: if you
> > want a guaranteed ability to truncate, then arrange the perms so the
> > RDMA or DAX capable things can't use the file.
> 
> That doesn't make sense. All we have to work with is rwx bits. It's
> possible to prevent writes / truncates. There's no permission bit for
> mmap, O_DIRECT and RDMA mappings, hence leases.

There's ownership.  What you can't open, you can't mmap or O_DIRECT or
whatever...

Regardless, though, this is mostly moot, as Dave's email makes it clear
that the underlying problem is not ftruncate but other things.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-05 18:01 ` Ira Weiny
@ 2019-02-06 21:31   ` Dave Chinner
  0 siblings, 0 replies; 106+ messages in thread
From: Dave Chinner @ 2019-02-06 21:31 UTC (permalink / raw)
  To: Ira Weiny
  Cc: lsf-pc, linux-rdma, linux-mm, linux-kernel, John Hubbard,
	Jan Kara, Jerome Glisse, Dan Williams, Matthew Wilcox,
	Doug Ledford, Michal Hocko, Jason Gunthorpe

On Tue, Feb 05, 2019 at 10:01:20AM -0800, Ira Weiny wrote:
> I had an old invalid address for Jason Gunthorpe in my address book...  
> 
> Correcting his email in the thread.

Probably should have cc'd linux-fsdevel, too, but it's too late for
that now....

> On Tue, Feb 05, 2019 at 09:50:59AM -0800, 'Ira Weiny' wrote:
> > 
> > The problem: Once we have pages marked as GUP-pinned how should various
> > subsystems work with those markings.
> > 
> > The current work for John Hubbard's proposed solutions (parts 1 and 2) is
> > progressing.[1]  But the final part (3) of his solution is also going to take
> > some work.
> > 
> > In John's presentation he lists 4 alternatives for gup-pinned pages:
> > 
> > 1) Hold off try_to_unmap
> > 2) Allow writeback while pinned (via bounce buffers)
> > 	[Note this will not work for DAX]
> > 3) Use a "revocable reservation" (or lease) on those pages
> > 4) Pin the blocks as busy in the FS allocator
> > 
> > The problem with leases on pages used by RDMA is that the references to
> > these pages are not local to the machine.  Once users have been given access
> > to a page they can, through the use of remote tokens, hand a reference to that
> > page to remote nodes.  This is the core essence of RDMA, and like it or not,
> > something which is increasingly used by major Linux users.
> > 
> > Therefore we need to discuss the extent to which leases are appropriate and
> > what happens when a user fails to respond to a lease revocation.
> > 
> > As John Hubbard put it:
> > 
> > "Other filesystem features that need to replace the page with a new one can
> > be inhibited for pages that are GUP-pinned. This will, however, alter and
> > limit some of those filesystem features. The only fix for that would be to
> > require GUP users monitor and respond to CPU page table updates. Subsystems
> > such as ODP and HMM do this, for example. This aspect of the problem is
> > still under discussion."
> > 
> > 	-- John Hubbard[2]
> > 
> > The following people have been involved in previous conversations and would be key to
> > the face to face discussion.
> > 
> > John Hubbard
> > Jan Kara
> > Dave Chinner

Just FYI, I won't be at LSFMM.

Puerto Rico is about as physically far away from me as you can get
on this planet. There's 40 hours in transit from airport to airport,
and that doesn't include the 5 hours of travel the day before (and
hence overnight stay) to be able to get to the first airport in time
for the first flight. I'm looking at a transit time - if all goes
well - of over 60 hours just to get to the conference.

And it looks like it will be just as bad on the way back.

6 days of travel for a 2-day conference makes no sense at all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 21:03           ` Dave Chinner
@ 2019-02-06 22:08             ` Jason Gunthorpe
  2019-02-06 22:24               ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 22:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christopher Lameter, Doug Ledford, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Michal Hocko

On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > 
> > > > Most of the cases we want revoke for are things like truncate().
> > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > doing awful things like being able to DMA to pages that are now part of
> > > > a different file.
> > >
> > > Why is the solution revoke then?  Is there something besides truncate
> > > that we have to worry about?  I ask because EBUSY is not currently
> > > listed as a return value of truncate, so extending the API to include
> > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > (or should not be) totally out of the question.
> > >
> > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > portion where that alternative was ruled out?
> > 
> > Coming in late here too, but isn't the only DAX case that we are concerned
> > about where there was an mmap with the O_DAX option to do direct write
> > though? If we only allow this use case then we may not have to worry about
> > long term GUP because DAX mapped files will stay in the physical location
> > regardless.
> 
> No, that is not guaranteed. Soon as we have reflink support on XFS,
> writes will physically move the data to a new physical location.
> This is non-negotiable, and cannot be blocked forever by a gup
> pin.
> 
> IOWs, DAX on RDMA requires a) page fault capable hardware so that
> the filesystem can move data physically on write access, and b)
> revocable file leases so that the filesystem can kick userspace out
> of the way when it needs to.

Why do we need both? You want to have leases for normal CPU mmaps too?

> Truncate is a red herring. It's definitely a case for revocable
> leases, but it's the rare case rather than the one we actually care
> about. We really care about making copy-on-write capable filesystems like
> XFS work with DAX (we've got people asking for it to be supported
> yesterday!), and that means DAX+RDMA needs to work with storage that
> can change physical location at any time.

Then we must continue to ban longterm pin with DAX..

Nobody is going to want to deploy a system where revoke can happen at
any time and if you don't respond fast enough your system either locks
with some kind of FS meltdown or your process gets SIGKILL. 

I don't really see a reason to invest so much design work into
something that isn't production worthy.

It *almost* made sense with ftruncate, because you could architect to
avoid ftruncate.. But just any FS op might reallocate? Naw.

Dave, you said the FS is responsible to arbitrate access to the
physical pages..

Is it possible to have a filesystem for DAX that is more suited to
this environment? Ie designed to not require block reallocation (no
COW, no reflinks, different approach to ftruncate, etc)

> And that's the real problem we need to solve here. RDMA has no trust
> model other than "I'm userspace, I pinned you, trust me!". That's
> not good enough for FS-DAX+RDMA....

It is baked into the silicon, and I don't see much motion on this
front right now. My best hope is that IOMMU PASID will get widely
deployed and RDMA silicon will arrive that can use it. Seems to be
years away, if at all.

At least we have one chip design that can work in a page faulting mode
..

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:08             ` Jason Gunthorpe
@ 2019-02-06 22:24               ` Doug Ledford
  2019-02-06 22:44                 ` Dan Williams
  2019-02-07  3:52                 ` Dave Chinner
  0 siblings, 2 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-06 22:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Dave Chinner
  Cc: Christopher Lameter, Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc,
	linux-rdma, linux-mm, linux-kernel, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 4280 bytes --]

On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > 
> > > > > Most of the cases we want revoke for are things like truncate().
> > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > a different file.
> > > > 
> > > > Why is the solution revoke then?  Is there something besides truncate
> > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > listed as a return value of truncate, so extending the API to include
> > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > (or should not be) totally out of the question.
> > > > 
> > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > portion where that alternative was ruled out?
> > > 
> > > Coming in late here too, but isn't the only DAX case that we are concerned
> > > about where there was an mmap with the O_DAX option to do direct write
> > > though? If we only allow this use case then we may not have to worry about
> > > long term GUP because DAX mapped files will stay in the physical location
> > > regardless.
> > 
> > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > writes will physically move the data to a new physical location.
> > This is non-negotiable, and cannot be blocked forever by a gup
> > pin.
> > 
> > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > the filesystem can move data physically on write access, and b)
> > revocable file leases so that the filesystem can kick userspace out
> > of the way when it needs to.
> 
> Why do we need both? You want to have leases for normal CPU mmaps too?
> 
> > Truncate is a red herring. It's definitely a case for revocable
> > leases, but it's the rare case rather than the one we actually care
> > about. We really care about making copy-on-write capable filesystems like
> > XFS work with DAX (we've got people asking for it to be supported
> > yesterday!), and that means DAX+RDMA needs to work with storage that
> > can change physical location at any time.
> 
> Then we must continue to ban longterm pin with DAX..
> 
> Nobody is going to want to deploy a system where revoke can happen at
> any time and if you don't respond fast enough your system either locks
> with some kind of FS meltdown or your process gets SIGKILL. 
> 
> I don't really see a reason to invest so much design work into
> something that isn't production worthy.
> 
> It *almost* made sense with ftruncate, because you could architect to
> avoid ftruncate.. But just any FS op might reallocate? Naw.
> 
> Dave, you said the FS is responsible to arbitrate access to the
> physical pages..
> 
> Is it possible to have a filesystem for DAX that is more suited to
> this environment? Ie designed to not require block reallocation (no
> COW, no reflinks, different approach to ftruncate, etc)

Can someone give me a real world scenario that someone is *actually*
asking for with this?  Are DAX users demanding xfs, or is it just the
filesystem of convenience?  Do they need to stick with xfs?  Are they
really trying to do COW backed mappings for the RDMA targets?  Or do
they want a COW backed FS but are perfectly happy if the specific RDMA
targets are *not* COW and are statically allocated?

> > And that's the real problem we need to solve here. RDMA has no trust
> > model other than "I'm userspace, I pinned you, trust me!". That's
> > not good enough for FS-DAX+RDMA....
> 
> It is baked into the silicon, and I don't see much motion on this
> front right now. My best hope is that IOMMU PASID will get widely
> deployed and RDMA silicon will arrive that can use it. Seems to be
> years away, if at all.
> 
> At least we have one chip design that can work in a page faulting mode
> ..
> 
> Jason

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:24               ` Doug Ledford
@ 2019-02-06 22:44                 ` Dan Williams
  2019-02-06 23:21                   ` Jason Gunthorpe
                                     ` (3 more replies)
  2019-02-07  3:52                 ` Dave Chinner
  1 sibling, 4 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-06 22:44 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > >
> > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > a different file.
> > > > >
> > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > listed as a return value of truncate, so extending the API to include
> > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > (or should not be) totally out of the question.
> > > > >
> > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > portion where that alternative was ruled out?
> > > >
> > > > Coming in late here too, but isn't the only DAX case that we are concerned
> > > > about where there was an mmap with the O_DAX option to do direct write
> > > > though? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > >
> > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > writes will physically move the data to a new physical location.
> > > This is non-negotiable, and cannot be blocked forever by a gup
> > > pin.
> > >
> > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > the filesystem can move data physically on write access, and b)
> > > revocable file leases so that the filesystem can kick userspace out
> > > of the way when it needs to.
> >
> > Why do we need both? You want to have leases for normal CPU mmaps too?
> >
> > > Truncate is a red herring. It's definitely a case for revocable
> > > leases, but it's the rare case rather than the one we actually care
> > > about. We really care about making copy-on-write capable filesystems like
> > > XFS work with DAX (we've got people asking for it to be supported
> > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > can change physical location at any time.
> >
> > Then we must continue to ban longterm pin with DAX..
> >
> > Nobody is going to want to deploy a system where revoke can happen at
> > any time and if you don't respond fast enough your system either locks
> > with some kind of FS meltdown or your process gets SIGKILL.
> >
> > I don't really see a reason to invest so much design work into
> > something that isn't production worthy.
> >
> > It *almost* made sense with ftruncate, because you could architect to
> > avoid ftruncate.. But just any FS op might reallocate? Naw.
> >
> > Dave, you said the FS is responsible to arbitrate access to the
> > physical pages..
> >
> > Is it possible to have a filesystem for DAX that is more suited to
> > this environment? Ie designed to not require block reallocation (no
> > COW, no reflinks, different approach to ftruncate, etc)
>
> Can someone give me a real world scenario that someone is *actually*
> asking for with this?

I'll point to this example. At the 6:35 mark Kodi talks about the
Oracle use case for DAX + RDMA.

https://youtu.be/ywKPPIE8JfQ?t=395

Currently the only way to get this to work is to use ODP-capable
hardware, or Device-DAX. Device-DAX is a facility to map persistent
memory statically through a device file. It's great for statically
allocated use cases, but loses all the nice things (provisioning,
permissions, naming) that a filesystem gives you. This debate is about
what to do for non-ODP-capable hardware with the Filesystem-DAX
facility. The current answer is "no RDMA for you".

> Are DAX users demanding xfs, or is it just the
> filesystem of convenience?

xfs is the only Linux filesystem that supports DAX and reflink.

> Do they need to stick with xfs?

Can you clarify the motivation for that question? This problem exists
for any filesystem that implements an mmap where the physical
page backing the mapping is identical to the physical storage location
for the file data. I don't see it as an xfs specific problem. Rather,
xfs is taking the lead in this space because it has already deployed
and demonstrated that leases work for the pnfs4 block-server case, so
it seems logical to attempt to extend that case for non-ODP-RDMA.

> Are they
> really trying to do COW backed mappings for the RDMA targets?  Or do
> they want a COW backed FS but are perfectly happy if the specific RDMA
> targets are *not* COW and are statically allocated?

I would expect the COW to be broken at registration time. Only ODP
could possibly support reflink + RDMA. So I think this devolves the
problem back to just the "what to do about truncate/punch-hole"
problem in the specific case of non-ODP hardware combined with the
Filesystem-DAX facility.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:44                 ` Dan Williams
@ 2019-02-06 23:21                   ` Jason Gunthorpe
  2019-02-06 23:30                     ` Dan Williams
  2019-02-07  1:57                   ` Doug Ledford
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 23:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:

> > Do they need to stick with xfs?
> 
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. 

.. and needs to dynamically change that mapping. Which is not really
something inherent to the general idea of a filesystem. A file system
that had *strictly static* block assignments would work fine.

Not all filesystems even implement hole punch.

Not all filesystems implement reflink.

ftruncate doesn't *have* to instantly return the free blocks to the
allocation pool.

ie this is not a DAX & RDMA issue but an XFS & RDMA issue.

Replacing XFS is probably not reasonable, but I wonder if an XFS--
operating mode could exist that had enough features removed to be
safe? 

Ie turn off REFLINK. Change the semantics of ftruncate to be more like
ETXTBSY. Turn off hole punch.

> > Are they really trying to do COW backed mappings for the RDMA
> > targets?  Or do they want a COW backed FS but are perfectly happy
> > if the specific RDMA targets are *not* COW and are statically
> > allocated?
> 
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.

Usually the problem with COW is that you make a READ RDMA MR on a
COW'd file, and some other thread breaks the COW..

This probably becomes a problem if the same process that has the MR
triggers a COW break (ie by writing to the CPU mmap). This would cause
the page to be reassigned but the MR would not be updated, which is
not what the app expects.

WRITE is simpler, once the COW is broken during GUP, the pages cannot
be COW'd again until the DMA pin is released. So new reflinks would be
blocked during the DMA pin period.

To fix READ you'd have to treat it like WRITE and break the COW at GUP.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 23:21                   ` Jason Gunthorpe
@ 2019-02-06 23:30                     ` Dan Williams
  2019-02-06 23:41                       ` Jason Gunthorpe
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-06 23:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
>
> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question? This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data.
>
> .. and needs to dynamically change that mapping. Which is not really
> something inherent to the general idea of a filesystem. A file system
> that had *strictly static* block assignments would work fine.
>
> Not all filesystems even implement hole punch.
>
> Not all filesystems implement reflink.
>
> ftruncate doesn't *have* to instantly return the free blocks to the
> allocation pool.
>
> ie this is not a DAX & RDMA issue but an XFS & RDMA issue.
>
> Replacing XFS is probably not reasonable, but I wonder if an XFS--
> operating mode could exist that had enough features removed to be
> safe?

You're describing the current situation, i.e. Linux already implements
this, it's called Device-DAX and some users of RDMA find it
insufficient. The choices are to continue to tell them "no", or say
"yes, but you need to submit to lease coordination".

> Ie turn off REFLINK. Change the semantics of ftruncate to be more like
> ETXTBSY. Turn off hole punch.
>
> > > Are they really trying to do COW backed mappings for the RDMA
> > > targets?  Or do they want a COW backed FS but are perfectly happy
> > > if the specific RDMA targets are *not* COW and are statically
> > > allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> Usually the problem with COW is that you make a READ RDMA MR on a
> COW'd file, and some other thread breaks the COW..
>
> This probably becomes a problem if the same process that has the MR
> triggers a COW break (ie by writing to the CPU mmap). This would cause
> the page to be reassigned but the MR would not be updated, which is
> not what the app expects.
>
> WRITE is simpler, once the COW is broken during GUP, the pages cannot
> be COW'd again until the DMA pin is released. So new reflinks would be
> blocked during the DMA pin period.
>
> To fix READ you'd have to treat it like WRITE and break the COW at GUP.

Right, that's what I'm proposing: that any longterm-GUP break COW as if
it were a write.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 23:30                     ` Dan Williams
@ 2019-02-06 23:41                       ` Jason Gunthorpe
  2019-02-07  0:22                         ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-06 23:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 06, 2019 at 03:30:27PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
> >
> > > > Do they need to stick with xfs?
> > >
> > > Can you clarify the motivation for that question? This problem exists
> > > for any filesystem that implements an mmap where the physical
> > > page backing the mapping is identical to the physical storage location
> > > for the file data.
> >
> > .. and needs to dynamically change that mapping. Which is not really
> > something inherent to the general idea of a filesystem. A file system
> > that had *strictly static* block assignments would work fine.
> >
> > Not all filesystems even implement hole punch.
> >
> > Not all filesystems implement reflink.
> >
> > ftruncate doesn't *have* to instantly return the free blocks to the
> > allocation pool.
> >
> > ie this is not a DAX & RDMA issue but an XFS & RDMA issue.
> >
> > Replacing XFS is probably not reasonable, but I wonder if an XFS--
> > operating mode could exist that had enough features removed to be
> > safe?
> 
> You're describing the current situation, i.e. Linux already implements
> this, it's called Device-DAX and some users of RDMA find it
> insufficient. The choices are to continue to tell them "no", or say
> "yes, but you need to submit to lease coordination".

Device-DAX is not what I'm imagining when I say XFS--.

I mean more like XFS with all features that require reallocation of
blocks disabled.

Forbidding hole punch, reflink, COW, etc. doesn't devolve back to
device-dax.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 23:41                       ` Jason Gunthorpe
@ 2019-02-07  0:22                         ` Dan Williams
  2019-02-07  5:33                           ` Jason Gunthorpe
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-07  0:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko, linux-nvdimm

On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
[..]
> > You're describing the current situation, i.e. Linux already implements
> > this, it's called Device-DAX and some users of RDMA find it
> > insufficient. The choices are to continue to tell them "no", or say
> > "yes, but you need to submit to lease coordination".
>
> Device-DAX is not what I'm imagining when I say XFS--.
>
> I mean more like XFS with all features that require reallocation of
> blocks disabled.
>
> Forbidding hole punch, reflink, COW, etc. doesn't devolve back to
> device-dax.

True, not all the way, but the distinction loses significance as you
lose fs features.

Filesystems mark DAX functionality experimental [1] precisely because
it forbids otherwise typical operations that work in the nominal page
cache case. An approach that says "let's cement the list of things a
filesystem or a core-memory-management facility can't do because RDMA
finds it awkward" is bad precedent. It's bad precedent because it
abdicates core kernel functionality to userspace and weakens the api
contract in surprising ways.

EBUSY is a horrible status code, especially if an administrator is
presented with an emergency situation where a filesystem needs to free
up storage capacity and get established memory registrations out of
the way. The motivation for the current status quo of failing memory
registration for DAX mappings is to help ensure the system does not
get into this situation where forward progress cannot be guaranteed.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:44                 ` Dan Williams
  2019-02-06 23:21                   ` Jason Gunthorpe
@ 2019-02-07  1:57                   ` Doug Ledford
  2019-02-07  2:48                     ` Dan Williams
  2019-02-07  2:42                   ` Doug Ledford
  2019-02-07 16:25                   ` Doug Ledford
  3 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-07  1:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 6255 bytes --]

On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > > 
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > > 
> > > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > > 
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > > 
> > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > > 
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > > 
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > > 
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > > 
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > > 
> > > Then we must continue to ban longterm pin with DAX..
> > > 
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > > 
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > > 
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > > 
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > > 
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> > 
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
> 
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
> 
> https://youtu.be/ywKPPIE8JfQ?t=395

Thanks for the link, I'll review the panel.

> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through device-file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is what
> to do about non-ODP capable hardware and Filesystem-DAX facility. The
> current answer is "no RDMA for you".
> 
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
> 
> xfs is the only Linux filesystem that supports DAX and reflink.

Is it going to be clear from the link above why reflink + DAX + RDMA is
a good/desirable thing?

> > Do they need to stick with xfs?
> 
> Can you clarify the motivation for that question?

I did a little googling and research before I asked that question. 
According to the documentation, other FSes can work with DAX too (namely
ext2 and ext4).  The question was more or less pondering whether or not
ext2 or ext4 + RDMA + DAX would solve people's problems without the
issues that xfs brings.

>  This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
> 
> > Are they
> > really trying to do COW backed mappings for the RDMA targets?  Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
> 
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.

If that's the case, then we are back to EBUSY *could* work (despite the
objections made so far).

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:44                 ` Dan Williams
  2019-02-06 23:21                   ` Jason Gunthorpe
  2019-02-07  1:57                   ` Doug Ledford
@ 2019-02-07  2:42                   ` Doug Ledford
  2019-02-07  3:13                     ` Dan Williams
  2019-02-07 16:25                   ` Doug Ledford
  3 siblings, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-07  2:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
> 
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
> 
> https://youtu.be/ywKPPIE8JfQ?t=395

I watched this, and I see that Oracle is all sorts of excited that their
storage machines can scale out, and they can access the storage and it
has basically no CPU load on the storage server while performing
millions of queries.  What I didn't hear in there is why DAX has to be
in the picture, or why Oracle couldn't do the same thing with a simple
memory region exported directly to the RDMA subsystem, or why reflink or
any of the other features you talk about are needed.  So, while these
things may legitimately be needed, this video did not tell me about
how/why they are needed, just that RDMA is really, *really* cool for
their use case and gets them 0% CPU utilization on their storage
servers.  I didn't watch the whole thing though.  Do they get into that
later on?  Do they get to that level of technical discussion, or is this
all higher level?

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  1:57                   ` Doug Ledford
@ 2019-02-07  2:48                     ` Dan Williams
  0 siblings, 0 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-07  2:48 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko, linux-nvdimm

On Wed, Feb 6, 2019 at 5:57 PM Doug Ledford <dledford@redhat.com> wrote:
[..]
> > > > Dave, you said the FS is responsible to arbitrate access to the
> > > > physical pages..
> > > >
> > > > Is it possible to have a filesystem for DAX that is more suited to
> > > > this environment? Ie designed to not require block reallocation (no
> > > > COW, no reflinks, different approach to ftruncate, etc)
> > >
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> Thanks for the link, I'll review the panel.
>
> > Currently the only way to get this to work is to use ODP capable
> > hardware, or Device-DAX. Device-DAX is a facility to map persistent
> > memory statically through device-file. It's great for statically
> > allocated use cases, but loses all the nice things (provisioning,
> > permissions, naming) that a filesystem gives you. This debate is what
> > to do about non-ODP capable hardware and Filesystem-DAX facility. The
> > current answer is "no RDMA for you".
> >
> > > Are DAX users demanding xfs, or is it just the
> > > filesystem of convenience?
> >
> > xfs is the only Linux filesystem that supports DAX and reflink.
>
> Is it going to be clear from the link above why reflink + DAX + RDMA is
> a good/desirable thing?
>

No, unfortunately it will only clarify the DAX + RDMA use case, but
you don't need to look very far to see that the trend for storage
management is more COW / reflink / thin-provisioning etc in more
places. Users want the flexibility to be able to delay, change, and
consolidate physical storage allocation decisions, otherwise
device-dax would have solved all these problems and we would not be
having this conversation.

> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question?
>
> I did a little googling and research before I asked that question.
> According to the documentation, other FSes can work with DAX too (namely
> ext2 and ext4).  The question was more or less pondering whether or not
> ext2 or ext4 + RDMA + DAX would solve people's problems without the
> issues that xfs brings.

No, ext4 also supports hole punch, and the ext2 support is a toy. We
went through quite a bit of work to solve this problem for the
O_DIRECT pinned page case.

6b2bb7265f0b sched/wait: Introduce wait_var_event()
d6dc57e251a4 xfs, dax: introduce xfs_break_dax_layouts()
69eb5fa10eb2 xfs: prepare xfs_break_layouts() for another layout type
c63a8eae63d3 xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
5fac7408d828 mm, fs, dax: handle layout changes to pinned dax mappings
b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
430657b6be89 ext4: handle layout changes to pinned DAX mappings
cdbf8897cb09 dax: dax_layout_busy_page() warn on !exceptional

So the fs is prepared to notify RDMA applications of the need to
evacuate a mapping (layout change), and the timeout to respond to that
notification can be configured by the administrator. The debate is
about what to do when the platform owner needs to get a mapping out of
the way in bounded time.
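To make that notification flow concrete, here is a minimal userspace sketch using the existing fcntl() lease API as a stand-in; the function names are invented for illustration, and this is the plain file lease, not the RDMA-aware layout-lease semantics under discussion:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken;

static void on_lease_break(int signo)
{
	(void)signo;
	lease_broken = 1;	/* a real app would deregister its MRs here */
}

/*
 * Take a read lease on fd.  The kernel delivers SIGIO when another
 * task needs the lease broken (truncate, hole punch, ...) and then
 * waits up to /proc/sys/fs/lease-break-time seconds for the holder
 * to release it before breaking it unilaterally.
 */
int take_read_lease(int fd)
{
	struct sigaction sa = { .sa_handler = on_lease_break };

	if (sigaction(SIGIO, &sa, NULL) < 0)
		return -1;
	return fcntl(fd, F_SETLEASE, F_RDLCK);
}

int release_lease(int fd)
{
	return fcntl(fd, F_SETLEASE, F_UNLCK);
}
```

The open question in this thread is what the application does between
the SIGIO and the release when the pinned pages are the target of an
active remote DMA.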

> >  This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data. I don't see it as an xfs specific problem. Rather,
> > xfs is taking the lead in this space because it has already deployed
> > and demonstrated that leases work for the pnfs4 block-server case, so
> > it seems logical to attempt to extend that case for non-ODP-RDMA.
> >
> > > Are they
> > > really trying to do COW backed mappings for the RDMA targets?  Or do
> > > they want a COW backed FS but are perfectly happy if the specific RDMA
> > > targets are *not* COW and are statically allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> If that's the case, then we are back to EBUSY *could* work (despite the
> objections made so far).

I linked it in my response to Jason [1], but the entire reason ext2,
ext4, and xfs scream "experimental" when DAX is enabled is because DAX
makes typical flows fail that used to work in the page-cache backed
mmap case. The failure of a data space management command like
fallocate(punch_hole) is more risky than just not allowing the memory
registration to happen in the first place. Leases result in a system
that has a chance at making forward progress.

The current state of disallowing RDMA for FS-DAX is one of the "if
(dax) goto fail;" conditions that needs to be solved before filesystem
developers graduate DAX from experimental status.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  2:42                   ` Doug Ledford
@ 2019-02-07  3:13                     ` Dan Williams
  2019-02-07 17:23                       ` Ira Weiny
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-07  3:13 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 6, 2019 at 6:42 PM Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> > On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> I watched this, and I see that Oracle is all sorts of excited that their
> storage machines can scale out, and they can access the storage and it
> has basically no CPU load on the storage server while performing
> millions of queries.  What I didn't hear in there is why DAX has to be
> in the picture, or why Oracle couldn't do the same thing with a simple
> memory region exported directly to the RDMA subsystem, or why reflink or
> any of the other features you talk about are needed.  So, while these
> things may legitimately be needed, this video did not tell me about
> how/why they are needed, just that RDMA is really, *really* cool for
> their use case and gets them 0% CPU utilization on their storage
> servers.  I didn't watch the whole thing though.  Do they get into that
> later on?  Do they get to that level of technical discussion, or is this
> all higher level?

They don't. The point of sharing that video was to illustrate the RDMA
to persistent memory use case. That 0% CPU utilization is because the
RDMA target is not page-cache / anonymous memory on the storage box;
it's directly a file offset in DAX / persistent memory. A solution to
truncate lets that use case use more than just Device-DAX or ODP
capable adapters. That said, I need to let Ira jump in here because
saying layout leases solves the problem is not true, it's just the
start of potentially solving the problem. It's not clear to me what
the long tail of work looks like once the filesystem raises a
notification to the RDMA target process.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:24               ` Doug Ledford
  2019-02-06 22:44                 ` Dan Williams
@ 2019-02-07  3:52                 ` Dave Chinner
  2019-02-07  5:23                   ` Jason Gunthorpe
  1 sibling, 1 reply; 106+ messages in thread
From: Dave Chinner @ 2019-02-07  3:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, Christopher Lameter, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Michal Hocko

On Wed, Feb 06, 2019 at 05:24:50PM -0500, Doug Ledford wrote:
> On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > 
> > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > a different file.
> > > > > 
> > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > listed as a return value of truncate, so extending the API to include
> > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > (or should not be) totally out of the question.
> > > > > 
> > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > portion where that alternative was ruled out?
> > > > 
> > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > about where there was an mmap with the O_DAX option to do direct write
> > > > though? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > > 
> > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > writes will physically move the data to a new physical location.
> > > This is non-negotiable, and cannot be blocked forever by a gup
> > > pin.
> > > 
> > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > the filesystem can move data physically on write access, and b)
> > > revokable file leases so that the filesystem can kick userspace out
> > > of the way when it needs to.
> > 
> > Why do we need both? You want to have leases for normal CPU mmaps too?

We don't need them for normal CPU mmaps because that's locally
addressable page fault capable hardware. i.e. if we need to
serialise something, we just use kernel locks, etc. When it's a
remote entity (such as RDMA) we have to get that remote entity to
release its reference/access so the kernel has exclusive access
to the resource it needs to act on.

IOWs, file layout leases are required for remote access to local
filesystem controlled storage. That's the access arbitration model
the pNFS implementation hooked into XFS uses and it seems to work
just fine. Local access just hooks in to the kernel XFS paths and
triggers lease/delegation recalls through the NFS server when
required.

If your argument is that "existing RDMA apps don't have a recall
mechanism" then that's what they are going to need to implement to
work with DAX+RDMA. Reliable remote access arbitration is required
for DAX+RDMA, regardless of what filesystem the data is hosted on.
Anything less is a potential security hole.

> > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > can change physical location at any time.
> > 
> > Then we must continue to ban longterm pin with DAX..
> > 
> > Nobody is going to want to deploy a system where revoke can happen at
> > any time and if you don't respond fast enough your system either locks
> > with some kind of FS meltdown or your process gets SIGKILL. 
> > 
> > I don't really see a reason to invest so much design work into
> > something that isn't production worthy.
> > 
> > It *almost* made sense with ftruncate, because you could architect to
> > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > 
> > Dave, you said the FS is responsible to arbitrate access to the
> > physical pages..
> > 
> > Is it possible to have a filesystem for DAX that is more suited to
> > this environment? Ie designed to not require block reallocation (no
> > COW, no reflinks, different approach to ftruncate, etc)
> 
> Can someone give me a real world scenario that someone is *actually*
> asking for with this?  Are DAX users demanding xfs, or is it just the
> filesystem of convenience?

I had a conference call last week with a room full of people who
want reflink functionality on DAX ASAP. They have customers that are
asking them to provide it, and the only vehicle they have to
deliver that functionality in any reasonable timeframe is XFS.

> Do they need to stick with xfs?  Are they
> really trying to do COW backed mappings for the RDMA targets?

I have no idea if they want RDMA. It is also irrelevant to the
requirement of and timeframe to support reflink on XFS w/ DAX.

Especially because:

# mkfs.xfs -f -m reflink=0 /dev/pmem1

And now you have an XFS filesystem configuration that does not
support dynamic moving of physical storage on write. You have to do
this anyway to use DAX right now, so it's hardly an issue to
require this for non-ODP capable RDMA hardware.

---

I think people are missing the point of LSFMM here - it is to work
out what we need to do to support all the functionality that both
users want and that the hardware provides in the medium term.

Once we have reflink on DAX, somebody is going to ask for
no-compromise RDMA support on these filesystems (e.g. NFSv4 file
server on pmem/FS-DAX that allows server side clones and clients use
RDMA access) and we're going to have to work out how to support it.
Rather than shouting at the messenger (XFS) that reports the hard
problems we have to solve, how about we work out exactly what we
need to do to support this functionality because it is coming and
people want it.

Requiring ODP capable hardware and applications that control RDMA
access to use file leases and be able to cancel/recall client side
delegations (like NFS is already able to do!) seems like a pretty
solid way forward here. We've already solved this "remote direct
physical accesses to local filesystem storage arbitration" problem
with NFSv4, we have both a server and a client in the kernel, so
maybe that should be the first application we aim to support with
DAX+RDMA?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  3:52                 ` Dave Chinner
@ 2019-02-07  5:23                   ` Jason Gunthorpe
  2019-02-07  6:00                     ` Dan Williams
                                       ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-07  5:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Doug Ledford, Christopher Lameter, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Michal Hocko

On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 05:24:50PM -0500, Doug Ledford wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > > 
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > > 
> > > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > > 
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > > 
> > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > > 
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > > 
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > > 
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> 
> We don't need them for normal CPU mmaps because that's locally
> addressable page fault capable hardware. i.e. if we need to
> serialise something, we just use kernel locks, etc. When it's a
> remote entity (such as RDMA) we have to get that remote entity to
> release it's reference/access so the kernel has exclusive access
> to the resource it needs to act on.

Why can't DAX follow the path of GPU? Jerome has been working on
patches that let GPU do page migrations and other activities and
maintain full sync with ODP MRs.

I don't know of a reason why DAX migration would be different from GPU
migration.

The ODP RDMA HW does support halting RDMA access and interrupting the
CPU to re-establish access, so you can get your locks/etc as needed.
With today's implementation DAX has to trigger all the needed MM
notifier call backs to make this work. Tomorrow it will have to
interact with the HMM mirror API.

Jerome is already demoing this for the GPU case, so the RDMA ODP HW is
fine.

Is DAX migration different in some way from GPU's migration that it
can't use this flow and needs a lease too??? This would be a big
surprise to me.

> If your argument is that "existing RDMA apps don't have a recall
> mechanism" then that's what they are going to need to implement to
> work with DAX+RDMA. Reliable remote access arbitration is required
> for DAX+RDMA, regardless of what filesystem the data is hosted on.

My argument is that is a toy configuration that no production user
would use. It either has the ability to wait for the lease to revoke
'forever' without consequence or the application will be critically
destabilized by the kernel's escalation to time bound the response.
(or production systems never get revoke)
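For context, the time-bound escalation referred to above already exists
for ordinary fcntl() leases: the kernel waits fs.lease-break-time
seconds (45 by default) before breaking a lease out from under an
unresponsive holder. A minimal sketch reading that bound, assuming
Linux procfs (the function name is made up for illustration):

```c
#include <stdio.h>

/* Return the kernel's lease-break grace period in seconds, or -1. */
int read_lease_break_time(void)
{
	FILE *f = fopen("/proc/sys/fs/lease-break-time", "r");
	int secs = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &secs) != 1)
		secs = -1;
	fclose(f);
	return secs;
}
```

Whether any bound an administrator picks here is long enough for a
production RDMA workload to quiesce remote peers is exactly the point
of contention.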

> Anything less is a potential security hole.

How does it get to a security hole? Obviously the pages under DMA
can't be re-used for anything..

> Once we have reflink on DAX, somebody is going to ask for
> no-compromise RDMA support on these filesystems (e.g. NFSv4 file
> server on pmem/FS-DAX that allows server side clones and clients use
> RDMA access) and we're going to have to work out how to support it.
> Rather than shouting at the messenger (XFS) that reports the hard
> problems we have to solve, how about we work out exactly what we
> need to do to support this functionality because it is coming and
> people want it.

I've thought this was basically solved - use ODP and you get full
functionality.  Until you just now brought up the idea that ODP is
not enough..

The arguing here is that there is certainly a subset of people that
don't want to use ODP. If we tell them a hard 'no' then the
conversation is done.

Otherwise, I like the idea of telling them to use a less featureful
XFS configuration that is 'safe' for non-ODP cases. The kernel has a
long history of catering to certain configurations by limiting
functionality or performance.

I don't like the idea of building toy leases just for this one,
arguably baroque, case.

> Requiring ODP capable hardware and applications that control RDMA
> access to use file leases and be able to cancel/recall client side
> delegations (like NFS is already able to do!) seems like a pretty

So, what happens on NFS if the revoke takes too long?

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  0:22                         ` Dan Williams
@ 2019-02-07  5:33                           ` Jason Gunthorpe
  0 siblings, 0 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-07  5:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko, linux-nvdimm

On Wed, Feb 06, 2019 at 04:22:16PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> [..]
> > > You're describing the current situation, i.e. Linux already implements
> > > this, it's called Device-DAX and some users of RDMA find it
> > > insufficient. The choices are to continue to tell them "no", or say
> > > "yes, but you need to submit to lease coordination".
> >
> > Device-DAX is not what I'm imagining when I say XFS--.
> >
> > I mean more like XFS with all features that require reallocation of
> > blocks disabled.
> >
> > Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
> > device-dax.
> 
> True, not all the way, but the distinction loses significance as you
> lose fs features.
> 
> Filesystems mark DAX functionality experimental [1] precisely because
> it forbids otherwise typical operations that work in the nominal page
> cache case. An approach that says "lets cement the list of things a
> filesystem or a core-memory-mangement facility can't do because RDMA
> finds it awkward" is bad precedent. 

I'm not saying these rules should apply globally.

I'm suggesting you could have a FS that supports gup_longterm by
design, and a FS that doesn't. And that is OK. They can have different
rules.

Obviously the golden case here is to use ODP (which doesn't call
gup_longterm at all) - that works for both.

Supporting non-ODP is a trade off case - users that want to run on
limited HW must accept limited functionality. Limited functionality is
better than no functionality.

Linux has many of these user-chosen tradeoffs. This is how it supports
such a wide range of HW capabilities. Not all HW can do all
things. Some features really do need HW support. It has always been
that way.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  5:23                   ` Jason Gunthorpe
@ 2019-02-07  6:00                     ` Dan Williams
  2019-02-07 17:17                       ` Jason Gunthorpe
  2019-02-07 15:04                     ` Chuck Lever
  2019-02-07 16:54                     ` Ira Weiny
  2 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-07  6:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 6, 2019 at 9:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> > On Wed, Feb 06, 2019 at 05:24:50PM -0500, Doug Ledford wrote:
> > > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > > >
> > > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > > a different file.
> > > > > > >
> > > > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > > (or should not be) totally out of the question.
> > > > > > >
> > > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > > portion where that alternative was ruled out?
> > > > > >
> > > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > > though? If we only allow this use case then we may not have to worry about
> > > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > > regardless.
> > > > >
> > > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > > writes will physically move the data to a new physical location.
> > > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > > pin.
> > > > >
> > > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > > the filesystem can move data physically on write access, and b)
> > > > > revokable file leases so that the filesystem can kick userspace out
> > > > > of the way when it needs to.
> > > >
> > > > Why do we need both? You want to have leases for normal CPU mmaps too?
> >
> > We don't need them for normal CPU mmaps because that's locally
> > addressable page fault capable hardware. i.e. if we need to
> > serialise something, we just use kernel locks, etc. When it's a
> > remote entity (such as RDMA) we have to get that remote entity to
> > release its reference/access so the kernel has exclusive access
> > to the resource it needs to act on.
>
> Why can't DAX follow the path of GPU? Jerome has been working on
> patches that let GPU do page migrations and other activities and
> maintain full sync with ODP MRs.
>
> I don't know of a reason why DAX migration would be different from GPU
> migration.

I don't think we need leases in the ODP case.

> The ODP RDMA HW does support halting RDMA access and interrupting the
> CPU to re-establish access, so you can get your locks/etc as needed. With
> today's implementation DAX has to trigger all the needed MM notifier
> call backs to make this work. Tomorrow it will have to interact with
> the HMM mirror API.
>
> Jerome is already demoing this for the GPU case, so the RDMA ODP HW is
> fine.
>
> Is DAX migration different in some way from GPU's migration that it
> can't use this flow and needs a lease to??? This would be a big
> surprise to me.

Agree, I see no need for leases in the ODP case; the mmu_notifier is
already fulfilling the same role as a lease notification.

> > If your argument is that "existing RDMA apps don't have a recall
> > mechanism" then that's what they are going to need to implement to
> > work with DAX+RDMA. Reliable remote access arbitration is required
> > for DAX+RDMA, regardless of what filesystem the data is hosted on.
>
> My argument is that is a toy configuration that no production user
> would use. It either has the ability to wait for the lease to revoke
> 'forever' without consequence or the application will be critically
> de-stabilized by the kernel's escalation to time bound the response.
> (or production systems never get revoke)

I think we're off track on the need for leases for anything other than
non-ODP hardware.

Otherwise this argument seems to be saying there is absolutely no safe
way to recall a memory registration from hardware, which does not make
sense because SIGKILL needs to work as a last resort.

> > Anything less is a potential security hole.
>
> How does it get to a security hole? Obviously the pages under DMA
> can't be re-used for anything..

Writes to storage outside the security domain they were intended due
to the filesystem reallocating the physical blocks.

> > Once we have reflink on DAX, somebody is going to ask for
> > no-compromise RDMA support on these filesystems (e.g. NFSv4 file
> > server on pmem/FS-DAX that allows server side clones and clients use
> > RDMA access) and we're going to have to work out how to support it.
> > Rather than shouting at the messenger (XFS) that reports the hard
> > problems we have to solve, how about we work out exactly what we
> > need to do to support this functionality because it is coming and
> > people want it.
>
> I've thought this was basically solved - use ODP and you get full
> functionality.  Until you just now brought up the idea that ODP is
> not enough..

I think that was a misunderstanding; I struggle to see how a driver
that agrees to be bound by mmu notifications (ODP) has any problems
with anything the filesystem wants to do with the mapping. My
assumption is that ODP == filesystem can invalidate mappings at will
and all is good.

> The arguing here is that there is certainly a subset of people that
> don't want to use ODP. If we tell them a hard 'no' then the
> conversation is done.

Again, SIGKILL must work and the RDMA target can't survive it, so it's
not impossible. Or are you saying not even SIGKILL can guarantee an
RDMA registration goes idle? Then I can see that "hard no" having real
teeth; otherwise it's a matter of software.

> Otherwise, I like the idea of telling them to use a less featureful
> XFS configuration that is 'safe' for non-ODP cases. The kernel has a
> long history of catering to certain configurations by limiting
> functionality or performance.

That's an unsustainable maintenance burden for the filesystem; just
keep the status quo of failing the registration at that point or
requiring a filesystem to not be used.

> I don't like the idea of building toy leases just for this one,
> arguably baroque, case.

What makes it a toy and baroque? Outside of RDMA registrations being
irretrievable I have a gap in my understanding of what makes this
pointless to even attempt?

> > Requiring ODP capable hardware and applications that control RDMA
> > access to use file leases and be able to cancel/recall client side
> > delegations (like NFS is already able to do!) seems like a pretty
>
> So, what happens on NFS if the revoke takes too long?
>
> Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  5:23                   ` Jason Gunthorpe
  2019-02-07  6:00                     ` Dan Williams
@ 2019-02-07 15:04                     ` Chuck Lever
  2019-02-07 15:28                       ` Tom Talpey
  2019-02-07 16:54                     ` Ira Weiny
  2 siblings, 1 reply; 106+ messages in thread
From: Chuck Lever @ 2019-02-07 15:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko



> On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> 
>> Requiring ODP capable hardware and applications that control RDMA
>> access to use file leases and be able to cancel/recall client side
>> delegations (like NFS is already able to do!) seems like a pretty
> 
> So, what happens on NFS if the revoke takes too long?

NFS distinguishes between "recall" and "revoke". Dave used "recall"
here, it means that the server recalls the client's delegation. If
the client doesn't respond, the server revokes the delegation
unilaterally and other users are allowed to proceed.
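The recall-then-revoke sequence described above can be sketched as a small
state machine. This is an illustrative model only, not the actual NFS server
implementation; the state names and grace period are invented:

```python
class Delegation:
    """Toy model of the NFSv4 recall/revoke distinction."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.state = "granted"
        self.recalled_at = None

    def recall(self, now):
        # Cooperative step: the server asks the client to return
        # the delegation voluntarily.
        self.state = "recalled"
        self.recalled_at = now

    def client_returns(self):
        # The client responded to the recall in time.
        self.state = "returned"

    def tick(self, now):
        # Unilateral step: if the client never responds within the
        # grace period, the server revokes the delegation so that
        # other users are allowed to proceed.
        if self.state == "recalled" and now - self.recalled_at >= self.grace_seconds:
            self.state = "revoked"
        return self.state

d = Delegation(grace_seconds=30)
d.recall(now=0)
assert d.tick(now=10) == "recalled"   # still waiting on the client
assert d.tick(now=40) == "revoked"    # server proceeds without it
```

The key property is that the server never blocks forever: the cooperative
path is preferred, but forward progress is guaranteed by the unilateral one.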


--
Chuck Lever





* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 15:04                     ` Chuck Lever
@ 2019-02-07 15:28                       ` Tom Talpey
  2019-02-07 15:37                         ` Doug Ledford
  2019-02-07 16:57                         ` Ira Weiny
  0 siblings, 2 replies; 106+ messages in thread
From: Tom Talpey @ 2019-02-07 15:28 UTC (permalink / raw)
  To: Chuck Lever, Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko

On 2/7/2019 10:04 AM, Chuck Lever wrote:
> 
> 
>> On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
>>
>>> Requiring ODP capable hardware and applications that control RDMA
>>> access to use file leases and be able to cancel/recall client side
>>> delegations (like NFS is already able to do!) seems like a pretty
>>
>> So, what happens on NFS if the revoke takes too long?
> 
> NFS distinguishes between "recall" and "revoke". Dave used "recall"
> here, it means that the server recalls the client's delegation. If
> the client doesn't respond, the server revokes the delegation
> unilaterally and other users are allowed to proceed.

The SMB3 protocol has a similar "lease break" mechanism, btw.

SMB3 "push mode" has long been expected to allow DAX mapping of files
only when an exclusive lease is held by the requesting client.
The server may recall the lease if the DAX mapping needs to change.

Once local (MMU) and remote (RDMA) mappings are dropped, the
client may re-request that the server reestablish them. No
connection or process is terminated, and no data is silently lost.

Tom.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 15:28                       ` Tom Talpey
@ 2019-02-07 15:37                         ` Doug Ledford
  2019-02-07 15:41                           ` Tom Talpey
  2019-02-07 16:57                         ` Ira Weiny
  1 sibling, 1 reply; 106+ messages in thread
From: Doug Ledford @ 2019-02-07 15:37 UTC (permalink / raw)
  To: Tom Talpey, Chuck Lever, Jason Gunthorpe
  Cc: Dave Chinner, Christopher Lameter, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2270 bytes --]

On Thu, 2019-02-07 at 10:28 -0500, Tom Talpey wrote:
> On 2/7/2019 10:04 AM, Chuck Lever wrote:
> > 
> > > On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > 
> > > On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> > > 
> > > > Requiring ODP capable hardware and applications that control RDMA
> > > > access to use file leases and be able to cancel/recall client side
> > > > delegations (like NFS is already able to do!) seems like a pretty
> > > 
> > > So, what happens on NFS if the revoke takes too long?
> > 
> > NFS distinguishes between "recall" and "revoke". Dave used "recall"
> > here, it means that the server recalls the client's delegation. If
> > the client doesn't respond, the server revokes the delegation
> > unilaterally and other users are allowed to proceed.
> 
> The SMB3 protocol has a similar "lease break" mechanism, btw.
> 
> SMB3 "push mode" has long been expected to allow DAX mapping of files
> only when an exclusive lease is held by the requesting client.
> The server may recall the lease if the DAX mapping needs to change.
> 
> Once local (MMU) and remote (RDMA) mappings are dropped, the
> client may re-request that the server reestablish them. No
> connection or process is terminated, and no data is silently lost.

Yeah, but you're referring to a situation where the communication agent
and the filesystem agent are one and the same and they work
cooperatively to resolve the issue.  With DAX under Linux, the
filesystem agent and the communication agent are separate, and right
now, to my knowledge, the filesystem agent doesn't tell the
communication agent about a broken lease; it wants to be able to do
things 100% transparently without any work on the communication agent's
part.  That works for ODP, but not for anything else.  If the filesystem
notified the communication agent of the need to drop the MMU region and
rebuild it, the communication agent could communicate that to the remote
host, and things would work.  But there's no POSIX message for "your
file is moving on media, redo your mmap".
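The missing hand-off could look roughly like the following sketch: a
filesystem agent notifies a registered communication agent, which then tells
the remote host to drop and rebuild its mapping. All class and message names
here are invented for illustration; no such kernel interface exists today:

```python
class CommunicationAgent:
    """Hypothetical RDMA-side agent; for non-ODP hardware this role
    is exactly what is missing today."""

    def __init__(self):
        self.remote_messages = []
        self.mapped = True

    def on_lease_break(self, path):
        # Tell the remote host to quiesce its RDMA access, then drop
        # the local mapping so the filesystem can move the blocks.
        self.remote_messages.append(("remap-required", path))
        self.mapped = False


class FilesystemAgent:
    def __init__(self):
        self.listeners = {}

    def register(self, path, agent):
        self.listeners.setdefault(path, []).append(agent)

    def relocate_file(self, path):
        # "Your file is moving on media, redo your mmap": the
        # notification POSIX lacks, delivered as an explicit callback.
        for agent in self.listeners.get(path, []):
            agent.on_lease_break(path)


fs = FilesystemAgent()
ca = CommunicationAgent()
fs.register("/mnt/pmem/db.dat", ca)
fs.relocate_file("/mnt/pmem/db.dat")
assert not ca.mapped
assert ca.remote_messages == [("remap-required", "/mnt/pmem/db.dat")]
```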

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 15:37                         ` Doug Ledford
@ 2019-02-07 15:41                           ` Tom Talpey
  2019-02-07 15:56                             ` Doug Ledford
  0 siblings, 1 reply; 106+ messages in thread
From: Tom Talpey @ 2019-02-07 15:41 UTC (permalink / raw)
  To: Doug Ledford, Chuck Lever, Jason Gunthorpe
  Cc: Dave Chinner, Christopher Lameter, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko

On 2/7/2019 10:37 AM, Doug Ledford wrote:
> On Thu, 2019-02-07 at 10:28 -0500, Tom Talpey wrote:
>> On 2/7/2019 10:04 AM, Chuck Lever wrote:
>>>
>>>> On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>
>>>> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
>>>>
>>>>> Requiring ODP capable hardware and applications that control RDMA
>>>>> access to use file leases and be able to cancel/recall client side
>>>>> delegations (like NFS is already able to do!) seems like a pretty
>>>>
>>>> So, what happens on NFS if the revoke takes too long?
>>>
>>> NFS distinguishes between "recall" and "revoke". Dave used "recall"
>>> here, it means that the server recalls the client's delegation. If
>>> the client doesn't respond, the server revokes the delegation
>>> unilaterally and other users are allowed to proceed.
>>
>> The SMB3 protocol has a similar "lease break" mechanism, btw.
>>
>> SMB3 "push mode" has long been expected to allow DAX mapping of files
>> only when an exclusive lease is held by the requesting client.
>> The server may recall the lease if the DAX mapping needs to change.
>>
>> Once local (MMU) and remote (RDMA) mappings are dropped, the
>> client may re-request that the server reestablish them. No
>> connection or process is terminated, and no data is silently lost.
> 
> Yeah, but you're referring to a situation where the communication agent
> and the filesystem agent are one and the same and they work
> cooperatively to resolve the issue.  With DAX under Linux, the
> filesystem agent and the communication agent are separate, and right
> now, to my knowledge, the filesystem agent doesn't tell the
> communication agent about a broken lease; it wants to be able to do
> things 100% transparently without any work on the communication agent's
> part.  That works for ODP, but not for anything else.  If the filesystem
> notified the communication agent of the need to drop the MMU region and
> rebuild it, the communication agent could communicate that to the remote
> host, and things would work.  But there's no POSIX message for "your
> file is moving on media, redo your mmap".

Indeed, the MMU notifier and the filesystem need to be integrated.

I'm unmoved by the POSIX argument. This stuff didn't happen in 1990.

Tom.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 15:41                           ` Tom Talpey
@ 2019-02-07 15:56                             ` Doug Ledford
  0 siblings, 0 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-07 15:56 UTC (permalink / raw)
  To: Tom Talpey, Chuck Lever, Jason Gunthorpe
  Cc: Dave Chinner, Christopher Lameter, Matthew Wilcox, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Dan Williams, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 3497 bytes --]

On Thu, 2019-02-07 at 10:41 -0500, Tom Talpey wrote:
> On 2/7/2019 10:37 AM, Doug Ledford wrote:
> > On Thu, 2019-02-07 at 10:28 -0500, Tom Talpey wrote:
> > > On 2/7/2019 10:04 AM, Chuck Lever wrote:
> > > > > On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > 
> > > > > On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> > > > > 
> > > > > > Requiring ODP capable hardware and applications that control RDMA
> > > > > > access to use file leases and be able to cancel/recall client side
> > > > > > delegations (like NFS is already able to do!) seems like a pretty
> > > > > 
> > > > > So, what happens on NFS if the revoke takes too long?
> > > > 
> > > > NFS distinguishes between "recall" and "revoke". Dave used "recall"
> > > > here, it means that the server recalls the client's delegation. If
> > > > the client doesn't respond, the server revokes the delegation
> > > > unilaterally and other users are allowed to proceed.
> > > 
> > > The SMB3 protocol has a similar "lease break" mechanism, btw.
> > > 
> > > SMB3 "push mode" has long been expected to allow DAX mapping of files
> > > only when an exclusive lease is held by the requesting client.
> > > The server may recall the lease if the DAX mapping needs to change.
> > > 
> > > Once local (MMU) and remote (RDMA) mappings are dropped, the
> > > client may re-request that the server reestablish them. No
> > > connection or process is terminated, and no data is silently lost.
> > 
> > Yeah, but you're referring to a situation where the communication agent
> > and the filesystem agent are one and the same and they work
> > cooperatively to resolve the issue.  With DAX under Linux, the
> > filesystem agent and the communication agent are separate, and right
> > now, to my knowledge, the filesystem agent doesn't tell the
> > communication agent about a broken lease; it wants to be able to do
> > things 100% transparently without any work on the communication agent's
> > part.  That works for ODP, but not for anything else.  If the filesystem
> > notified the communication agent of the need to drop the MMU region and
> > rebuild it, the communication agent could communicate that to the remote
> > host, and things would work.  But there's no POSIX message for "your
> > file is moving on media, redo your mmap".
> 
> Indeed, the MMU notifier and the filesystem need to be integrated.

And right now, the method of sharing this across the network is:

persistent memory in machine
  local filesystem supporting a DAX mount
    custom application that knows how to mmap then rdma map files,
    and can manage the connection long term

The point being that every single method of sharing this stuff is a
one-off custom application (Oracle just being one).  I'm not really all that
thrilled about the idea of writing the same mmap/rdma map/oob-management 
code in every single app out there.  To me, this problem is screaming
for a more general purpose kernel solution, just like NVMe over Fabrics.
I'm thinking a clustered filesystem on top of a shared memory segment
between hosts is a much more natural fit.  Then applications just mmap
the files locally, and the kernel does the rest.

> I'm unmoved by the POSIX argument. This stuff didn't happen in 1990.
> 
> Tom.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 22:44                 ` Dan Williams
                                     ` (2 preceding siblings ...)
  2019-02-07  2:42                   ` Doug Ledford
@ 2019-02-07 16:25                   ` Doug Ledford
  2019-02-07 16:55                     ` Christopher Lameter
  2019-02-07 17:24                     ` Matthew Wilcox
  3 siblings, 2 replies; 106+ messages in thread
From: Doug Ledford @ 2019-02-07 16:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 10169 bytes --]

I think I've finally wrapped my head around all of this.  Let's see if I
have this right:

* People are using filesystem DAX to expose byte addressable persistent
memory because putting a filesystem on the memory makes an easy way to
organize the data in the memory and share it between various processes. 
It's worth noting that this is not the only way to share this memory,
and arguably not even the best way, but it's what people are doing. 
However, to get byte level addressability on the remote side, we must
create files on the server side, mmap those files, then give a handle to
the memory region to the client side that the client then addresses on a
byte by byte basis.  This is because all of the normal kernel based
device sharing mechanisms are block based and don't provide byte level
addressability.

* People are asking for thin allocations, reflinks, deduplication,
whatever else because persistent memory is still small in terms of size
compared to the amount of data people want to put on it, so these
techniques stretch its usefulness.

* Because there is no kernel level mechanism for sharing byte
addressable memory, this only works with specific applications that have
been written to create files on byte addressable memory, mmap them, then
share them out via RDMA.  I bring this up because in the video linked in
this email, Oracle is gushing about how great this feature is.  But it's
important to understand that this only works because the Oracle
processes themselves are the filesystem sharing entity.  That means at
other points in this conversation where we've talked about the need for
forward progress, and non-ODP hardware, and the talk has come down to
sending SIGKILL to a process in order to free memory reservations, I
feel confident in saying that Oracle would *never* agree to this.  If
you kill an Oracle process to make forward progress, you are probably
also killing the very process that needed you to make progress in the
first place.  I'm pretty confident that Oracle will have no problem
whatsoever saying that ODP capable hardware is a hard requirement for
using their software with DAX.

* So if Oracle is likely to demand ODP hardware, period, are there other
scenarios that might be more accepting of a more limited FS on top of
DAX that doesn't support reflinks and deduplication?  I can think of a
possible yes to that answer rather easily.  Message brokerage servers
(amqp, qpid) have strict requirements about receiving a message and then
making sure that it makes it once, and only once, to all subscribed
receivers.  A natural way of organizing this sort of thing is to create
a persistent ring buffer for incoming messages, one per each connecting
client that is sending messages.  Then a log file for each client you
are sending messages back out to.  Putting these files on persistent
memory and then mapping the ring buffer to the clients, and writing your
own transmission journals to the persistent memory, would allow the
program to be very robust in the face of a program or system crash. 
This sort of usage would not require any thin allocations, reflinks, or
other such features, and yet would still find the filesystem
organization useful.  Therefore I think the answer is yes, there are at
least some use cases that would find a less featureful filesystem that
works with persistent memory and RDMA but without ODP to be of value.
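The persistent ring buffer pattern above can be sketched over a plain
mmap'd file. This is an assumption-laden illustration: a real pmem build
would use a DAX mapping and explicit cache flushes (e.g. via libpmem)
where this sketch uses mmap flushes, and the record layout is made up:

```python
import mmap
import os
import struct
import tempfile

RECORD = 64   # fixed-size message slot
SLOTS = 8     # ring capacity
HDR = 8       # 8-byte write index at offset 0

path = os.path.join(tempfile.mkdtemp(), "ring.buf")
with open(path, "wb") as f:
    f.truncate(HDR + RECORD * SLOTS)

f = open(path, "r+b")
buf = mmap.mmap(f.fileno(), 0)

def append(payload: bytes):
    (widx,) = struct.unpack_from("<Q", buf, 0)
    off = HDR + (widx % SLOTS) * RECORD
    buf[off:off + RECORD] = payload.ljust(RECORD, b"\0")
    buf.flush()                          # persist the record first...
    struct.pack_into("<Q", buf, 0, widx + 1)
    buf.flush()                          # ...then publish the index

append(b"msg-1")
append(b"msg-2")
(widx,) = struct.unpack_from("<Q", buf, 0)
assert widx == 2
```

Ordering the record flush before the index flush is what makes the
structure robust across a crash: a reader never sees an index pointing
at an unwritten slot. Note this usage needs no reflinks or thin
allocation, only stable block placement.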

* Really though, as I said in my email to Tom Talpey, this entire
situation is simply screaming that we are doing DAX networking wrong. 
We shouldn't be writing the networking code once in every single
application that wants to do this.  If we had a memory segment that we
shared from server to client(s), and in that memory segment we
implemented a clustered filesystem, then applications would simply mmap
local files and be done with it.  If the file needed to move, the kernel
would update the mmap in the application, done.  If you ask me, it is
the attempt to do this the wrong way that is resulting in all this
heartache.  That said, for today, my recommendation would be to require
ODP hardware for XFS filesystems with the DAX option, but allow ext2
filesystems to mount DAX filesystems on non-ODP hardware, and go in and
modify the ext2 filesystem so that on DAX mounts, it disables hole punch
and ftruncate any time they would result in the forced removal of an
established mmap.
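That ext2 suggestion amounts to failing the operation with EBUSY instead
of revoking the pin. A toy model of the policy (all names invented; the
real implementation would live in the filesystem's setattr/fallocate
paths and count longterm GUP references on the affected pages):

```python
import errno

class DaxFile:
    """Toy model: on a DAX mount without ODP, refuse shrinking
    truncates while established mappings pin the blocks."""

    def __init__(self, size):
        self.size = size
        self.pins = 0          # longterm GUP references

    def pin_for_rdma(self):
        self.pins += 1

    def unpin(self):
        self.pins -= 1

    def ftruncate(self, new_size):
        # Shrinking would free pinned pages, so fail instead of
        # reallocating blocks out from under the RDMA mapping.
        if new_size < self.size and self.pins:
            return -errno.EBUSY
        self.size = new_size
        return 0

f = DaxFile(size=4096)
f.pin_for_rdma()
assert f.ftruncate(0) == -errno.EBUSY   # refused while pinned
f.unpin()
assert f.ftruncate(0) == 0              # allowed once idle
```

Extending the truncate API's documented errors to include EBUSY is the
user-visible cost of this approach, as discussed earlier in the thread.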


On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > > 
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > > 
> > > > > > Why is the solution revoke then?  Is there something besides truncate
> > > > > > that we have to worry about?  I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > > 
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > > 
> > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > > 
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > > 
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > > 
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > > 
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > > 
> > > Then we must continue to ban longterm pin with DAX..
> > > 
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > > 
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > > 
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > > 
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > > 
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> > 
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
> 
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
> 
> https://youtu.be/ywKPPIE8JfQ?t=395
> 
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through device-file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is what
> to do about non-ODP capable hardware and Filesystem-DAX facility. The
> current answer is "no RDMA for you".
> 
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
> 
> xfs is the only Linux filesystem that supports DAX and reflink.
> 
> > Do they need to stick with xfs?
> 
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap that where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
> 
> > Are they
> > really trying to do COW backed mappings for the RDMA targets?  Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
> 
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.



-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-06 20:54                 ` Doug Ledford
@ 2019-02-07 16:48                   ` Jan Kara
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kara @ 2019-02-07 16:48 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Matthew Wilcox, Christopher Lameter, Jason Gunthorpe, Jan Kara,
	Ira Weiny, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Dave Chinner,
	Michal Hocko

On Wed 06-02-19 15:54:01, Doug Ledford wrote:
> On Wed, 2019-02-06 at 12:20 -0800, Matthew Wilcox wrote:
> > On Wed, Feb 06, 2019 at 03:16:02PM -0500, Doug Ledford wrote:
> > > On Wed, 2019-02-06 at 11:40 -0800, Matthew Wilcox wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > > 
> > > > ... except for truncate.  And now that I think about it, there was a
> > > > desire to support hot-unplug which also needed revoke.
> > > 
> > > We already support hot unplug of RDMA devices.  But it is extreme.  How
> > > does hot unplug deal with a program running from the device (something
> > > that would have returned ETXTBSY)?
> > 
> > Not hot-unplugging the RDMA device but hot-unplugging an NV-DIMM.
> 
> Is an NV-DIMM the only thing we use DAX on?

Currently yes. However, KVM people are soon going to use it for their
purposes as well (essentially directly sharing host page cache between
guests).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  5:23                   ` Jason Gunthorpe
  2019-02-07  6:00                     ` Dan Williams
  2019-02-07 15:04                     ` Chuck Lever
@ 2019-02-07 16:54                     ` Ira Weiny
  2 siblings, 0 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-07 16:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, lsf-pc, linux-rdma, linux-mm, linux-kernel,
	John Hubbard, Jerome Glisse, Dan Williams, Michal Hocko

On Wed, Feb 06, 2019 at 10:23:10PM -0700, Jason Gunthorpe wrote:
> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> 
> > Requiring ODP capable hardware and applications that control RDMA
> > access to use file leases and be able to cancel/recall client side
> > delegations (like NFS is already able to do!) seems like a pretty
> 
> So, what happens on NFS if the revoke takes too long?

This is the fundamental issue with RDMA revoke.  With RDMA and some hardware
you are going to end up killing processes.  If the decision is that only
processes on non-ODP hardware get killed and the user basically "should not
have done that" then I'm ok with that.  However, then we really need to
prevented them from registering the memory in the first place.  Which means we
leave in the "longterm" GUP registration and fail those registrations can't be
supported.

Ira

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 16:25                   ` Doug Ledford
@ 2019-02-07 16:55                     ` Christopher Lameter
  2019-02-07 17:35                       ` Ira Weiny
  2019-02-08  4:43                       ` Dave Chinner
  2019-02-07 17:24                     ` Matthew Wilcox
  1 sibling, 2 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-07 16:55 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Dan Williams, Jason Gunthorpe, Dave Chinner, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

One approach that may be a clean way to solve this:

1. Long term GUP usage requires the virtual mapping to the pages be fixed
   for the duration of the GUP Map. There never has been a way to break
   the pinning and thus this needs to be preserved.

2. Page Cache Long term pins are not allowed since regular filesystems
   depend on COW and other tricks which are incompatible with a long term
   pin.

3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
   provide the virtual mapping when the PIN is done and DO NO OPERATIONS
   on the longterm pinned range until the long term pin is removed.
   Hardware may do its job (like for persistent memory) but no data
   consistency on the NVDIMM medium is guaranteed until the long term pin
   is removed and the filesystem regains control over the area.

4. Long term pin means that the mapped sections are an actively used part
   of the file (like a filesystem write) and it cannot be truncated for
   the duration of the pin. It can be thought of as if the truncate is
   immediate followed by a write extending the file again. The mapping
   by RDMA implies after all that remote writes can occur at anytime
   within the area pinned long term.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 15:28                       ` Tom Talpey
  2019-02-07 15:37                         ` Doug Ledford
@ 2019-02-07 16:57                         ` Ira Weiny
  2019-02-07 21:31                           ` Tom Talpey
  1 sibling, 1 reply; 106+ messages in thread
From: Ira Weiny @ 2019-02-07 16:57 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Chuck Lever, Jason Gunthorpe, Dave Chinner, Doug Ledford,
	Christopher Lameter, Matthew Wilcox, Jan Kara, lsf-pc,
	linux-rdma, linux-mm, Linux Kernel Mailing List, John Hubbard,
	Jerome Glisse, Dan Williams, Michal Hocko

On Thu, Feb 07, 2019 at 10:28:05AM -0500, Tom Talpey wrote:
> On 2/7/2019 10:04 AM, Chuck Lever wrote:
> > 
> > 
> > > On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > 
> > > On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
> > > 
> > > > Requiring ODP capable hardware and applications that control RDMA
> > > > access to use file leases and be able to cancel/recall client side
> > > > delegations (like NFS is already able to do!) seems like a pretty
> > > 
> > > So, what happens on NFS if the revoke takes too long?
> > 
> > NFS distinguishes between "recall" and "revoke". Dave used "recall"
> > here, it means that the server recalls the client's delegation. If
> > the client doesn't respond, the server revokes the delegation
> > unilaterally and other users are allowed to proceed.
> 
> The SMB3 protocol has a similar "lease break" mechanism, btw.
> 
> SMB3 "push mode" has long been expected to allow DAX mapping of files
> only when an exclusive lease is held by the requesting client.
> The server may recall the lease if the DAX mapping needs to change.
> 
> Once local (MMU) and remote (RDMA) mappings are dropped, the
> client may re-request that the server reestablish them. No
> connection or process is terminated, and no data is silently lost.

How long does one wait for these remote mappings to be dropped?

Ira


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  6:00                     ` Dan Williams
@ 2019-02-07 17:17                       ` Jason Gunthorpe
  2019-02-07 23:54                         ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-07 17:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 06, 2019 at 10:00:28PM -0800, Dan Williams wrote:

> > > If your argument is that "existing RDMA apps don't have a recall
> > > mechanism" then that's what they are going to need to implement to
> > > work with DAX+RDMA. Reliable remote access arbitration is required
> > > for DAX+RDMA, regardless of what filesysetm the data is hosted on.
> >
> > My argument is that is a toy configuration that no production user
> > would use. It either has the ability to wait for the lease to revoke
> > 'forever' without consequence or the application will be critically
> > de-stablized by the kernel's escalation to time bound the response.
> > (or production systems never get revoke)
> 
> I think we're off track on the need for leases for anything other than
> non-ODP hardware.
> 
> Otherwise this argument seems to be saying there is absolutely no safe
> way to recall a memory registration from hardware, which does not make
> sense because SIGKILL needs to work as a last resort.

SIGKILL destroys all the process's resources. This is supported.

You are asking for some way to do a targeted *disablement* (we can't
do destroy) of a single resource.

There is an optional operation that could do what you want
'rereg_user_mr'- however only 3 out of 17 drivers implement it, one of
those drivers supports ODP, and one is supporting old hardware nearing
its end of life.

Of the two that are left, it looks like you might be able to use
IB_MR_REREG_PD to basically disable the MR. Maybe. The spec for this
API does not describe it as a fence - the application is supposed to quiet traffic
before invoking it. So even if it did work, it may not be synchronous
enough to be safe for DAX.

But let's imagine the one driver where this is relevant gets updated
FW that makes this into a fence..

Then the application's communication would more or less explode in a
very strange and unexpected way, but perhaps it could learn to put the
pieces back together, reconnect and restart from scratch.

So, we could imagine doing something here, but it requires things we
don't have, more standardization, and drivers to implement new
functionality. This is not likely to happen.

Thus any lease mechanism is essentially stuck with SIGKILL as the
escalation.

> > The arguing here is that there is certainly a subset of people that
> > don't want to use ODP. If we tell them a hard 'no' then the
> > conversation is done.
> 
> Again, SIGKILL must work the RDMA target can't survive that, so it's
> not impossible, or are you saying not even SIGKILL can guarantee an
> RDMA registration goes idle? Then I can see that "hard no" having real
> teeth otherwise it's a matter of software.

Resorting to SIGKILL makes this into a toy, no real production user
would operate in that world.

> > I don't like the idea of building toy leases just for this one,
> > arguably baroque, case.
> 
> What makes it a toy and baroque? Outside of RDMA registrations being
> irretrievable I have a gap in my understanding of what makes this
> pointless to even attempt?

Insisting on running RDMA & DAX without ODP and building an elaborate
revoke mechanism to support non-ODP HW is inherently baroque. 

Use the HW that supports ODP.

Since no HW can do disable of a MR, the escalation path is SIGKILL
which makes it a non-production toy.

What you keep missing is that for people doing this - the RDMA is a
critical component of the system, you can't just say the kernel will
randomly degrade/kill RDMA processes - that is a 'toy' configuration
that is not production worthy.

Especially since this revoke idea is basically a DOS engine for the
RDMA protocol if another process can do actions to trigger revoke. Now
we have a new class of security problems. (again, screams non
production toy)

The only production worthy way is to have the FS be a partner in
making this work without requiring revoke, so the critical RDMA
traffic can operate safely.

Otherwise we need to stick to ODP.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07  3:13                     ` Dan Williams
@ 2019-02-07 17:23                       ` Ira Weiny
  0 siblings, 0 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-07 17:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: Doug Ledford, Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Wed, Feb 06, 2019 at 07:13:16PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 6:42 PM Doug Ledford <dledford@redhat.com> wrote:
> >
> > On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> > > On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > > > Can someone give me a real world scenario that someone is *actually*
> > > > asking for with this?
> > >
> > > I'll point to this example. At the 6:35 mark Kodi talks about the
> > > Oracle use case for DAX + RDMA.
> > >
> > > https://youtu.be/ywKPPIE8JfQ?t=395
> >
> > I watched this, and I see that Oracle is all sorts of excited that their
> > storage machines can scale out, and they can access the storage and it
> > has basically no CPU load on the storage server while performing
> > millions of queries.  What I didn't hear in there is why DAX has to be
> > in the picture, or why Oracle couldn't do the same thing with a simple
> > memory region exported directly to the RDMA subsystem, or why reflink or
> > any of the other features you talk about are needed.  So, while these
> > things may legitimately be needed, this video did not tell me about
> > how/why they are needed, just that RDMA is really, *really* cool for
> > their use case and gets them 0% CPU utilization on their storage
> > servers.  I didn't watch the whole thing though.  Do they get into that
> > later on?  Do they get to that level of technical discussion, or is this
> > all higher level?
> 
> They don't. The point of sharing that video was illustrating the RDMA-
> to-persistent-memory use case. That 0% CPU utilization is because the
> RDMA target is not page cache / anonymous memory on the storage box; it
> maps directly to a file offset in DAX / persistent memory. A solution to
> truncate lets that use case use more than just Device-DAX or ODP
> capable adapters. That said, I need to let Ira jump in here because
> saying layout leases solve the problem is not true; it's just the
> start of potentially solving the problem. It's not clear to me what
> the long tail of work looks like once the filesystem raises a
> notification to the RDMA target process.

This is exactly the problem which has been touched on by others throughout this
thread.

1) To fully support leases on all hardware we will have to allow for RDMA
   processes to be killed when they don't respond to the lease

   a) If the process has done something bad (like truncate or hole punch) then
      the idea that "they get what they deserve" may be ok.

   b) However, if this is because of some underlying file system maintenance
      this is, as Jason says, unreasonable.  It would be much better to tell the
      application "you can't do this"

2) To fully respond to a lease revocation involves a number of kernel changes
   in the RDMA stack but more importantly modifying every user space RDMA
   application to respond to a message from a channel they may not even be
   listening to.

I think this is where Jason is getting very concerned.  When you
combine 1b and 2 you end up with a non-production-worthy solution.

NOTE: This is somewhat true of ODP hardware as well since applications register
each individual RDMA memory region as either ODP or not.  So out of the box not
all applications would work automatically.

Ira


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 16:25                   ` Doug Ledford
  2019-02-07 16:55                     ` Christopher Lameter
@ 2019-02-07 17:24                     ` Matthew Wilcox
  2019-02-07 17:26                       ` Jason Gunthorpe
  1 sibling, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-07 17:24 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Dan Williams, Jason Gunthorpe, Dave Chinner, Christopher Lameter,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 11:25:35AM -0500, Doug Ledford wrote:
> * Really though, as I said in my email to Tom Talpey, this entire
> situation is simply screaming that we are doing DAX networking wrong. 
> We shouldn't be writing the networking code once in every single
> application that wants to do this.  If we had a memory segment that we
> shared from server to client(s), and in that memory segment we
> implemented a clustered filesystem, then applications would simply mmap
> local files and be done with it.  If the file needed to move, the kernel
> would update the mmap in the application, done.  If you ask me, it is
> the attempt to do this the wrong way that is resulting in all this
> heartache.  That said, for today, my recommendation would be to require
> ODP hardware for XFS filesystems with the DAX option, but allow ext2
> filesystems to be mounted with DAX on non-ODP hardware, and go in and
> modify the ext2 filesystem so that on DAX mounts, it disables hole punch
> and ftruncate any time they would result in the forced removal of an
> established mmap.

I agree that something's wrong, but I think the fundamental problem is
that there's no concept in RDMA of having an STag for storage rather
than for memory.

Imagine if we could associate an STag with a file descriptor on the
server.  The client could then perform an RDMA to that STag.  On the
server, we'd need lots of smarts in the card and in the OS to know how
to treat that packet on arrival -- depending on what the file descriptor
referred to, it might only have to write into the page cache, or it
might set up an NVMe DMA, or it might resolve the underlying physical
address and DMA directly to an NV-DIMM.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 17:24                     ` Matthew Wilcox
@ 2019-02-07 17:26                       ` Jason Gunthorpe
  0 siblings, 0 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-07 17:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Doug Ledford, Dan Williams, Dave Chinner, Christopher Lameter,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 09:24:05AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 07, 2019 at 11:25:35AM -0500, Doug Ledford wrote:
> > * Really though, as I said in my email to Tom Talpey, this entire
> > situation is simply screaming that we are doing DAX networking wrong. 
> > We shouldn't be writing the networking code once in every single
> > application that wants to do this.  If we had a memory segment that we
> > shared from server to client(s), and in that memory segment we
> > implemented a clustered filesystem, then applications would simply mmap
> > local files and be done with it.  If the file needed to move, the kernel
> > would update the mmap in the application, done.  If you ask me, it is
> > the attempt to do this the wrong way that is resulting in all this
> > heartache.  That said, for today, my recommendation would be to require
> > ODP hardware for XFS filesystems with the DAX option, but allow ext2
> > filesystems to be mounted with DAX on non-ODP hardware, and go in and
> > modify the ext2 filesystem so that on DAX mounts, it disables hole punch
> > and ftruncate any time they would result in the forced removal of an
> > established mmap.
> 
> I agree that something's wrong, but I think the fundamental problem is
> that there's no concept in RDMA of having an STag for storage rather
> than for memory.
> 
> Imagine if we could associate an STag with a file descriptor on the
> server.  The client could then perform an RDMA to that STag.  On the
> server, we'd need lots of smarts in the card and in the OS to know how
> to treat that packet on arrival -- depending on what the file descriptor
> referred to, it might only have to write into the page cache, or it
> might set up an NVMe DMA, or it might resolve the underlying physical
> address and DMA directly to an NV-DIMM.

I think you just described ODP MRs.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 16:55                     ` Christopher Lameter
@ 2019-02-07 17:35                       ` Ira Weiny
  2019-02-07 18:17                         ` Christopher Lameter
  2019-02-08  4:43                       ` Dave Chinner
  1 sibling, 1 reply; 106+ messages in thread
From: Ira Weiny @ 2019-02-07 17:35 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Dave Chinner,
	Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> One approach that may be a clean way to solve this:
> 
> 1. Long term GUP usage requires the virtual mapping to the pages be fixed
>    for the duration of the GUP Map. There never has been a way to break
>    the pinning and thus this needs to be preserved.

How does this fit in with the changes John is making?

> 
> 2. Page Cache Long term pins are not allowed since regular filesystems
>    depend on COW and other tricks which are incompatible with a long term
>    pin.

Unless the hardware supports ODP or equivalent functionality.  Right?

> 
> 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
>    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
>    on the longterm pinned range until the long term pin is removed.
>    Hardware may do its job (like for persistent memory) but no data
>    consistency on the NVDIMM medium is guaranteed until the long term pin
>    is removed and the filesystem regains control over the area.

I believe Dan attempted something like this and it became pretty difficult.

> 
> 4. Long term pin means that the mapped sections are an actively used part
>    of the file (like a filesystem write) and it cannot be truncated for
>    the duration of the pin. It can be thought of as if the truncate is
>    immediate, followed by a write extending the file again. The mapping
>    by RDMA implies after all that remote writes can occur at anytime
>    within the area pinned long term.
>

This is a very interesting idea.  I've never quite thought of it that way.

That would be essentially like failing the truncate but without actually
failing it...  sneaky.  ;-)

What if user space then writes to the end of the file?  Does that write end
up at the point they truncated to or off the end of the mmapped area (old
length)?

I can see the behavior being defined either way.  But one interferes with the
RDMA data and the other does not.  Not sure which is easier for the FS to
handle either.

Ira


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 17:35                       ` Ira Weiny
@ 2019-02-07 18:17                         ` Christopher Lameter
  0 siblings, 0 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-07 18:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Dave Chinner,
	Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, 7 Feb 2019, Ira Weiny wrote:

> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> >
> > 1. Long term GUP usage requires the virtual mapping to the pages be fixed
> >    for the duration of the GUP Map. There never has been a way to break
> >    the pinning and thus this needs to be preserved.
>
> How does this fit in with the changes John is making?
>
> >
> > 2. Page Cache Long term pins are not allowed since regular filesystems
> >    depend on COW and other tricks which are incompatible with a long term
> >    pin.
>
> Unless the hardware supports ODP or equivalent functionality.  Right?

Ok we could make an exception there. But that is not required as a first
step and only some hardware would support it.

> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> >    on the longterm pinned range until the long term pin is removed.
> >    Hardware may do its job (like for persistent memory) but no data
> >    consistency on the NVDIMM medium is guaranteed until the long term pin
> >    is removed and the filesystem regains control over the area.
>
> I believe Dan attempted something like this and it became pretty difficult.

What is difficult about leaving things alone that are pinned? We already
have to do that currently because the refcount is elevated.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 16:57                         ` Ira Weiny
@ 2019-02-07 21:31                           ` Tom Talpey
  0 siblings, 0 replies; 106+ messages in thread
From: Tom Talpey @ 2019-02-07 21:31 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Chuck Lever, Jason Gunthorpe, Dave Chinner, Doug Ledford,
	Christopher Lameter, Matthew Wilcox, Jan Kara, lsf-pc,
	linux-rdma, linux-mm, Linux Kernel Mailing List, John Hubbard,
	Jerome Glisse, Dan Williams, Michal Hocko

On 2/7/2019 11:57 AM, Ira Weiny wrote:
> On Thu, Feb 07, 2019 at 10:28:05AM -0500, Tom Talpey wrote:
>> On 2/7/2019 10:04 AM, Chuck Lever wrote:
>>>
>>>
>>>> On Feb 7, 2019, at 12:23 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>
>>>> On Thu, Feb 07, 2019 at 02:52:58PM +1100, Dave Chinner wrote:
>>>>
>>>>> Requiring ODP capable hardware and applications that control RDMA
>>>>> access to use file leases and be able to cancel/recall client side
>>>>> delegations (like NFS is already able to do!) seems like a pretty
>>>>
>>>> So, what happens on NFS if the revoke takes too long?
>>>
>>> NFS distinguishes between "recall" and "revoke". Dave used "recall"
>>> here, it means that the server recalls the client's delegation. If
>>> the client doesn't respond, the server revokes the delegation
>>> unilaterally and other users are allowed to proceed.
>>
>> The SMB3 protocol has a similar "lease break" mechanism, btw.
>>
>> SMB3 "push mode" has long been expected to allow DAX mapping of files
>> only when an exclusive lease is held by the requesting client.
>> The server may recall the lease if the DAX mapping needs to change.
>>
>> Once local (MMU) and remote (RDMA) mappings are dropped, the
>> client may re-request that the server reestablish them. No
>> connection or process is terminated, and no data is silently lost.
> 
> How long does one wait for these remote mappings to be dropped?

The recall process depends on several things, but it certainly takes a
network round trip.

If recall fails, the file protocols allow the server to revoke. However,
since this results in loss of data, it's a last resort.

Tom.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 17:17                       ` Jason Gunthorpe
@ 2019-02-07 23:54                         ` Dan Williams
  2019-02-08  1:44                           ` Ira Weiny
  2019-02-08  5:19                           ` Jason Gunthorpe
  0 siblings, 2 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-07 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 7, 2019 at 9:17 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Feb 06, 2019 at 10:00:28PM -0800, Dan Williams wrote:
>
> > > > If your argument is that "existing RDMA apps don't have a recall
> > > > mechanism" then that's what they are going to need to implement to
> > > > work with DAX+RDMA. Reliable remote access arbitration is required
> > > > for DAX+RDMA, regardless of what filesysetm the data is hosted on.
> > >
> > > My argument is that is a toy configuration that no production user
> > > would use. It either has the ability to wait for the lease to revoke
> > > 'forever' without consequence or the application will be critically
> > > de-stabilized by the kernel's escalation to time-bound the response.
> > > (or production systems never get revoke)
> >
> > I think we're off track on the need for leases for anything other than
> > non-ODP hardware.
> >
> > Otherwise this argument seems to be saying there is absolutely no safe
> > way to recall a memory registration from hardware, which does not make
> > sense because SIGKILL needs to work as a last resort.
>
> SIGKILL destroys all the process's resources. This is supported.
>
> You are asking for some way to do a targeted *disablement* (we can't
> do destroy) of a single resource.
>
> There is an optional operation that could do what you want
> 'rereg_user_mr'- however only 3 out of 17 drivers implement it, one of
> those drivers supports ODP, and one is supporting old hardware nearing
> its end of life.
>
> Of the two that are left, it looks like you might be able to use
> IB_MR_REREG_PD to basically disable the MR. Maybe. The spec for this
> API does not describe it as a fence - the application is supposed to quiet traffic
> before invoking it. So even if it did work, it may not be synchronous
> enough to be safe for DAX.
>
> But let's imagine the one driver where this is relevant gets updated
> FW that makes this into a fence..
>
> Then the application's communication would more or less explode in a
> very strange and unexpected way, but perhaps it could learn to put the
> pieces back together, reconnect and restart from scratch.
>
> So, we could imagine doing something here, but it requires things we
> don't have, more standardization, and drivers to implement new
> functionality. This is not likely to happen.
>
> Thus any lease mechanism is essentially stuck with SIGKILL as the
> escalation.
>
> > > The arguing here is that there is certainly a subset of people that
> > > don't want to use ODP. If we tell them a hard 'no' then the
> > > conversation is done.
> >
> > Again, SIGKILL must work the RDMA target can't survive that, so it's
> > not impossible, or are you saying not even SIGKILL can guarantee an
> > RDMA registration goes idle? Then I can see that "hard no" having real
> > teeth otherwise it's a matter of software.
>
> Resorting to SIGKILL makes this into a toy, no real production user
> would operate in that world.
>
> > > I don't like the idea of building toy leases just for this one,
> > > arguably baroque, case.
> >
> > What makes it a toy and baroque? Outside of RDMA registrations being
> > irretrievable I have a gap in my understanding of what makes this
> > pointless to even attempt?
>
> Insisting on running RDMA & DAX without ODP and building an elaborate
> revoke mechanism to support non-ODP HW is inherently baroque.
>
> Use the HW that supports ODP.
>
> Since no HW can disable an MR, the escalation path is SIGKILL
> which makes it a non-production toy.
>
> What you keep missing is that for people doing this - the RDMA is a
> critical component of the system, you can't just say the kernel will
> randomly degrade/kill RDMA processes - that is a 'toy' configuration
> that is not production worthy.
>
> Especially since this revoke idea is basically a DOS engine for the
> RDMA protocol if another process can do actions to trigger revoke. Now
> we have a new class of security problems (again, screams
> non-production toy).
>
> The only production worthy way is to have the FS be a partner in
> making this work without requiring revoke, so the critical RDMA
> traffic can operate safely.
>
> Otherwise we need to stick to ODP.
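
(For reference, the rereg operation discussed above is exposed to
userspace as ibv_rereg_mr(). A hypothetical "disable by moving the MR
to an unused PD" would look roughly like the sketch below - illustrative
pseudocode only, since it needs RDMA hardware plus one of the few
drivers implementing rereg, and, as noted, the verbs spec does not
promise fence semantics, so in-flight DMA may not be quiesced.)

```c
/* Illustrative sketch only: "quarantine" an MR by re-registering it
 * into a protection domain that no QP uses.  Local and remote access
 * through the old keys should then fail with a protection error.
 * NOT specified as a fence - DMA already in flight may still land. */
#include <infiniband/verbs.h>

int quarantine_mr(struct ibv_mr *mr, struct ibv_pd *dead_pd)
{
	/* IBV_REREG_MR_CHANGE_PD moves the MR into dead_pd without
	 * changing its translation; returns 0 on success. */
	return ibv_rereg_mr(mr, IBV_REREG_MR_CHANGE_PD,
			    dead_pd, NULL, 0, 0);
}
```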

Thanks for this, it clears a lot of things up for me...

...but this statement:

> The only production worthy way is to have the FS be a partner in
> making this work without requiring revoke, so the critical RDMA
> traffic can operate safely.

...reveals a path forward. Just swap out "FS be a partner" with "system
administrator be a partner". In other words, if the RDMA stack can't
tolerate an MR being disabled then the administrator needs to actively
disable the paths that would trigger it. Turn off reflink, don't
truncate, avoid any future FS feature that might generate unwanted
lease breaks. We would need to make sure that lease notifications
include the information to identify the lease breaker to debug escapes
that might happen, but it is a solution that can be qualified to never
break a lease. In any event, this lets end users pick their filesystem
(modulo RDMA incompatible features), provides an enumeration of lease
break sources in the kernel, and opens up FS-DAX to a wider array of
RDMA adapters. In general this is what Linux has historically done,
give end users technology freedom.
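
As a concrete sketch of what that administrator contract could look
like today (the mkfs.xfs reflink option is real; the device paths and
the recipe itself are illustrative, not a vetted configuration):

```sh
# Create the filesystem without reflink, removing one lease-break
# source entirely (mkfs.xfs supports -m reflink=0|1).
mkfs.xfs -m reflink=0 /dev/pmem0

# Mount with DAX so pages handed to longterm GUP map persistent
# memory directly rather than the page cache.
mount -o dax /dev/pmem0 /mnt/pmem
```

Truncate and hole punch would still have to be avoided by policy,
since there is no mount option to mask them.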

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 23:54                         ` Dan Williams
@ 2019-02-08  1:44                           ` Ira Weiny
  2019-02-08  5:19                           ` Jason Gunthorpe
  1 sibling, 0 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-08  1:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Dave Chinner, Doug Ledford, Christopher Lameter,
	Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 03:54:58PM -0800, Dan Williams wrote:
> On Thu, Feb 7, 2019 at 9:17 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > Insisting on running RDMA & DAX without ODP and building an elaborate
> > revoke mechanism to support non-ODP HW is inherently baroque.
> >
> > Use the HW that supports ODP.
> >
> > Since no HW can disable an MR, the escalation path is SIGKILL
> > which makes it a non-production toy.
> >
> > What you keep missing is that for people doing this - the RDMA is a
> > critical component of the system, you can't just say the kernel will
> > randomly degrade/kill RDMA processes - that is a 'toy' configuration
> > that is not production worthy.
> >
> > Especially since this revoke idea is basically a DOS engine for the
> > RDMA protocol if another process can do actions to trigger revoke. Now
> > we have a new class of security problems (again, screams
> > non-production toy).
> >
> > The only production worthy way is to have the FS be a partner in
> > making this work without requiring revoke, so the critical RDMA
> > traffic can operate safely.
> >
> > Otherwise we need to stick to ODP.
> 
> Thanks for this, it clears a lot of things up for me...
> 
> ...but this statement:
> 
> > The only production worthy way is to have the FS be a partner in
> > making this work without requiring revoke, so the critical RDMA
> > traffic can operate safely.
> 
> ...reveals a path forward. Just swap out "FS be a partner" with "system
> administrator be a partner". In other words, if the RDMA stack can't
> tolerate an MR being disabled then the administrator needs to actively
> disable the paths that would trigger it. Turn off reflink, don't
> truncate, avoid any future FS feature that might generate unwanted
> lease breaks. We would need to make sure that lease notifications
> include the information to identify the lease breaker to debug escapes
> that might happen, but it is a solution that can be qualified to not
> lease break. In any event, this lets end users pick their filesystem
> (modulo RDMA incompatible features), provides an enumeration of lease
> break sources in the kernel, and opens up FS-DAX to a wider array of
> RDMA adapters. In general this is what Linux has historically done,
> give end users technology freedom.

To back off the details of this thread a bit...

The details of the limitations imposed, and how they would be tracked
within the kernel, would be a great thing to discuss face to face.  Hence
my proposal of this as a topic.

Ira



* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 16:55                     ` Christopher Lameter
  2019-02-07 17:35                       ` Ira Weiny
@ 2019-02-08  4:43                       ` Dave Chinner
  2019-02-08 11:10                         ` Jan Kara
  2019-02-08 15:33                         ` Christopher Lameter
  1 sibling, 2 replies; 106+ messages in thread
From: Dave Chinner @ 2019-02-08  4:43 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> One approach that may be a clean way to solve this:
> 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
>    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
>    on the longterm pinned range until the long term pin is removed.

So, ummm, how do we do block allocation then, which is done on
demand during writes?

IOWs, this requires the application to set up the file in the
correct state for the filesystem to lock it down so somebody else
can write to it.  That means the file can't be sparse, it can't be
preallocated (i.e. can't contain unwritten extents), it must have zeroes
written to its full size before being shared, because otherwise it
exposes stale data to the remote client (secure sites are going to
love that!), and it can't be extended, etc.

IOWs, once the file is prepped and leased out for RDMA, it becomes
immutable for the purposes of local access.

Which, essentially we can already do. Prep the file, map it
read/write, mark it immutable, then pin it via the longterm gup
interface which can do the necessary checks.

Simple to implement, the reasons for errors trying to modify the
file are already documented and queriable, and it's hard for
applications to get wrong.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-07 23:54                         ` Dan Williams
  2019-02-08  1:44                           ` Ira Weiny
@ 2019-02-08  5:19                           ` Jason Gunthorpe
  2019-02-08  7:20                             ` Dan Williams
  1 sibling, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-08  5:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 03:54:58PM -0800, Dan Williams wrote:

> > The only production worthy way is to have the FS be a partner in
> > making this work without requiring revoke, so the critical RDMA
> > traffic can operate safely.
> 
> ...reveals a path forward. Just swap out "FS be a partner" with "system
> administrator be a partner". In other words, if the RDMA stack can't
> tolerate an MR being disabled then the administrator needs to actively
> disable the paths that would trigger it. Turn off reflink, don't
> truncate, avoid any future FS feature that might generate unwanted
> lease breaks. 

This is what I suggested already, except with explicit kernel aid, not
left as some Gordian riddle for the administrator to unravel.

You already said it is too hard for expert FS developers to maintain a
mode switch; it seems like a really big stretch to think application
and systems architects will have any hope to do better.

It makes much more sense for the admin to flip some kind of bit and
have the FS guarantee the safety that you are asking the admin to create.

> We would need to make sure that lease notifications include the
> information to identify the lease breaker to debug escapes that
> might happen, but it is a solution that can be qualified to not
> lease break. 

I think building a complicated lease framework and then telling
everyone in user space to design around it so it never gets used would
be very hard to explain and justify.

Never mind the security implications if some seemingly harmless future
filesystem change causes unexpected lease revokes across something
like a tenant boundary.

> In any event, this lets end users pick their filesystem
> (modulo RDMA incompatible features), provides an enumeration of
> lease break sources in the kernel, and opens up FS-DAX to a wider
> array of RDMA adapters. In general this is what Linux has
> historically done, give end users technology freedom.

I think this is not the Linux model. The kernel should not allow
unpriv user space to do an operation that could be unsafe.

I continue to think this is the best idea that has come up - but
only if the filesystem is involved and expressly tells the kernel
layers that this combination of DAX & filesystem is safe.

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08  5:19                           ` Jason Gunthorpe
@ 2019-02-08  7:20                             ` Dan Williams
  2019-02-08 15:42                               ` Jason Gunthorpe
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-08  7:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 7, 2019 at 9:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Feb 07, 2019 at 03:54:58PM -0800, Dan Williams wrote:
>
> > > The only production worthy way is to have the FS be a partner in
> > > making this work without requiring revoke, so the critical RDMA
> > > traffic can operate safely.
> >
> > ...reveals a path forward. Just swap out "FS be a partner" with "system
> > administrator be a partner". In other words, if the RDMA stack can't
> > tolerate an MR being disabled then the administrator needs to actively
> > disable the paths that would trigger it. Turn off reflink, don't
> > truncate, avoid any future FS feature that might generate unwanted
> > lease breaks.
>
> This is what I suggested already, except with explicit kernel aid, not
> left as some gordian riddle for the administrator to unravel.

It's a riddle either way. "Why is my truncate failing?"

The lease path allows the riddle to be solved in a way that moves the
ecosystem forwards. It provides a mechanism to notify (effectively mmu
notifiers plumbed to userspace), an opportunity for capable RDMA apps /
drivers to do better than SIGKILL, and a path for filesystems to
continue to innovate and not make users choose filesystems just on the
chance they might need to do RDMA.

> You already said it is too hard for expert FS developers to maintain a
> mode switch

I do disagree with a truncate behavior switch, but reflink already has
a mkfs switch so it's obviously possible for any future feature that
might run afoul of the RDMA restrictions to have fs-feature control.

> , it seems like a really big stretch to think application
> and systems architects will have any hope to do better.

Certainly they can, it's just a matter of documenting options. It can
be made easier if we can get commonly named options across filesystems
to disable lease-dependent functionality.

> It makes much more sense for the admin to flip some kind of bit and
> the FS guarantees the safety that you are asking the admin to create.

Flipping the bit changes the ABI contract in backwards-incompatible
ways. I'm saying go the other way: audit the configuration for legacy
RDMA safety.

> > We would need to make sure that lease notifications include the
> > information to identify the lease breaker to debug escapes that
> > might happen, but it is a solution that can be qualified to not
> > lease break.
>
> I think building a complicated lease framework and then telling
> everyone in user space to design around it so it never gets used would
> be very hard to explain and justify.

There is no requirement to design around it. If an RDMA implementation
doesn't use it, the longterm GUPs are already blocked. If the
implementation does use it but fails to service lease breaks, it gets
SIGKILL with information about what led to the SIGKILL so the
configuration can be fixed. Implementations that want to do better
have an opportunity to be a partner to the filesystem and repair the
MR.

> Never mind the security implications if some seemingly harmless future
> filesystem change causes unexpected lease revokes across something
> like a tenant boundary.

Filesystems innovate quickly, but not that quickly. Ongoing
communication between FS and RDMA developers is not insurmountable.

> > In any event, this lets end users pick their filesystem
> > (modulo RDMA incompatible features), provides an enumeration of
> > lease break sources in the kernel, and opens up FS-DAX to a wider
> > array of RDMA adapters. In general this is what Linux has
> > historically done, give end users technology freedom.
>
> I think this is not the Linux model. The kernel should not allow
> unpriv user space to do an operation that could be unsafe.

There's permission to block unprivileged writes/truncates to a file,
otherwise I'm missing what hole is being opened? That said, the horse
already left the barn. Linux has already shipped in the page-cache
case "punch hole in the middle of an MR succeeds and leaves the state
of the file relative to ongoing RDMA inconsistent". Now that we know
about the bug the question is how do we do better than the current
status quo of taking all of the functionality away.

> I continue to think this is the best idea that has come up - but
> only if the filesystem is involved and expressly tells the kernel
> layers that this combination of DAX & filesystem is safe.

I think we're getting into "need to discuss at LSF/MM territory",
because the concept of "DAX safety", or even DAX as an explicit FS
capability has been a point of contention since day one. We're trying
to change DAX to be defined by mmap API flags like MAP_SYNC and maybe
MAP_DIRECT in the future.

For example, if the MR was not established against a MAP_SYNC vma then the
kernel should be free to indirect the RDMA through the page-cache like
the typical non-DAX case. DAX as a global setting is too coarse.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08  4:43                       ` Dave Chinner
@ 2019-02-08 11:10                         ` Jan Kara
  2019-02-08 20:50                           ` Dan Williams
  2019-02-08 21:20                           ` Dave Chinner
  2019-02-08 15:33                         ` Christopher Lameter
  1 sibling, 2 replies; 106+ messages in thread
From: Jan Kara @ 2019-02-08 11:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christopher Lameter, Doug Ledford, Dan Williams, Jason Gunthorpe,
	Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> >    on the longterm pinned range until the long term pin is removed.
> 
> So, ummm, how do we do block allocation then, which is done on
> demand during writes?
> 
> IOWs, this requires the application to set up the file in the
> correct state for the filesystem to lock it down so somebody else
> can write to it.  That means the file can't be sparse, it can't be
> preallocated (i.e. can't contain unwritten extents), it must have zeroes
> written to its full size before being shared because otherwise it
> exposes stale data to the remote client (secure sites are going to
> love that!), they can't be extended, etc.
> 
> IOWs, once the file is prepped and leased out for RDMA, it becomes
> immutable for the purposes of local access.
> 
> Which, essentially we can already do. Prep the file, map it
> read/write, mark it immutable, then pin it via the longterm gup
> interface which can do the necessary checks.

Hum, and what will you do if the immutable file that is the target for
RDMA becomes the source of a reflink? That seems to be currently allowed
for immutable files, but an RDMA store would effectively corrupt the data
of the target inode. But we could treat it similarly to swapfiles - those
also have to deal with writes to blocks beyond filesystem control. In fact
the similarity seems quite large there. What do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08  4:43                       ` Dave Chinner
  2019-02-08 11:10                         ` Jan Kara
@ 2019-02-08 15:33                         ` Christopher Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-08 15:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Fri, 8 Feb 2019, Dave Chinner wrote:

> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> >    on the longterm pinned range until the long term pin is removed.
>
> So, ummm, how do we do block allocation then, which is done on
> demand during writes?

If a memory region is mapped by RDMA then this is essentially one long
write. The allocation needs to happen at map time.

> IOWs, this requires the application to set up the file in the
> correct state for the filesystem to lock it down so somebody else
> can write to it.  That means the file can't be sparse, it can't be
> preallocated (i.e. can't contain unwritten extents), it must have zeroes
> written to its full size before being shared because otherwise it
> exposes stale data to the remote client (secure sites are going to
> love that!), they can't be extended, etc.

Yes. That is required.

> IOWs, once the file is prepped and leased out for RDMA, it becomes
> immutable for the purposes of local access.

The contents are mutable but the mapping to the physical medium is
immutable.


> Which, essentially we can already do. Prep the file, map it
> read/write, mark it immutable, then pin it via the longterm gup
> interface which can do the necessary checks.
>
> Simple to implement, the reasons for errors trying to modify the
> file are already documented and queriable, and it's hard for
> applications to get wrong.

Yup. Why not do it this way? Just make the long term GUP mapped sections
actually immutable and not subject to the other page cache machinery.

This is basically a straight through bypass of the page cache for a file.

HEY! It may be used to map huge pages in the future too!!


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08  7:20                             ` Dan Williams
@ 2019-02-08 15:42                               ` Jason Gunthorpe
  0 siblings, 0 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-08 15:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Doug Ledford, Christopher Lameter, Matthew Wilcox,
	Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Thu, Feb 07, 2019 at 11:20:37PM -0800, Dan Williams wrote:
> On Thu, Feb 7, 2019 at 9:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Feb 07, 2019 at 03:54:58PM -0800, Dan Williams wrote:
> >
> > > > The only production worthy way is to have the FS be a partner in
> > > > making this work without requiring revoke, so the critical RDMA
> > > > traffic can operate safely.
> > >
> > > ...reveals a path forward. Just swap out "FS be a partner" with "system
> > > administrator be a partner". In other words, if the RDMA stack can't
> > > tolerate an MR being disabled then the administrator needs to actively
> > > disable the paths that would trigger it. Turn off reflink, don't
> > > truncate, avoid any future FS feature that might generate unwanted
> > > lease breaks.
> >
> > This is what I suggested already, except with explicit kernel aid, not
> > left as some Gordian riddle for the administrator to unravel.
> 
> It's a riddle either way. "Why is my truncate failing?"

At least that riddle errs on the side of system safety and will not be
hit anyhow. Doug is right, we already allow ftruncate to fail with
PROT_EXEC maps (ETXTBSY), so this isn't even abnormal.

Or do as CL says and let the ftruncate succeed while nothing actually
happens (the continuous write philosophy).

> > You already said it is too hard for expert FS developers to maintain a
> > mode switch
> 
> I do disagree with a truncate behavior switch, but reflink already has
> a mkfs switch so it's obviously possible for any future feature that
> might run afoul of the RDMA restrictions to have fs-feature control.

More precedent that this is the right path...

> > It makes much more sense for the admin to flip some kind of bit and
> > the FS guarantees the safety that you are asking the admin to create.
> 
> Flipping the bit changes the ABI contract in backwards incompatible
> ways. I'm saying go the other way, audit the configuration for legacy
> RDMA safety.

We have precedent for this too. Lots of FSs don't support hole punch,
reflink, or in some rare cases ftruncate. It is not exactly new ground.
 
> > > In any event, this lets end users pick their filesystem
> > > (modulo RDMA incompatible features), provides an enumeration of
> > > lease break sources in the kernel, and opens up FS-DAX to a wider
> > > array of RDMA adapters. In general this is what Linux has
> > > historically done, give end users technology freedom.
> >
> > I think this is not the Linux model. The kernel should not allow
> > unpriv user space to do an operation that could be unsafe.
> 
> There's permission to block unprivileged writes/truncates to a file,
> otherwise I'm missing what hole is being opened? That said, the horse
> already left the barn. Linux has already shipped in the page-cache
> case "punch hole in the middle of an MR succeeds and leaves the state
> of the file relative to ongoing RDMA inconsistent". Now that we know
> about the bug the question is how do we do better than the current
> status quo of taking all of the functionality away.

I've always felt this is a bug in RDMA - but we have no path to fix
it. The best I can say is that it doesn't cause any security or
corruption problem today.

> > I continue to think this is the best idea that has come up - but
> > only if the filesystem is involved and expressly tells the kernel
> > layers that this combination of DAX & filesystem is safe.
> 
> I think we're getting into "need to discuss at LSF/MM territory",
> because the concept of "DAX safety", or even DAX as an explicit FS
> capability has been a point of contention since day one. We're trying
> to change DAX to be defined by mmap API flags like MAP_SYNC and maybe
> MAP_DIRECT in the future.
> 
> For example, if the MR was not established against a MAP_SYNC vma then the
> kernel should be free to indirect the RDMA through the page-cache like
> the typical non-DAX case. DAX as a global setting is too coarse.

Whatever this flag would be, it is linked to the mmap, and maybe you
could make it a per-file flag instead of some mount option, I don't
know. Kind of up to the FS.

I'm just advocating for the idea that the FS itself can reject/deny the
longterm pin request based on its internal status.  If the FS meets
the defined contract then it can allow the longterm pin to proceed. Otherwise
it fails.

I feel this is what people actually want here, and is a far more
maintainable overall system than some sketchy lease-revoke-SIGKILL scheme.

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08 11:10                         ` Jan Kara
@ 2019-02-08 20:50                           ` Dan Williams
  2019-02-11 10:24                             ` Jan Kara
  2019-02-08 21:20                           ` Dave Chinner
  1 sibling, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-08 20:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > One approach that may be a clean way to solve this:
> > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > >    on the longterm pinned range until the long term pin is removed.
> >
> > So, ummm, how do we do block allocation then, which is done on
> > demand during writes?
> >
> > IOWs, this requires the application to set up the file in the
> > correct state for the filesystem to lock it down so somebody else
> > can write to it.  That means the file can't be sparse, it can't be
> > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > written to its full size before being shared because otherwise it
> > exposes stale data to the remote client (secure sites are going to
> > love that!), they can't be extended, etc.
> >
> > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > immutable for the purposes of local access.
> >
> > Which, essentially we can already do. Prep the file, map it
> > read/write, mark it immutable, then pin it via the longterm gup
> > interface which can do the necessary checks.
>
> Hum, and what will you do if the immutable file that is target for RDMA
> will be a source of reflink? That seems to be currently allowed for
> immutable files but RDMA store would be effectively corrupting the data of
> the target inode. But we could treat it similarly as swapfiles - those also
> have to deal with writes to blocks beyond filesystem control. In fact the
> similarity seems to be quite large there. What do you think?

This sounds so familiar...

    https://lwn.net/Articles/726481/

I'm not opposed to trying again, but leases were what crawled out of
the smoking crater when this last proposal was nuked.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08 11:10                         ` Jan Kara
  2019-02-08 20:50                           ` Dan Williams
@ 2019-02-08 21:20                           ` Dave Chinner
  1 sibling, 0 replies; 106+ messages in thread
From: Dave Chinner @ 2019-02-08 21:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christopher Lameter, Doug Ledford, Dan Williams, Jason Gunthorpe,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Fri, Feb 08, 2019 at 12:10:28PM +0100, Jan Kara wrote:
> On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > One approach that may be a clean way to solve this:
> > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > >    on the longterm pinned range until the long term pin is removed.
> > 
> > So, ummm, how do we do block allocation then, which is done on
> > demand during writes?
> > 
> > IOWs, this requires the application to set up the file in the
> > correct state for the filesystem to lock it down so somebody else
> > can write to it.  That means the file can't be sparse, it can't be
> > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > written to its full size before being shared because otherwise it
> > exposes stale data to the remote client (secure sites are going to
> > love that!), they can't be extended, etc.
> > 
> > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > immutable for the purposes of local access.
> > 
> > Which, essentially we can already do. Prep the file, map it
> > read/write, mark it immutable, then pin it via the longterm gup
> > interface which can do the necessary checks.
> 
> Hum, and what will you do if the immutable file that is target for RDMA
> will be a source of reflink?

I think we'd have to disallow it. reflink does actually change the
source inode on XFS (adds an inode flag to say it has shared
extents)...

Similarly, we'd have to make sure the inode is pinned in memory
by the gup_longterm operation, not just have its pages pinned...

> That seems to be currently allowed for
> immutable files but RDMA store would be effectively corrupting the data of
> the target inode. But we could treat it similarly as swapfiles - those also
> have to deal with writes to blocks beyond filesystem control. In fact the
> similarity seems to be quite large there. What do you think?

Yes, swapfiles are probably a better analogy: the mm subsystem
pins them, maps them after checking the layout is appropriate (i.e. no
holes), and then writes straight through them without the filesystem
being aware of the IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-08 20:50                           ` Dan Williams
@ 2019-02-11 10:24                             ` Jan Kara
  2019-02-11 17:22                               ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kara @ 2019-02-11 10:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
	Jason Gunthorpe, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Fri 08-02-19 12:50:37, Dan Williams wrote:
> On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > One approach that may be a clean way to solve this:
> > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > >    on the longterm pinned range until the long term pin is removed.
> > >
> > > So, ummm, how do we do block allocation then, which is done on
> > > demand during writes?
> > >
> > > IOWs, this requires the application to set up the file in the
> > > correct state for the filesystem to lock it down so somebody else
> > > can write to it.  That means the file can't be sparse, it can't be
> > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > written to its full size before being shared because otherwise it
> > > exposes stale data to the remote client (secure sites are going to
> > > love that!), they can't be extended, etc.
> > >
> > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > an immutable for the purposes of local access.
> > >
> > > Which, essentially we can already do. Prep the file, map it
> > > read/write, mark it immutable, then pin it via the longterm gup
> > > interface which can do the necessary checks.
> >
> > Hum, and what will you do if the immutable file that is target for RDMA
> > will be a source of reflink? That seems to be currently allowed for
> > immutable files but RDMA store would be effectively corrupting the data of
> > the target inode. But we could treat it similarly as swapfiles - those also
> > have to deal with writes to blocks beyond filesystem control. In fact the
> > similarity seems to be quite large there. What do you think?
> 
> This sounds so familiar...
> 
>     https://lwn.net/Articles/726481/
> 
> I'm not opposed to trying again, but leases was what crawled out
> smoking crater when this last proposal was nuked.

Umm, I don't think this is that similar to the daxctl() discussion. We are
not speaking about providing any new userspace API for this. Also I think
the situation around leases has somewhat cleared up with this discussion:
ODP hardware does not need leases since it can use MMU notifiers; for
non-ODP hardware it is difficult to handle leases, as such hardware has
only one big kill-everything call, and using that would effectively mean a
lot of work on the userspace side to re-set up everything to make things
usable again, if that is workable at all.

So my proposal would be:

1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
its teardown when the fs needs it.

2) Hardware not capable of tearing down pins from MMU notifiers will have
to use gup_longterm() (we may actually rename it to a more suitable name).
The FS may just refuse such calls (for a normal page-cache-backed file it
will just return success, but for a DAX file it will do sanity checks such
as whether the file is fully allocated, like we currently do for
swapfiles); if gup_longterm() returns success, it will provide the same
guarantees as for swapfiles. So the only thing that we need is some call
from gup_longterm() to a filesystem callback to tell it: this file is going
to be used by a third party as an IO buffer, don't touch it. And we can
(and should) probably refactor the handling to be shared between swapfiles
and gup_longterm().
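
A rough sketch of what that callback could look like at the VFS boundary —
this is pseudocode for discussion only; the `->longterm_pin` address_space
operation and the `file_backing_range()` helper are invented names for
illustration, nothing like them exists in the tree today:

```c
/* Pseudocode sketch -- the ->longterm_pin() aop is hypothetical. */
long gup_longterm(unsigned long start, unsigned long nr_pages,
		  unsigned int gup_flags, struct page **pages)
{
	struct file *file = file_backing_range(start);	/* invented helper */

	if (file && file->f_mapping->a_ops->longterm_pin) {
		/* Page-cache FS: returns 0 unconditionally.
		 * DAX FS: swapfile-style checks -- fully allocated,
		 * no holes/unwritten extents, not a reflink source --
		 * then marks the range hands-off until unpin. */
		long ret = file->f_mapping->a_ops->longterm_pin(
				file, start, nr_pages);
		if (ret)
			return ret;	/* FS refuses the long-term pin */
	}
	return get_user_pages(start, nr_pages, gup_flags, pages, NULL);
}
```

The point of the sketch is only that the refuse/allow decision moves into the
filesystem, mirroring how swapfile activation already validates the layout.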

								Honza


-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 10:24                             ` Jan Kara
@ 2019-02-11 17:22                               ` Dan Williams
  2019-02-11 18:06                                 ` Jason Gunthorpe
  2019-02-12 16:07                                 ` Jan Kara
  0 siblings, 2 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-11 17:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > One approach that may be a clean way to solve this:
> > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > >    on the longterm pinned range until the long term pin is removed.
> > > >
> > > > So, ummm, how do we do block allocation then, which is done on
> > > > demand during writes?
> > > >
> > > > IOWs, this requires the application to set up the file in the
> > > > correct state for the filesystem to lock it down so somebody else
> > > > can write to it.  That means the file can't be sparse, it can't be
> > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > written to it's full size before being shared because otherwise it
> > > > exposes stale data to the remote client (secure sites are going to
> > > > love that!), they can't be extended, etc.
> > > >
> > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > an immutable for the purposes of local access.
> > > >
> > > > Which, essentially we can already do. Prep the file, map it
> > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > interface which can do the necessary checks.
> > >
> > > Hum, and what will you do if the immutable file that is target for RDMA
> > > will be a source of reflink? That seems to be currently allowed for
> > > immutable files but RDMA store would be effectively corrupting the data of
> > > the target inode. But we could treat it similarly as swapfiles - those also
> > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > similarity seems to be quite large there. What do you think?
> >
> > This sounds so familiar...
> >
> >     https://lwn.net/Articles/726481/
> >
> > I'm not opposed to trying again, but leases was what crawled out
> > smoking crater when this last proposal was nuked.
>
> Umm, don't think this is that similar to daxctl() discussion. We are not
> speaking about providing any new userspace API for this.

I thought explicit userspace API was one of the outcomes, i.e. that we
can't depend on this behavior being an implicit side effect of a page
pin?

> Also I think the
> situation about leases has somewhat cleared up with this discussion - ODP
> hardware does not need leases since it can use MMU notifiers, for non-ODP
> hardware it is difficult to handle leases as such hardware has only one big
> kill-everything call and using that would effectively mean lot of work on
> the userspace side to resetup everything to make things useful if workable
> at all.
>
> So my proposal would be:
>
> 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> its teardown when fs needs it.
>
> 2) Hardware not capable of tearing down pins from MMU notifiers will have
> to use gup_longterm() (we may actually rename it to a more suitable name).
> FS may just refuse such calls (for normal page cache backed file, it will
> just return success but for DAX file it will do sanity checks whether the
> file is fully allocated etc. like we currently do for swapfiles) but if
> gup_longterm() returns success, it will provide the same guarantees as for
> swapfiles. So the only thing that we need is some call from gup_longterm()
> to a filesystem callback to tell it - this file is going to be used by a
> third party as an IO buffer, don't touch it. And we can (and should)
> probably refactor the handling to be shared between swapfiles and
> gup_longterm().

Yes, let's pursue this. At the risk of "arguing past 'yes'" this is a
solution I thought we dax folks walked away from in the original
MAP_DIRECT discussion [1]. Here is where leases were the response to
MAP_DIRECT [2]. ...and here is where we had tame discussions about
implications of notifying memory-registrations of lease break events
[3].

I honestly don't like the idea that random subsystems can pin down
file blocks as a side effect of gup on the result of mmap. Recall that
it's not just RDMA that wants this guarantee. It seems safer to have
the file be in an explicit block-allocation-immutable-mode so that the
fallocate man page can describe this error case. Otherwise how would
you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?

[1]: https://lwn.net/Articles/736333/
[2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
[3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 17:22                               ` Dan Williams
@ 2019-02-11 18:06                                 ` Jason Gunthorpe
  2019-02-11 18:15                                   ` Dan Williams
                                                     ` (3 more replies)
  2019-02-12 16:07                                 ` Jan Kara
  1 sibling, 4 replies; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 18:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:

> I honestly don't like the idea that random subsystems can pin down
> file blocks as a side effect of gup on the result of mmap. Recall that
> it's not just RDMA that wants this guarantee. It seems safer to have
> the file be in an explicit block-allocation-immutable-mode so that the
> fallocate man page can describe this error case. Otherwise how would
> you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?

I rather liked CL's version of this - ftruncate/etc is simply racing
with a parallel pwrite - and it doesn't fail.

But it also doesn't truncate/create a hole. Another thread wrote to it
right away and the 'hole' was essentially instantly reallocated. This
is an inherent, pre-existing race in the ftruncate/etc APIs.

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:06                                 ` Jason Gunthorpe
@ 2019-02-11 18:15                                   ` Dan Williams
  2019-02-11 18:19                                   ` Ira Weiny
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-11 18:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 10:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
>
> But it also doesn't truncate/create a hole. Another thread wrote to it
> right away and the 'hole' was essentially instantly reallocated. This
> is an inherent, pre-existing race in the ftruncate/etc APIs.

If options are telling the truth with a potentially unexpected error,
or lying that operation succeeded when it will be immediately undone,
I'd choose the former.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:06                                 ` Jason Gunthorpe
  2019-02-11 18:15                                   ` Dan Williams
@ 2019-02-11 18:19                                   ` Ira Weiny
  2019-02-11 18:26                                     ` Jason Gunthorpe
                                                       ` (2 more replies)
  2019-02-12 16:28                                   ` Jan Kara
  2019-02-14 20:26                                   ` Jerome Glisse
  3 siblings, 3 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-11 18:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
	Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> 
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> 
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
> 
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing race in the ftruncate/etc APIs.

I kind of like it as well, except Christopher did not answer my question:

What if user space then writes to the end of the file with a regular write?
Does that write end up at the point they truncated to or off the end of the
mmaped area (old length)?

To make this work I think it has to be the latter.  And as you say the semantic
is as if another thread wrote to the file first (but in this case the other
thread is the RDMA device).

In addition I'm not sure what the overall work is for this case?

John's patches will indicate to the FS that the page is gup-pinned.  But they
will not indicate longterm vs "shortterm".  A shortterm pin could be handled
as a "real truncate".  So, are we back to needing a longterm "bit" in struct
page to indicate a longterm pin and allow the FS to perform this "virtual
write" after truncate?

Or is it safe to consider all gup pinned pages this way?

Ira



* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:19                                   ` Ira Weiny
@ 2019-02-11 18:26                                     ` Jason Gunthorpe
  2019-02-11 18:40                                       ` Matthew Wilcox
  2019-02-11 21:08                                     ` Jerome Glisse
  2019-02-11 21:22                                     ` John Hubbard
  2 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 18:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
	Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> > 
> > > I honestly don't like the idea that random subsystems can pin down
> > > file blocks as a side effect of gup on the result of mmap. Recall that
> > > it's not just RDMA that wants this guarantee. It seems safer to have
> > > the file be in an explicit block-allocation-immutable-mode so that the
> > > fallocate man page can describe this error case. Otherwise how would
> > > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> > 
> > I rather liked CL's version of this - ftruncate/etc is simply racing
> > with a parallel pwrite - and it doesn't fail.
> > 
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing race in the ftruncate/etc APIs.
> 
> I kind of like it as well, except Christopher did not answer my question:
> 
> What if user space then writes to the end of the file with a regular write?
> Does that write end up at the point they truncated to or off the end of the
> mmaped area (old length)?

IIRC it depends on how the user does the write...

pwrite() with a given offset will write to that offset, re-extending
the file if needed

A file opened with O_APPEND and a write done with write() should
append to the new end

A normal file with a normal write should write to the FD's current
seek pointer.

I'm not sure what happens if you write via mmap/msync.

RDMA is similar to pwrite() and mmap.

> Or is it safe to consider all gup pinned pages this way?

O_DIRECT still has to work sensibly, and if you ftruncate something
that is currently being written with O_DIRECT it should behave the
same as if the CPU touched the mmap'd memory, IMHO.

The only real change here is that if there is a GUP then ftruncate/etc
races are always resolved as 'GUP user goes last' instead of randomly.

ftruncate/etc already only work as you'd expect if the operator has
excluded writes. Otherwise blocks are instantly reallocated by another
racing thread. 

I'm not sure why RDMA should be so special to earn an error code ..

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:26                                     ` Jason Gunthorpe
@ 2019-02-11 18:40                                       ` Matthew Wilcox
  2019-02-11 19:58                                         ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-11 18:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > What if user space then writes to the end of the file with a regular write?
> > Does that write end up at the point they truncated to or off the end of the
> > mmaped area (old length)?
> 
> IIRC it depends how the user does the write..
> 
> pwrite() with a given offset will write to that offset, re-extending
> the file if needed
> 
> A file opened with O_APPEND and a write done with write() should
> append to the new end
> 
> A normal file with a normal write should write to the FD's current
> seek pointer.
> 
> I'm not sure what happens if you write via mmap/msync.
> 
> RDMA is similar to pwrite() and mmap.

A pertinent point that you didn't mention is that ftruncate() does not change
the file offset.  So there's no user-visible change in behaviour.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:40                                       ` Matthew Wilcox
@ 2019-02-11 19:58                                         ` Dan Williams
  2019-02-11 20:49                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 106+ messages in thread
From: Dan Williams @ 2019-02-11 19:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Ira Weiny, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > What if user space then writes to the end of the file with a regular write?
> > > Does that write end up at the point they truncated to or off the end of the
> > > mmaped area (old length)?
> >
> > IIRC it depends how the user does the write..
> >
> > pwrite() with a given offset will write to that offset, re-extending
> > the file if needed
> >
> > A file opened with O_APPEND and a write done with write() should
> > append to the new end
> >
> > A normal file with a normal write should write to the FD's current
> > seek pointer.
> >
> > I'm not sure what happens if you write via mmap/msync.
> >
> > RDMA is similar to pwrite() and mmap.
>
> A pertinent point that you didn't mention is that ftruncate() does not change
> the file offset.  So there's no user-visible change in behaviour.

...but there is. The blocks you thought you freed, especially if the
system was under -ENOSPC pressure, won't actually be free after the
successful ftruncate().


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 19:58                                         ` Dan Williams
@ 2019-02-11 20:49                                           ` Jason Gunthorpe
  2019-02-11 21:02                                             ` Dan Williams
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 20:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > What if user space then writes to the end of the file with a regular write?
> > > > Does that write end up at the point they truncated to or off the end of the
> > > > mmaped area (old length)?
> > >
> > > IIRC it depends how the user does the write..
> > >
> > > pwrite() with a given offset will write to that offset, re-extending
> > > the file if needed
> > >
> > > A file opened with O_APPEND and a write done with write() should
> > > append to the new end
> > >
> > > A normal file with a normal write should write to the FD's current
> > > seek pointer.
> > >
> > > I'm not sure what happens if you write via mmap/msync.
> > >
> > > RDMA is similar to pwrite() and mmap.
> >
> > A pertinent point that you didn't mention is that ftruncate() does not change
> > the file offset.  So there's no user-visible change in behaviour.
> 
> ...but there is. The blocks you thought you freed, especially if the
> system was under -ENOSPC pressure, won't actually be free after the
> successful ftruncate().

They won't be free after something dirties the existing mmap either.

Blocks also won't be free if you unlink a file that is currently still
open.

This isn't really new behavior for a FS.

Jason


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 20:49                                           ` Jason Gunthorpe
@ 2019-02-11 21:02                                             ` Dan Williams
  2019-02-11 21:09                                               ` Jason Gunthorpe
  2019-02-12 16:36                                               ` Christopher Lameter
  0 siblings, 2 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-11 21:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > What if user space then writes to the end of the file with a regular write?
> > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > mmaped area (old length)?
> > > >
> > > > IIRC it depends how the user does the write..
> > > >
> > > > pwrite() with a given offset will write to that offset, re-extending
> > > > the file if needed
> > > >
> > > > A file opened with O_APPEND and a write done with write() should
> > > > append to the new end
> > > >
> > > > A normal file with a normal write should write to the FD's current
> > > > seek pointer.
> > > >
> > > > I'm not sure what happens if you write via mmap/msync.
> > > >
> > > > RDMA is similar to pwrite() and mmap.
> > >
> > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > the file offset.  So there's no user-visible change in behaviour.
> >
> > ...but there is. The blocks you thought you freed, especially if the
> > system was under -ENOSPC pressure, won't actually be free after the
> > successful ftruncate().
>
> They won't be free after something dirties the existing mmap either.
>
> Blocks also won't be free if you unlink a file that is currently still
> open.
>
> This isn't really new behavior for a FS.

An mmap write after a fault due to a hole punch is free to trigger
SIGBUS if the subsequent page allocation fails. So no, I don't see
them as the same unless you're allowing for the holder of the MR to
receive a re-fault failure.


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:19                                   ` Ira Weiny
  2019-02-11 18:26                                     ` Jason Gunthorpe
@ 2019-02-11 21:08                                     ` Jerome Glisse
  2019-02-11 21:22                                     ` John Hubbard
  2 siblings, 0 replies; 106+ messages in thread
From: Jerome Glisse @ 2019-02-11 21:08 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
	Michal Hocko

On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> > 
> > > I honestly don't like the idea that random subsystems can pin down
> > > file blocks as a side effect of gup on the result of mmap. Recall that
> > > it's not just RDMA that wants this guarantee. It seems safer to have
> > > the file be in an explicit block-allocation-immutable-mode so that the
> > > fallocate man page can describe this error case. Otherwise how would
> > > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> > 
> > I rather liked CL's version of this - ftruncate/etc is simply racing
> > with a parallel pwrite - and it doesn't fail.
> > 
> > But it also doesnt' trucate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing, race in the ftrucate/etc APIs.
> 
> I kind of like it as well, except Christopher did not answer my question:
> 
> What if user space then writes to the end of the file with a regular write?
> Does that write end up at the point they truncated to or off the end of the
> mmaped area (old length)?
> 
> To make this work I think it has to be the latter.  And as you say the semantic
> is as if another thread wrote to the file first (but in this case the other
> thread is the RDMA device).
> 
> In addition I'm not sure what the overall work is for this case?
> 
> John's patches will indicate to the FS that the page is gup pinned.  But they
> will not indicate longterm vs "shortterm".  A shortterm pin could be handled
> as a "real truncate".  So, are we back to needing a longterm "bit" in struct
> page to indicate a longterm pin and allow the FS to perform this "virtual
> write" after truncate?
> 
> Or is it safe to consider all gup pinned pages this way?

So I have been working on several patchsets to convert all users that can
abide by MMU notifiers to HMM mirror, which does not pin pages, i.e. does
not take a reference on the page. So all the leftover GUP users would be
the long-term problematic ones, with a few exceptions: direct I/O, KVM (I
think Xen too, but I am less familiar with that), and virtio.

For direct I/O I believe the "ignore the truncate" solution would work too.
KVM and virtio, I think, only do GUP on anonymous memory.

So the answer would be that it is safe to consider all pinned pages as
being long-term pins.

Cheers,
Jérôme


* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 21:02                                             ` Dan Williams
@ 2019-02-11 21:09                                               ` Jason Gunthorpe
  2019-02-12 16:34                                                 ` Jan Kara
  2019-02-12 16:36                                               ` Christopher Lameter
  1 sibling, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 21:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 01:02:37PM -0800, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > > What if user space then writes to the end of the file with a regular write?
> > > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > > mmaped area (old length)?
> > > > >
> > > > > IIRC it depends how the user does the write..
> > > > >
> > > > > pwrite() with a given offset will write to that offset, re-extending
> > > > > the file if needed
> > > > >
> > > > > A file opened with O_APPEND and a write done with write() should
> > > > > append to the new end
> > > > >
> > > > > A normal file with a normal write should write to the FD's current
> > > > > seek pointer.
> > > > >
> > > > > I'm not sure what happens if you write via mmap/msync.
> > > > >
> > > > > RDMA is similar to pwrite() and mmap.
> > > >
> > > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > > the file offset.  So there's no user-visible change in behaviour.
> > >
> > > ...but there is. The blocks you thought you freed, especially if the
> > > system was under -ENOSPC pressure, won't actually be free after the
> > > successful ftruncate().
> >
> > They won't be free after something dirties the existing mmap either.
> >
> > Blocks also won't be free if you unlink a file that is currently still
> > open.
> >
> > This isn't really new behavior for a FS.
> 
> An mmap write after a fault due to a hole punch is free to trigger
> SIGBUS if the subsequent page allocation fails.

Isn't that already racy? If the mmap user is fast enough, can't it
prevent the page from becoming freed in the first place today?

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:19                                   ` Ira Weiny
  2019-02-11 18:26                                     ` Jason Gunthorpe
  2019-02-11 21:08                                     ` Jerome Glisse
@ 2019-02-11 21:22                                     ` John Hubbard
  2019-02-11 22:12                                       ` Jason Gunthorpe
  2 siblings, 1 reply; 106+ messages in thread
From: John Hubbard @ 2019-02-11 21:22 UTC (permalink / raw)
  To: Ira Weiny, Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
	Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, Jerome Glisse, Michal Hocko

On 2/11/19 10:19 AM, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
>> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
[...]
> John's patches will indicate to the FS that the page is gup pinned.  But they
> will not indicate longterm vs not "shorterm".  A shortterm pin could be handled
> as a "real truncate".  So, are we back to needing a longterm "bit" in struct
> page to indicate a longterm pin and allow the FS to perform this "virtual
> write" after truncate?
> 
> Or is it safe to consider all gup pinned pages this way?
> 
> Ira
> 

I mentioned this in another thread, but I'm not great at email threading. :)
Anyway, it seems better to just drop the entire "longterm" concept from the 
internal APIs, and just deal in "it's either gup-pinned *at the moment*, or 
it's not". And let the filesystem respond appropriately. So for a pinned page 
that hits clear_page_dirty_for_io() or whatever else cares about pinned pages:

-- fire mmu notifiers, revoke leases, generally do everything as if it were a
long term gup pin

-- if it's long term, then you've taken the right actions.

-- if the pin really is short term, everything works great anyway.


The only way that breaks is if longterm pins imply an irreversible action, such
as blocking and waiting in a way that you can't back out of or get interrupted
out of. And the design doesn't seem to be going in that direction, right?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 21:22                                     ` John Hubbard
@ 2019-02-11 22:12                                       ` Jason Gunthorpe
  2019-02-11 22:33                                         ` John Hubbard
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 22:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, Jerome Glisse,
	Michal Hocko

On Mon, Feb 11, 2019 at 01:22:11PM -0800, John Hubbard wrote:

> The only way that breaks is if longterm pins imply an irreversible action, such
> as blocking and waiting in a way that you can't back out of or get interrupted
> out of. And the design doesn't seem to be going in that direction, right?

RDMA, vfio, etc will always have 'long term' pins that are
irreversible on demand. It is part of the HW capability.

I think the flag is badly named, it is really more of a
GUP_LOCK_PHYSICAL_ADDRESSES flag.

i.e. indicate to the FS that it should not attempt to remap physical
memory addresses backing this VMA. If the FS can't do that, it must
fail.

Short term GUP doesn't need that kind of lock.

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 22:12                                       ` Jason Gunthorpe
@ 2019-02-11 22:33                                         ` John Hubbard
  2019-02-12 16:39                                           ` Christopher Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: John Hubbard @ 2019-02-11 22:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, Jerome Glisse,
	Michal Hocko

On 2/11/19 2:12 PM, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 01:22:11PM -0800, John Hubbard wrote:
> 
>> The only way that breaks is if longterm pins imply an irreversible action, such
>> as blocking and waiting in a way that you can't back out of or get interrupted
>> out of. And the design doesn't seem to be going in that direction, right?
> 
> RDMA, vfio, etc will always have 'long term' pins that are
> irreversible on demand. It is part of the HW capability.
> 

Yes, I get that about the HW. But I didn't quite phrase it accurately. What I
meant was, irreversible from the kernel code's point of view; specifically,
the filesystem while in various writeback paths.

But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
my head--if we actually go that way, and prevent the caller from setting up
a problematic gup pin in the first place, then that may make this point sort
of moot.


> I think the flag is badly named, it is really more of a
> GUP_LOCK_PHYSICAL_ADDRESSES flag.
> 
> i.e. indicate to the FS that it should not attempt to remap physical
> memory addresses backing this VMA. If the FS can't do that, it must
> fail.
> 

Yes. Duration is probably less important than the fact that the page
is specially treated.

[1] https://lore.kernel.org/r/20190211102402.GF19029@quack2.suse.cz
thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 17:22                               ` Dan Williams
  2019-02-11 18:06                                 ` Jason Gunthorpe
@ 2019-02-12 16:07                                 ` Jan Kara
  2019-02-12 21:53                                   ` Dan Williams
  1 sibling, 1 reply; 106+ messages in thread
From: Jan Kara @ 2019-02-12 16:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
	Jason Gunthorpe, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon 11-02-19 09:22:58, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > One approach that may be a clean way to solve this:
> > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > >    on the longterm pinned range until the long term pin is removed.
> > > > >
> > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > demand during writes?
> > > > >
> > > > > IOWs, this requires the application to set up the file in the
> > > > > correct state for the filesystem to lock it down so somebody else
> > > > > can write to it.  That means the file can't be sparse, it can't be
> > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > written to its full size before being shared because otherwise it
> > > > > exposes stale data to the remote client (secure sites are going to
> > > > > love that!), they can't be extended, etc.
> > > > >
> > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > an immutable for the purposes of local access.
> > > > >
> > > > > Which, essentially we can already do. Prep the file, map it
> > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > interface which can do the necessary checks.
> > > >
> > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > will be a source of reflink? That seems to be currently allowed for
> > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > similarity seems to be quite large there. What do you think?
> > >
> > > This sounds so familiar...
> > >
> > >     https://lwn.net/Articles/726481/
> > >
> > > I'm not opposed to trying again, but leases was what crawled out
> > > smoking crater when this last proposal was nuked.
> >
> > Umm, don't think this is that similar to daxctl() discussion. We are not
> > speaking about providing any new userspace API for this.
> 
> I thought explicit userspace API was one of the outcomes, i.e. that we
> can't depend on this behavior being an implicit side effect of a page
> pin?

I was thinking of an implicit side effect of the gup_longterm() call.
Similarly, swapon(2) does not require the file to be marked in any special
way. But OTOH I agree that RDMA is a less controlled usage than swapon, so
it is questionable. I'd still require something like CAP_LINUX_IMMUTABLE at
least for gup_longterm() calls that end up pinning the file.

Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
will succeed only if there is an FL_LAYOUT lease for the range being
pinned, and we don't allow the lease to be released while there's a pinned
page in the range. And we make the file protected (i.e. treat it like a
swapfile) if there's any such lease on it. But this is just a rough sketch
and needs more thinking.

> > Also I think the
> > situation about leases has somewhat cleared up with this discussion - ODP
> > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > hardware it is difficult to handle leases as such hardware has only one big
> > kill-everything call and using that would effectively mean lot of work on
> > the userspace side to resetup everything to make things useful if workable
> > at all.
> >
> > So my proposal would be:
> >
> > 1) ODP hardware uses gup_fast() like direct I/O and uses MMU notifiers to do
> > its teardown when fs needs it.
> >
> > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > to use gup_longterm() (we may actually rename it to a more suitable name).
> > FS may just refuse such calls (for normal page cache backed file, it will
> > just return success but for DAX file it will do sanity checks whether the
> > file is fully allocated etc. like we currently do for swapfiles) but if
> > gup_longterm() returns success, it will provide the same guarantees as for
> > swapfiles. So the only thing that we need is some call from gup_longterm()
> > to a filesystem callback to tell it - this file is going to be used by a
> > third party as an IO buffer, don't touch it. And we can (and should)
> > probably refactor the handling to be shared between swapfiles and
> > gup_longterm().
> 
> Yes, let's pursue this. At the risk of "arguing past 'yes'" this is a
> solution I thought we dax folks walked away from in the original
> MAP_DIRECT discussion [1]. Here is where leases were the response to
> MAP_DIRECT [2]. ...and here is where we had tame discussions about
> implications of notifying memory-registrations of lease break events
> [3].

Yeah, thanks for the references.

> I honestly don't like the idea that random subsystems can pin down
> file blocks as a side effect of gup on the result of mmap. Recall that
> it's not just RDMA that wants this guarantee. It seems safer to have
> the file be in an explicit block-allocation-immutable-mode so that the
> fallocate man page can describe this error case. Otherwise how would
> you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?

So with requiring a lease for gup_longterm() to succeed (and the
FALLOC_FL_PUNCH_HOLE failure being keyed off the existence of such a
lease), does it look more reasonable to you?

> [1]: https://lwn.net/Articles/736333/
> [2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
> [3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:06                                 ` Jason Gunthorpe
  2019-02-11 18:15                                   ` Dan Williams
  2019-02-11 18:19                                   ` Ira Weiny
@ 2019-02-12 16:28                                   ` Jan Kara
  2019-02-14 20:26                                   ` Jerome Glisse
  3 siblings, 0 replies; 106+ messages in thread
From: Jan Kara @ 2019-02-12 16:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
	Doug Ledford, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon 11-02-19 11:06:54, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> 
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> 
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.

The problem is that page pins are not really like pwrite(). They are more
like mmap access, and that will just SIGBUS after truncate. So from the
user's point of view I agree the result may not be that surprising (it
would seem just as if somebody did an additional pwrite) but from the
filesystem's point of view it is very different and it would mean special
handling in lots of places. So I think that locking down the file before
allowing gup_longterm() looks like a more viable alternative.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 21:09                                               ` Jason Gunthorpe
@ 2019-02-12 16:34                                                 ` Jan Kara
  2019-02-12 16:55                                                   ` Christopher Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Kara @ 2019-02-12 16:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon 11-02-19 14:09:56, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 01:02:37PM -0800, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > > > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > > > What if user space then writes to the end of the file with a regular write?
> > > > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > > > mmaped area (old length)?
> > > > > >
> > > > > > IIRC it depends how the user does the write..
> > > > > >
> > > > > > pwrite() with a given offset will write to that offset, re-extending
> > > > > > the file if needed
> > > > > >
> > > > > > A file opened with O_APPEND and a write done with write() should
> > > > > > append to the new end
> > > > > >
> > > > > > A normal file with a normal write should write to the FD's current
> > > > > > seek pointer.
> > > > > >
> > > > > > I'm not sure what happens if you write via mmap/msync.
> > > > > >
> > > > > > RDMA is similar to pwrite() and mmap.
> > > > >
> > > > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > > > the file offset.  So there's no user-visible change in behaviour.
> > > >
> > > > ...but there is. The blocks you thought you freed, especially if the
> > > > system was under -ENOSPC pressure, won't actually be free after the
> > > > successful ftruncate().
> > >
> > > They won't be free after something dirties the existing mmap either.
> > >
> > > Blocks also won't be free if you unlink a file that is currently still
> > > open.
> > >
> > > This isn't really new behavior for a FS.
> > 
> > An mmap write after a fault due to a hole punch is free to trigger
> > SIGBUS if the subsequent page allocation fails.
> 
> Isn't that already racy? If the mmap user is fast enough can't it
> prevent the page from becoming freed in the first place today?

No, it cannot. We block page faulting for the file (via a lock), tear down
page tables, and free pages and blocks. Then we resume faults and either
return SIGBUS (if the page ends up being after the new end of file, in the
truncate case) or do a new page fault and fresh block allocation (which can
end with SIGBUS if the filesystem cannot allocate a new block to back the
page).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 21:02                                             ` Dan Williams
  2019-02-11 21:09                                               ` Jason Gunthorpe
@ 2019-02-12 16:36                                               ` Christopher Lameter
  2019-02-12 16:44                                                 ` Jan Kara
  1 sibling, 1 reply; 106+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Matthew Wilcox, Ira Weiny, Jan Kara,
	Dave Chinner, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Mon, 11 Feb 2019, Dan Williams wrote:

> An mmap write after a fault due to a hole punch is free to trigger
> SIGBUS if the subsequent page allocation fails. So no, I don't see
> them as the same unless you're allowing for the holder of the MR to
> receive a re-fault failure.

Order-0 page allocation failures are generally not possible in that path.
The system will reclaim and OOM before that happens.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 22:33                                         ` John Hubbard
@ 2019-02-12 16:39                                           ` Christopher Lameter
  2019-02-13  2:58                                             ` John Hubbard
  0 siblings, 1 reply; 106+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
	Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, Jerome Glisse, Michal Hocko

On Mon, 11 Feb 2019, John Hubbard wrote:

> But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
> my head--if we actually go that way, and prevent the caller from setting up
> a problematic gup pin in the first place, then that may make this point sort
> of moot.

OK, well, can we document how we think it would work somewhere? Long-term
mapping a page cache page could be a problem, and we need to explain that
somewhere.

> > i.e. indicate to the FS that it should not attempt to remap physical
> > memory addresses backing this VMA. If the FS can't do that, it must
> > fail.
> >
>
> Yes. Duration is probably less important than the fact that the page
> is specially treated.

Yup.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-12 16:36                                               ` Christopher Lameter
@ 2019-02-12 16:44                                                 ` Jan Kara
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kara @ 2019-02-12 16:44 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Dan Williams, Jason Gunthorpe, Matthew Wilcox, Ira Weiny,
	Jan Kara, Dave Chinner, Doug Ledford, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Tue 12-02-19 16:36:36, Christopher Lameter wrote:
> On Mon, 11 Feb 2019, Dan Williams wrote:
> 
> > An mmap write after a fault due to a hole punch is free to trigger
> > SIGBUS if the subsequent page allocation fails. So no, I don't see
> > them as the same unless you're allowing for the holder of the MR to
> > receive a re-fault failure.
> 
> Order 0 page allocation failures are generally not possible in that path.
> System will reclaim and OOM before that happens.

But block allocation can also fail in the filesystem, or you can have memcgs
set up that make the page allocation fail, can't you? So in principle Dan
is right: page faults can and do fail...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-12 16:34                                                 ` Jan Kara
@ 2019-02-12 16:55                                                   ` Christopher Lameter
  2019-02-13 15:06                                                     ` Jan Kara
  0 siblings, 1 reply; 106+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Ira Weiny,
	Dave Chinner, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Tue, 12 Feb 2019, Jan Kara wrote:

> > Isn't that already racy? If the mmap user is fast enough can't it
> > prevent the page from becoming freed in the first place today?
>
> No, it cannot. We block page faulting for the file (via a lock), tear down
> page tables, free pages and blocks. Then we resume faults and return
> SIGBUS (if the page ends up being after the new end of file in case of
> truncate) or do new page fault and fresh block allocation (which can end
> with SIGBUS if the filesystem cannot allocate new block to back the page).

Well, that is already pretty inconsistent behavior. Under what conditions
does the SIGBUS occur without a new fault attempt?

If a new fault is attempted then we have resource constraints that could
have caused a SIGBUS independently of the truncate. So that case is not
really something special to be considered for truncation.

So the only concern left is to figure out under what conditions SIGBUS
occurs with a racing truncate (if at all) if there are sufficient
resources to complete the page fault.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-12 16:07                                 ` Jan Kara
@ 2019-02-12 21:53                                   ` Dan Williams
  0 siblings, 0 replies; 106+ messages in thread
From: Dan Williams @ 2019-02-12 21:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
	Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Tue, Feb 12, 2019 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 11-02-19 09:22:58, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > > One approach that may be a clean way to solve this:
> > > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > > >    provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > > >    on the longterm pinned range until the long term pin is removed.
> > > > > >
> > > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > > demand during writes?
> > > > > >
> > > > > > IOWs, this requires the application to set up the file in the
> > > > > > correct state for the filesystem to lock it down so somebody else
> > > > > > can write to it.  That means the file can't be sparse, it can't be
> > > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > > written to its full size before being shared because otherwise it
> > > > > > exposes stale data to the remote client (secure sites are going to
> > > > > > love that!), they can't be extended, etc.
> > > > > >
> > > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > > an immutable for the purposes of local access.
> > > > > >
> > > > > > Which, essentially we can already do. Prep the file, map it
> > > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > > interface which can do the necessary checks.
> > > > >
> > > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > > will be a source of reflink? That seems to be currently allowed for
> > > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > > similarity seems to be quite large there. What do you think?
> > > >
> > > > This sounds so familiar...
> > > >
> > > >     https://lwn.net/Articles/726481/
> > > >
> > > > I'm not opposed to trying again, but leases was what crawled out
> > > > smoking crater when this last proposal was nuked.
> > >
> > > Umm, don't think this is that similar to daxctl() discussion. We are not
> > > speaking about providing any new userspace API for this.
> >
> > I thought explicit userspace API was one of the outcomes, i.e. that we
> > can't depend on this behavior being an implicit side effect of a page
> > pin?
>
> I was thinking an implicit sideeffect of gup_longterm() call. Similarly as
> swapon(2) does not require the file to be marked in any special way. But
> OTOH I agree that RDMA is a less controlled usage than swapon so it is
> questionable. I'd still require something like CAP_LINUX_IMMUTABLE at least
> for gup_longterm() calls that end up pinning the file.
>
> Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
> will succeed only if there is FL_LAYOUT lease for the range being pinned
> and we don't allow the lease to be released until there's a pinned page in
> the range. And we make the file protected (i.e. treat it like swapfile) if
> there's any such lease in it. But this is just a rough sketch and needs more
> thinking.
>
> > > Also I think the
> > > situation about leases has somewhat cleared up with this discussion - ODP
> > > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > > hardware it is difficult to handle leases as such hardware has only one big
> > > kill-everything call and using that would effectively mean lot of work on
> > > the userspace side to resetup everything to make things useful if workable
> > > at all.
> > >
> > > So my proposal would be:
> > >
> > > 1) ODP hardware uses gup_fast() like direct I/O and uses MMU notifiers to do
> > > its teardown when fs needs it.
> > >
> > > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > > to use gup_longterm() (we may actually rename it to a more suitable name).
> > > FS may just refuse such calls (for normal page cache backed file, it will
> > > just return success but for DAX file it will do sanity checks whether the
> > > file is fully allocated etc. like we currently do for swapfiles) but if
> > > gup_longterm() returns success, it will provide the same guarantees as for
> > > swapfiles. So the only thing that we need is some call from gup_longterm()
> > > to a filesystem callback to tell it - this file is going to be used by a
> > > third party as an IO buffer, don't touch it. And we can (and should)
> > > probably refactor the handling to be shared between swapfiles and
> > > gup_longterm().
> >
> > Yes, lets pursue this. At the risk of "arguing past 'yes'" this is a
> > solution I thought we dax folks walked away from in the original
> > MAP_DIRECT discussion [1]. Here is where leases were the response to
> > MAP_DIRECT [2]. ...and here is where we had tame discussions about
> > implications of notifying memory-registrations of lease break events
> > [3].
>
> Yeah, thanks for the references.
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> So with requiring lease for gup_longterm() to succeed (and the
> FALLOC_FL_PUNCH_HOLE failure being keyed from the existence of such lease),
> does it look more reasonable to you?

That sounds reasonable to me, just the small matter of teaching the
non-ODP RDMA ecosystem to take out FL_LAYOUT leases and do something
reasonable when the lease needs to be recalled.

I would hope that RDMA-to-FSDAX-PMEM support is enough motivation to
either make the necessary application changes, or switch to an
ODP-capable adapter.

Note that I think we need FL_LAYOUT regardless of whether the
legacy-RDMA stack ever takes advantage of it. VFIO device passthrough
to a guest that has a host DAX file mapped as physical PMEM in the
guest needs guarantees that the guest will be killed and DMA forcibly
blocked by the IOMMU if someone punches a hole in memory in use by the
guest, or otherwise have a paravirtualized driver in the guest to
coordinate what effectively looks like a physical memory unplug event.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-12 16:39                                           ` Christopher Lameter
@ 2019-02-13  2:58                                             ` John Hubbard
  0 siblings, 0 replies; 106+ messages in thread
From: John Hubbard @ 2019-02-13  2:58 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jason Gunthorpe, Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
	Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, Jerome Glisse, Michal Hocko

On 2/12/19 8:39 AM, Christopher Lameter wrote:
> On Mon, 11 Feb 2019, John Hubbard wrote:
> 
>> But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
>> my head--if we actually go that way, and prevent the caller from setting up
>> a problematic gup pin in the first place, then that may make this point sort
>> of moot.
> 
> OK, well, can we document how we think it would work somewhere? Long-term
> mapping a page cache page could be a problem and we need to explain that
> somewhere.
> 

Yes, once the dust settles, I think Documentation/vm/get_user_pages.rst is the
right place. I started to create that file, but someone observed that my initial
content was entirely backward-looking (described the original problem, instead 
of describing how the new system would work). So I'll use this opportunity for 
a do-over. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-12 16:55                                                   ` Christopher Lameter
@ 2019-02-13 15:06                                                     ` Jan Kara
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Kara @ 2019-02-13 15:06 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Jan Kara, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
	Ira Weiny, Dave Chinner, Doug Ledford, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
	Michal Hocko

On Tue 12-02-19 16:55:21, Christopher Lameter wrote:
> On Tue, 12 Feb 2019, Jan Kara wrote:
> 
> > > Isn't that already racy? If the mmap user is fast enough can't it
> > > prevent the page from becoming freed in the first place today?
> >
> > No, it cannot. We block page faulting for the file (via a lock), tear down
> > page tables, free pages and blocks. Then we resume faults and return
> > SIGBUS (if the page ends up being after the new end of file in case of
> > truncate) or do new page fault and fresh block allocation (which can end
> > with SIGBUS if the filesystem cannot allocate new block to back the page).
> 
> Well that is already pretty inconsistent behavior. Under what conditions
> is the SIGBUS occurring without the new fault attempt?

I probably didn't express myself clearly enough. I didn't say that SIGBUS
can occur without a page fault. The evaluation of whether a page would be
beyond EOF, page allocation, and block allocation happen only in response
to a page fault...

> If a new fault is attempted then we have resource constraints that could
> have caused a SIGBUS independently of the truncate. So that case is not
> really something special to be considered for truncation.

Agreed. I was just reacting to Jason's question whether an application
cannot prevent page freeing by being aggressive enough.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-11 18:06                                 ` Jason Gunthorpe
                                                     ` (2 preceding siblings ...)
  2019-02-12 16:28                                   ` Jan Kara
@ 2019-02-14 20:26                                   ` Jerome Glisse
  2019-02-14 20:50                                     ` Matthew Wilcox
  3 siblings, 1 reply; 106+ messages in thread
From: Jerome Glisse @ 2019-02-14 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
	Doug Ledford, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> 
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> 
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
> 
> But it also doesn't truncate/create a hole. Another thread wrote to it
> right away and the 'hole' was essentially instantly reallocated. This
> is an inherent, pre-existing, race in the ftruncate/etc APIs.

So this is kind of a parallel point, but direct I/O does "truncate" pages
too: after a direct I/O write, invalidate_inode_pages2_range() is called,
and it tries to unmap and remove from the page cache all pages that were
written to.

So we probably want to think about what we want to do here if a device
like RDMA has also pinned those pages. Do we want to abort the
invalidation? That would mean the direct I/O write was a pointless
exercise. Do we want to skip direct I/O and instead memcpy into page
cache memory? Then we are ignoring the direct I/O property of the write.
Or do we want to both write directly to the block and also memcpy to the
page, so that we preserve the direct I/O semantics?

I would probably go with the last one. In any case we will need to
update the direct I/O code to handle GUP-pinned page cache pages.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-14 20:26                                   ` Jerome Glisse
@ 2019-02-14 20:50                                     ` Matthew Wilcox
  2019-02-14 21:39                                       ` Jerome Glisse
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-14 20:50 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing, race in the ftruncate/etc APIs.
> 
> So this is kind of a parallel point, but direct I/O does "truncate" pages
> too: after a direct I/O write, invalidate_inode_pages2_range() is called,
> and it tries to unmap and remove from the page cache all pages that were
> written to.

Hang on.  Pages are tossed out of the page cache _before_ an O_DIRECT
write starts.  The only way what you're describing can happen is if
there's a race between an O_DIRECT writer and an mmap.  Which is either
an incredibly badly written application or someone trying an exploit.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-14 20:50                                     ` Matthew Wilcox
@ 2019-02-14 21:39                                       ` Jerome Glisse
  2019-02-15  1:19                                         ` Dave Chinner
  0 siblings, 1 reply; 106+ messages in thread
From: Jerome Glisse @ 2019-02-14 21:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
	Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Thu, Feb 14, 2019 at 12:50:49PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> > On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > > But it also doesn't truncate/create a hole. Another thread wrote to it
> > > right away and the 'hole' was essentially instantly reallocated. This
> > > is an inherent, pre-existing, race in the ftruncate/etc APIs.
> > 
> > So this is kind of a parallel point, but direct I/O does "truncate" pages
> > too: after a direct I/O write, invalidate_inode_pages2_range() is called,
> > and it tries to unmap and remove from the page cache all pages that were
> > written to.
> 
> Hang on.  Pages are tossed out of the page cache _before_ an O_DIRECT
> write starts.  The only way what you're describing can happen is if
> there's a race between an O_DIRECT writer and an mmap.  Which is either
> an incredibly badly written application or someone trying an exploit.

I believe they are tossed after O_DIRECT starts (dio_complete). But
regardless, the issue is that RDMA can have pinned the pages long
before the DIO, in which case the pages cannot be tossed from the page
cache, and whatever is written to the block device will be discarded
once RDMA unpins the pages. So we would end up in the code path that
spits out a big error message in the kernel log.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-14 21:39                                       ` Jerome Glisse
@ 2019-02-15  1:19                                         ` Dave Chinner
  2019-02-15 15:42                                           ` Christopher Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Dave Chinner @ 2019-02-15  1:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Matthew Wilcox, Jason Gunthorpe, Dan Williams, Jan Kara,
	Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Thu, Feb 14, 2019 at 04:39:22PM -0500, Jerome Glisse wrote:
> On Thu, Feb 14, 2019 at 12:50:49PM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> > > On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > > > But it also doesn't truncate/create a hole. Another thread wrote to it
> > > > right away and the 'hole' was essentially instantly reallocated. This
> > > > is an inherent, pre-existing, race in the ftruncate/etc APIs.
> > > 
> > > So this is kind of a parallel point, but direct I/O does "truncate" pages
> > > too: after a direct I/O write, invalidate_inode_pages2_range() is called,
> > > and it tries to unmap and remove from the page cache all pages that were
> > > written to.
> > 
> > Hang on.  Pages are tossed out of the page cache _before_ an O_DIRECT
> > write starts.  The only way what you're describing can happen is if
> > there's a race between an O_DIRECT writer and an mmap.  Which is either
> > an incredibly badly written application or someone trying an exploit.
> 
> I believe they are tossed after O_DIRECT starts (dio_complete). But

Yes, but also before. See iomap_dio_rw() and
generic_file_direct_write().

> regardless, the issue is that RDMA can have pinned the pages long
> before the DIO, in which case the pages cannot be tossed from the page
> cache, and whatever is written to the block device will be discarded
> once RDMA unpins the pages. So we would end up in the code path that
> spits out a big error message in the kernel log.

Which tells us filesystem people that the applications are doing
something that _will_ cause data corruption and hence not to spend
any time triaging data corruption reports because it's not a
filesystem bug that caused it.

See open(2):

	Applications should avoid mixing O_DIRECT and normal I/O to
	the same file, and especially to overlapping byte regions in
	the same file.  Even when the filesystem correctly handles
	the coherency issues in this situation, overall I/O
	throughput is likely to be slower than using either mode
	alone.  Likewise, applications should avoid mixing mmap(2)
	of files with direct I/O to the same files.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15  1:19                                         ` Dave Chinner
@ 2019-02-15 15:42                                           ` Christopher Lameter
  2019-02-15 18:08                                             ` Matthew Wilcox
  0 siblings, 1 reply; 106+ messages in thread
From: Christopher Lameter @ 2019-02-15 15:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, Matthew Wilcox, Jason Gunthorpe, Dan Williams,
	Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, 15 Feb 2019, Dave Chinner wrote:

> Which tells us filesystem people that the applications are doing
> something that _will_ cause data corruption and hence not to spend
> any time triaging data corruption reports because it's not a
> filesystem bug that caused it.
>
> See open(2):
>
> 	Applications should avoid mixing O_DIRECT and normal I/O to
> 	the same file, and especially to overlapping byte regions in
> 	the same file.  Even when the filesystem correctly handles
> 	the coherency issues in this situation, overall I/O
> 	throughput is likely to be slower than using either mode
> 	alone.  Likewise, applications should avoid mixing mmap(2)
> 	of files with direct I/O to the same files.

Since RDMA is something similar: Can we say that a file that is used for
RDMA should not use the page cache?

And can we enforce this in the future? I.e. have some file state that says
that this file is direct/RDMA or contains long-term pinning, and thus
allows only certain types of operations to ensure data consistency?

If we cannot enforce it, then we may want to emit a warning?


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 15:42                                           ` Christopher Lameter
@ 2019-02-15 18:08                                             ` Matthew Wilcox
  2019-02-15 18:31                                               ` Christopher Lameter
  0 siblings, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2019-02-15 18:08 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Dave Chinner, Jerome Glisse, Jason Gunthorpe, Dan Williams,
	Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, Feb 15, 2019 at 03:42:02PM +0000, Christopher Lameter wrote:
> On Fri, 15 Feb 2019, Dave Chinner wrote:
> 
> > Which tells us filesystem people that the applications are doing
> > something that _will_ cause data corruption and hence not to spend
> > any time triaging data corruption reports because it's not a
> > filesystem bug that caused it.
> >
> > See open(2):
> >
> > 	Applications should avoid mixing O_DIRECT and normal I/O to
> > 	the same file, and especially to overlapping byte regions in
> > 	the same file.  Even when the filesystem correctly handles
> > 	the coherency issues in this situation, overall I/O
> > 	throughput is likely to be slower than using either mode
> > 	alone.  Likewise, applications should avoid mixing mmap(2)
> > 	of files with direct I/O to the same files.
> 
> Since RDMA is something similar: Can we say that a file that is used for
> RDMA should not use the page cache?

That makes no sense.  The page cache is the standard synchronisation point
for filesystems and processes.  The only problems come in for the things
which bypass the page cache like O_DIRECT and DAX.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 18:08                                             ` Matthew Wilcox
@ 2019-02-15 18:31                                               ` Christopher Lameter
  2019-02-15 22:00                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 106+ messages in thread
From: Christopher Lameter @ 2019-02-15 18:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Jerome Glisse, Jason Gunthorpe, Dan Williams,
	Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, 15 Feb 2019, Matthew Wilcox wrote:

> > Since RDMA is something similar: Can we say that a file that is used for
> > RDMA should not use the page cache?
>
> That makes no sense.  The page cache is the standard synchronisation point
> for filesystems and processes.  The only problems come in for the things
> which bypass the page cache like O_DIRECT and DAX.

It makes a lot of sense since the filesystems play COW etc games with the
pages and RDMA is very much like O_DIRECT in that the pages are modified
directly under I/O. It also bypasses the page cache in case you have
not noticed yet.

Both filesystems and RDMA acting on the page cache at
the same time leads to the mess that we are trying to solve.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 18:31                                               ` Christopher Lameter
@ 2019-02-15 22:00                                                 ` Jason Gunthorpe
  2019-02-15 23:38                                                   ` Ira Weiny
  0 siblings, 1 reply; 106+ messages in thread
From: Jason Gunthorpe @ 2019-02-15 22:00 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Matthew Wilcox, Dave Chinner, Jerome Glisse, Dan Williams,
	Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
	Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> 
> > > Since RDMA is something similar: Can we say that a file that is used for
> > > RDMA should not use the page cache?
> >
> > That makes no sense.  The page cache is the standard synchronisation point
> > for filesystems and processes.  The only problems come in for the things
> > which bypass the page cache like O_DIRECT and DAX.
> 
> It makes a lot of sense since the filesystems play COW etc games with the
> pages and RDMA is very much like O_DIRECT in that the pages are modified
> directly under I/O. It also bypasses the page cache in case you have
> not noticed yet.

It is quite different, O_DIRECT modifies the physical blocks on the
storage, bypassing the memory copy. RDMA modifies the memory copy.

Pages are necessary to do RDMA, and those pages have to be flushed to
disk. So I'm not seeing how it can be disconnected from the page
cache?

Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 22:00                                                 ` Jason Gunthorpe
@ 2019-02-15 23:38                                                   ` Ira Weiny
  2019-02-16 22:42                                                     ` Dave Chinner
  2019-02-17  2:54                                                     ` Christopher Lameter
  0 siblings, 2 replies; 106+ messages in thread
From: Ira Weiny @ 2019-02-15 23:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christopher Lameter, Matthew Wilcox, Dave Chinner, Jerome Glisse,
	Dan Williams, Jan Kara, Doug Ledford, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, Feb 15, 2019 at 03:00:31PM -0700, Jason Gunthorpe wrote:
> On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> > On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> > 
> > > > Since RDMA is something similar: Can we say that a file that is used for
> > > > RDMA should not use the page cache?
> > >
> > > That makes no sense.  The page cache is the standard synchronisation point
> > > for filesystems and processes.  The only problems come in for the things
> > > which bypass the page cache like O_DIRECT and DAX.
> > 
> > It makes a lot of sense since the filesystems play COW etc games with the
> > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > directly under I/O. It also bypasses the page cache in case you have
> > not noticed yet.
> 
> It is quite different, O_DIRECT modifies the physical blocks on the
> storage, bypassing the memory copy.
>

Really?  I thought O_DIRECT allowed the block drivers to write to/from user
space buffers.  But the _storage_ was still under the control of the block
drivers?

>
> RDMA modifies the memory copy.
> 
> pages are necessary to do RDMA, and those pages have to be flushed to
> disk. So I'm not seeing how it can be disconnected from the page
> cache?

I don't disagree with this.

Ira

> 
> Jason

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 23:38                                                   ` Ira Weiny
@ 2019-02-16 22:42                                                     ` Dave Chinner
  2019-02-17  2:54                                                     ` Christopher Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Dave Chinner @ 2019-02-16 22:42 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Christopher Lameter, Matthew Wilcox,
	Jerome Glisse, Dan Williams, Jan Kara, Doug Ledford, lsf-pc,
	linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
	Michal Hocko

On Fri, Feb 15, 2019 at 03:38:29PM -0800, Ira Weiny wrote:
> On Fri, Feb 15, 2019 at 03:00:31PM -0700, Jason Gunthorpe wrote:
> > On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> > > On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> > > 
> > > > > Since RDMA is something similar: Can we say that a file that is used for
> > > > > RDMA should not use the page cache?
> > > >
> > > > That makes no sense.  The page cache is the standard synchronisation point
> > > > for filesystems and processes.  The only problems come in for the things
> > > > which bypass the page cache like O_DIRECT and DAX.
> > > 
> > > It makes a lot of sense since the filesystems play COW etc games with the
> > > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > > directly under I/O. It also bypasses the page cache in case you have
> > > not noticed yet.
> > 
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
> >
> 
> Really?  I thought O_DIRECT allowed the block drivers to write to/from user
> space buffers.  But the _storage_ was still under the control of the block
> drivers?

Yup, in a nutshell. Even O_DIRECT on DAX doesn't modify the physical
storage directly - it ends up in the pmem driver and it does a
memcpy() to move the data to/from the physical storage and the user
space buffer. It's exactly the same IO path as moving data to/from
the physical storage into the page cache pages....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
  2019-02-15 23:38                                                   ` Ira Weiny
  2019-02-16 22:42                                                     ` Dave Chinner
@ 2019-02-17  2:54                                                     ` Christopher Lameter
  1 sibling, 0 replies; 106+ messages in thread
From: Christopher Lameter @ 2019-02-17  2:54 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jason Gunthorpe, Matthew Wilcox, Dave Chinner, Jerome Glisse,
	Dan Williams, Jan Kara, Doug Ledford, lsf-pc, linux-rdma,
	Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko

On Fri, 15 Feb 2019, Ira Weiny wrote:

> > > > for filesystems and processes.  The only problems come in for the things
> > > > which bypass the page cache like O_DIRECT and DAX.
> > >
> > > It makes a lot of sense since the filesystems play COW etc games with the
> > > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > > directly under I/O. It also bypasses the page cache in case you have
> > > not noticed yet.
> >
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
> >
>
> Really?  I thought O_DIRECT allowed the block drivers to write to/from user
> space buffers.  But the _storage_ was still under the control of the block
> drivers?

It depends on what you see as the modification target. O_DIRECT uses
memory as a target and source like RDMA. The block device is at the other
end of the handling.

> > RDMA modifies the memory copy.
> >
> > pages are necessary to do RDMA, and those pages have to be flushed to
> > disk. So I'm not seeing how it can be disconnected from the page
> > cache?
>
> I don't disagree with this.

RDMA does direct access to memory. If that memory is an mmap of a regular
block device then we have a problem (this has not been a standard use case
to my knowledge). The semantics are simply different. RDMA expects memory
to be pinned and to always be readable and writable. The block
device/filesystem expects memory access to be controllable via page
permissions. In particular, access to a page needs to be able to be
stopped.

This is fundamentally incompatible. RDMA access to such an mmapped section
must preserve the RDMA semantics while the pinning is in place, and can
only provide the access control after RDMA is finished. Pages in the RDMA
range cannot be handled like normal page cache pages.

This is particularly evident in the DAX case, in which we have direct pass
through all the way to the storage medium. And in this case write-through
can replace the page cache.

^ permalink raw reply	[flat|nested] 106+ messages in thread

end of thread, other threads:[~2019-02-17  2:54 UTC | newest]

Thread overview: 106+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-05 17:50 [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA Ira Weiny
2019-02-05 18:01 ` Ira Weiny
2019-02-06 21:31   ` Dave Chinner
2019-02-06  9:50 ` Jan Kara
2019-02-06 17:31   ` Jason Gunthorpe
2019-02-06 17:52     ` Matthew Wilcox
2019-02-06 18:32       ` Doug Ledford
2019-02-06 18:35         ` Matthew Wilcox
2019-02-06 18:44           ` Doug Ledford
2019-02-06 18:52           ` Jason Gunthorpe
2019-02-06 19:45             ` Dan Williams
2019-02-06 20:14               ` Doug Ledford
2019-02-06 21:04                 ` Dan Williams
2019-02-06 21:12                   ` Doug Ledford
2019-02-06 19:16         ` Christopher Lameter
2019-02-06 19:40           ` Matthew Wilcox
2019-02-06 20:16             ` Doug Ledford
2019-02-06 20:20               ` Matthew Wilcox
2019-02-06 20:28                 ` Doug Ledford
2019-02-06 20:41                   ` Matthew Wilcox
2019-02-06 20:47                     ` Doug Ledford
2019-02-06 20:49                       ` Matthew Wilcox
2019-02-06 20:50                         ` Doug Ledford
2019-02-06 20:31                 ` Jason Gunthorpe
2019-02-06 20:39                 ` Christopher Lameter
2019-02-06 20:54                 ` Doug Ledford
2019-02-07 16:48                   ` Jan Kara
2019-02-06 20:24             ` Christopher Lameter
2019-02-06 21:03           ` Dave Chinner
2019-02-06 22:08             ` Jason Gunthorpe
2019-02-06 22:24               ` Doug Ledford
2019-02-06 22:44                 ` Dan Williams
2019-02-06 23:21                   ` Jason Gunthorpe
2019-02-06 23:30                     ` Dan Williams
2019-02-06 23:41                       ` Jason Gunthorpe
2019-02-07  0:22                         ` Dan Williams
2019-02-07  5:33                           ` Jason Gunthorpe
2019-02-07  1:57                   ` Doug Ledford
2019-02-07  2:48                     ` Dan Williams
2019-02-07  2:42                   ` Doug Ledford
2019-02-07  3:13                     ` Dan Williams
2019-02-07 17:23                       ` Ira Weiny
2019-02-07 16:25                   ` Doug Ledford
2019-02-07 16:55                     ` Christopher Lameter
2019-02-07 17:35                       ` Ira Weiny
2019-02-07 18:17                         ` Christopher Lameter
2019-02-08  4:43                       ` Dave Chinner
2019-02-08 11:10                         ` Jan Kara
2019-02-08 20:50                           ` Dan Williams
2019-02-11 10:24                             ` Jan Kara
2019-02-11 17:22                               ` Dan Williams
2019-02-11 18:06                                 ` Jason Gunthorpe
2019-02-11 18:15                                   ` Dan Williams
2019-02-11 18:19                                   ` Ira Weiny
2019-02-11 18:26                                     ` Jason Gunthorpe
2019-02-11 18:40                                       ` Matthew Wilcox
2019-02-11 19:58                                         ` Dan Williams
2019-02-11 20:49                                           ` Jason Gunthorpe
2019-02-11 21:02                                             ` Dan Williams
2019-02-11 21:09                                               ` Jason Gunthorpe
2019-02-12 16:34                                                 ` Jan Kara
2019-02-12 16:55                                                   ` Christopher Lameter
2019-02-13 15:06                                                     ` Jan Kara
2019-02-12 16:36                                               ` Christopher Lameter
2019-02-12 16:44                                                 ` Jan Kara
2019-02-11 21:08                                     ` Jerome Glisse
2019-02-11 21:22                                     ` John Hubbard
2019-02-11 22:12                                       ` Jason Gunthorpe
2019-02-11 22:33                                         ` John Hubbard
2019-02-12 16:39                                           ` Christopher Lameter
2019-02-13  2:58                                             ` John Hubbard
2019-02-12 16:28                                   ` Jan Kara
2019-02-14 20:26                                   ` Jerome Glisse
2019-02-14 20:50                                     ` Matthew Wilcox
2019-02-14 21:39                                       ` Jerome Glisse
2019-02-15  1:19                                         ` Dave Chinner
2019-02-15 15:42                                           ` Christopher Lameter
2019-02-15 18:08                                             ` Matthew Wilcox
2019-02-15 18:31                                               ` Christopher Lameter
2019-02-15 22:00                                                 ` Jason Gunthorpe
2019-02-15 23:38                                                   ` Ira Weiny
2019-02-16 22:42                                                     ` Dave Chinner
2019-02-17  2:54                                                     ` Christopher Lameter
2019-02-12 16:07                                 ` Jan Kara
2019-02-12 21:53                                   ` Dan Williams
2019-02-08 21:20                           ` Dave Chinner
2019-02-08 15:33                         ` Christopher Lameter
2019-02-07 17:24                     ` Matthew Wilcox
2019-02-07 17:26                       ` Jason Gunthorpe
2019-02-07  3:52                 ` Dave Chinner
2019-02-07  5:23                   ` Jason Gunthorpe
2019-02-07  6:00                     ` Dan Williams
2019-02-07 17:17                       ` Jason Gunthorpe
2019-02-07 23:54                         ` Dan Williams
2019-02-08  1:44                           ` Ira Weiny
2019-02-08  5:19                           ` Jason Gunthorpe
2019-02-08  7:20                             ` Dan Williams
2019-02-08 15:42                               ` Jason Gunthorpe
2019-02-07 15:04                     ` Chuck Lever
2019-02-07 15:28                       ` Tom Talpey
2019-02-07 15:37                         ` Doug Ledford
2019-02-07 15:41                           ` Tom Talpey
2019-02-07 15:56                             ` Doug Ledford
2019-02-07 16:57                         ` Ira Weiny
2019-02-07 21:31                           ` Tom Talpey
2019-02-07 16:54                     ` Ira Weiny
