* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Dan Williams @ 2019-02-06 22:44 UTC (permalink / raw)
To: Doug Ledford
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > >
> > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > a different file.
> > > > >
> > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > listed as a return value of truncate, so extending the API to include
> > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > (or should not be) totally out of the question.
> > > > >
> > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > portion where that alternative was ruled out?
> > > >
> > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > about where there was an mmap with the O_DAX option to do direct write
> > > > through? If we only allow this use case then we may not have to worry about
> > > > long term GUP because DAX mapped files will stay in the physical location
> > > > regardless.
> > >
> > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > writes will physically move the data to a new physical location.
> > > This is non-negotiable, and cannot be blocked forever by a gup
> > > pin.
> > >
> > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > the filesystem can move data physically on write access, and b)
> > > revokable file leases so that the filesystem can kick userspace out
> > > of the way when it needs to.
> >
> > Why do we need both? You want to have leases for normal CPU mmaps too?
> >
> > > Truncate is a red herring. It's definitely a case for revokable
> > > leases, but it's the rare case rather than the one we actually care
> > > about. We really care about making copy-on-write capable filesystems like
> > > XFS work with DAX (we've got people asking for it to be supported
> > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > can change physical location at any time.
> >
> > Then we must continue to ban longterm pin with DAX..
> >
> > Nobody is going to want to deploy a system where revoke can happen at
> > any time and if you don't respond fast enough your system either locks
> > with some kind of FS meltdown or your process gets SIGKILL.
> >
> > I don't really see a reason to invest so much design work into
> > something that isn't production worthy.
> >
> > It *almost* made sense with ftruncate, because you could architect to
> > avoid ftruncate.. But just any FS op might reallocate? Naw.
> >
> > Dave, you said the FS is responsible to arbitrate access to the
> > physical pages..
> >
> > Is it possible to have a filesystem for DAX that is more suited to
> > this environment? Ie designed to not require block reallocation (no
> > COW, no reflinks, different approach to ftruncate, etc)
>
> Can someone give me a real world scenario that someone is *actually*
> asking for with this?
I'll point to this example. At the 6:35 mark Kodi talks about the
Oracle use case for DAX + RDMA.
https://youtu.be/ywKPPIE8JfQ?t=395
Currently the only way to get this to work is to use ODP-capable
hardware, or Device-DAX. Device-DAX is a facility to map persistent
memory statically through a device file. It's great for statically
allocated use cases, but loses all the nice things (provisioning,
permissions, naming) that a filesystem gives you. This debate is about
what to do about non-ODP-capable hardware and the Filesystem-DAX
facility. The current answer is "no RDMA for you".
> Are DAX users demanding xfs, or is it just the
> filesystem of convenience?
xfs is the only Linux filesystem that supports DAX and reflink.
> Do they need to stick with xfs?
Can you clarify the motivation for that question? This problem exists
for any filesystem that implements an mmap where the physical
page backing the mapping is identical to the physical storage location
for the file data. I don't see it as an xfs-specific problem. Rather,
xfs is taking the lead in this space because it has already deployed
and demonstrated that leases work for the pnfs4 block-server case, so
it seems logical to attempt to extend that case for non-ODP-RDMA.
> Are they
> really trying to do COW backed mappings for the RDMA targets? Or do
> they want a COW backed FS but are perfectly happy if the specific RDMA
> targets are *not* COW and are statically allocated?
I would expect the COW to be broken at registration time. Only ODP
could possibly support reflink + RDMA. So I think this devolves the
problem back to just the "what to do about truncate/punch-hole"
problem in the specific case of non-ODP hardware combined with the
Filesystem-DAX facility.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Jason Gunthorpe @ 2019-02-06 23:21 UTC (permalink / raw)
To: Dan Williams
Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data.
.. and needs to dynamically change that mapping. Which is not really
something inherent to the general idea of a filesystem. A file system
that had *strictly static* block assignments would work fine.
Not all filesystems even implement hole punch.
Not all filesystems implement reflink.
ftruncate doesn't *have* to instantly return the free blocks to the
allocation pool.
i.e. this is not a DAX & RDMA issue but an XFS & RDMA issue.
Replacing XFS is probably not reasonable, but I wonder if an XFS--
operating mode could exist that had enough features removed to be
safe?
I.e. turn off REFLINK. Change the semantics of ftruncate to be more like
ETXTBUSY. Turn off hole punch.
> > Are they really trying to do COW backed mappings for the RDMA
> > targets? Or do they want a COW backed FS but are perfectly happy
> > if the specific RDMA targets are *not* COW and are statically
> > allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.
Usually the problem with COW is that you make a READ RDMA MR on a
COW'd file, and some other thread breaks the COW..
This probably becomes a problem if the same process that has the MR
triggers a COW break (ie by writing to the CPU mmap). This would cause
the page to be reassigned, but the MR would not be updated, which is
not what the app expects.
WRITE is simpler: once the COW is broken during GUP, the pages cannot
be COW'd again until the DMA pin is released. So new reflinks would be
blocked during the DMA pin period.
To fix READ you'd have to treat it like WRITE and break the COW at GUP.
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Dan Williams @ 2019-02-06 23:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
>
> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question? This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data.
>
> .. and needs to dynamically change that mapping. Which is not really
> something inherent to the general idea of a filesystem. A file system
> that had *strictly static* block assignments would work fine.
>
> Not all filesystems even implement hole punch.
>
> Not all filesystems implement reflink.
>
> ftruncate doesn't *have* to instantly return the free blocks to the
> allocation pool.
>
> i.e. this is not a DAX & RDMA issue but an XFS & RDMA issue.
>
> Replacing XFS is probably not reasonable, but I wonder if an XFS--
> operating mode could exist that had enough features removed to be
> safe?
You're describing the current situation, i.e. Linux already implements
this: it's called Device-DAX, and some users of RDMA find it
insufficient. The choices are to continue to tell them "no", or say
"yes, but you need to submit to lease coordination".
> Ie turn off REFLINK. Change the semantic of ftruncate to be more like
> ETXTBUSY. Turn off hole punch.
>
> > > Are they really trying to do COW backed mappings for the RDMA
> > > targets? Or do they want a COW backed FS but are perfectly happy
> > > if the specific RDMA targets are *not* COW and are statically
> > > allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> Usually the problem with COW is that you make a READ RDMA MR on a
> COW'd file, and some other thread breaks the COW..
>
> This probably becomes a problem if the same process that has the MR
> triggers a COW break (ie by writing to the CPU mmap). This would cause
> the page to be reassigned but the MR would not be updated, which is
> not what the app expects.
>
> WRITE is simpler, once the COW is broken during GUP, the pages cannot
> be COW'd again until the DMA pin is released. So new reflinks would be
> blocked during the DMA pin period.
>
> To fix READ you'd have to treat it like WRITE and break the COW at GUP.
Right, that's what I'm proposing: that any longterm-GUP break COW as if
it were a write.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Jason Gunthorpe @ 2019-02-06 23:41 UTC (permalink / raw)
To: Dan Williams
Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 06, 2019 at 03:30:27PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
> >
> > > > Do they need to stick with xfs?
> > >
> > > Can you clarify the motivation for that question? This problem exists
> > > for any filesystem that implements an mmap where the physical
> > > page backing the mapping is identical to the physical storage location
> > > for the file data.
> >
> > .. and needs to dynamically change that mapping. Which is not really
> > something inherent to the general idea of a filesystem. A file system
> > that had *strictly static* block assignments would work fine.
> >
> > Not all filesystems even implement hole punch.
> >
> > Not all filesystems implement reflink.
> >
> > ftruncate doesn't *have* to instantly return the free blocks to the
> > allocation pool.
> >
> > i.e. this is not a DAX & RDMA issue but an XFS & RDMA issue.
> >
> > Replacing XFS is probably not reasonable, but I wonder if an XFS--
> > operating mode could exist that had enough features removed to be
> > safe?
>
> You're describing the current situation, i.e. Linux already implements
> this, it's called Device-DAX and some users of RDMA find it
> insufficient. The choices are to continue to tell them "no", or say
> "yes, but you need to submit to lease coordination".
Device-DAX is not what I'm imagining when I say XFS--.
I mean more like XFS with all features that require relocation of
blocks disabled.
Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
device-dax.
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Dan Williams @ 2019-02-07 0:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jerome Glisse, Jan Kara, linux-nvdimm, linux-rdma, John Hubbard,
Dave Chinner, Linux Kernel Mailing List, Matthew Wilcox,
Michal Hocko, Linux MM, Doug Ledford, Christopher Lameter,
lsf-pc
On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
[..]
> > You're describing the current situation, i.e. Linux already implements
> > this, it's called Device-DAX and some users of RDMA find it
> > insufficient. The choices are to continue to tell them "no", or say
> > "yes, but you need to submit to lease coordination".
>
> Device-DAX is not what I'm imagining when I say XFS--.
>
> I mean more like XFS with all features that require relocation of
> blocks disabled.
>
> Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
> device-dax.
True, not all the way, but the distinction loses significance as you
lose fs features.
Filesystems mark DAX functionality experimental [1] precisely because
it forbids otherwise typical operations that work in the nominal page
cache case. An approach that says "let's cement the list of things a
filesystem or a core-memory-management facility can't do because RDMA
finds it awkward" is bad precedent. It's bad precedent because it
abdicates core kernel functionality to userspace and weakens the API
contract in surprising ways.
EBUSY is a horrible status code, especially if an administrator is
presented with an emergency situation where a filesystem needs to free
up storage capacity and get established memory registrations out of
the way. The motivation for the current status quo of failing memory
registration for DAX mappings is to help ensure the system does not
get into this situation where forward progress cannot be guaranteed.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Jason Gunthorpe @ 2019-02-07 5:33 UTC (permalink / raw)
To: Dan Williams
Cc: Doug Ledford, Dave Chinner, Christopher Lameter, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko, linux-nvdimm
On Wed, Feb 06, 2019 at 04:22:16PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 3:41 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> [..]
> > > You're describing the current situation, i.e. Linux already implements
> > > this, it's called Device-DAX and some users of RDMA find it
> > > insufficient. The choices are to continue to tell them "no", or say
> > > "yes, but you need to submit to lease coordination".
> >
> > Device-DAX is not what I'm imagining when I say XFS--.
> >
> > I mean more like XFS with all features that require relocation of
> > blocks disabled.
> >
> > Forbidding hole punch, reflink, cow, etc, doesn't devolve back to
> > device-dax.
>
> True, not all the way, but the distinction loses significance as you
> lose fs features.
>
> Filesystems mark DAX functionality experimental [1] precisely because
> it forbids otherwise typical operations that work in the nominal page
> cache case. An approach that says "let's cement the list of things a
> filesystem or a core-memory-management facility can't do because RDMA
> finds it awkward" is bad precedent.
I'm not saying these rules should apply globally.
I'm suggesting you could have a FS that supports gup_longterm by
design, and a FS that doesn't. And that is OK. They can have different
rules.
Obviously the golden case here is to use ODP (which doesn't call
gup_longterm at all) - that works for both.
Supporting non-ODP is a trade-off: users that want to run on
limited HW must accept limited functionality. Limited functionality is
better than no functionality.
Linux has many of these user-chosen trade-offs. This is how it supports
such a wide range of HW capabilities. Not all HW can do all
things. Some features really do need HW support. It has always been
that way.
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Doug Ledford @ 2019-02-07 1:57 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > through? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
Thanks for the link, I'll review the panel.
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through a device file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is about
> what to do for non-ODP capable hardware and the Filesystem-DAX
> facility. The current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
Is it going to be clear from the link above why reflink + DAX + RDMA is
a good/desirable thing?
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question?
I did a little googling and research before I asked that question.
According to the documentation, other FSes can work with DAX too (namely
ext2 and ext4). The question was more or less pondering whether or not
ext2 or ext4 + RDMA + DAX would solve people's problems without the
issues that xfs brings.
> This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.
If that's the case, then we are back to EBUSY *could* work (despite the
objections made so far).
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-07 1:57 ` Doug Ledford
0 siblings, 0 replies; 155+ messages in thread
From: Doug Ledford @ 2019-02-07 1:57 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
[-- Attachment #1: Type: text/plain, Size: 6255 bytes --]
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > Coming in late here too but isnt the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
Thanks for the link, I'll review the panel.
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through a device file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is about
> what to do for non-ODP capable hardware and the Filesystem-DAX
> facility. The current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
Is it going to be clear from the link above why reflink + DAX + RDMA is
a good/desirable thing?
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question?
I did a little googling and research before I asked that question.
According to the documentation, other FSes can work with DAX too (namely
ext2 and ext4). The question was more or less pondering whether or not
ext2 or ext4 + RDMA + DAX would solve people's problems without the
issues that xfs brings.
> This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.
If that's the case, then we are back to EBUSY *could* work (despite the
objections made so far).
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-07 2:48 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-07 2:48 UTC (permalink / raw)
To: Doug Ledford
Cc: Jan Kara, Linux MM, linux-rdma, John Hubbard, linux-nvdimm,
Dave Chinner, Linux Kernel Mailing List, Matthew Wilcox,
Michal Hocko, Jason Gunthorpe, Jerome Glisse,
Christopher Lameter, lsf-pc
On Wed, Feb 6, 2019 at 5:57 PM Doug Ledford <dledford@redhat.com> wrote:
[..]
> > > > Dave, you said the FS is responsible to arbitrate access to the
> > > > physical pages..
> > > >
> > > > Is it possible to have a filesystem for DAX that is more suited to
> > > > this environment? Ie designed to not require block reallocation (no
> > > > COW, no reflinks, different approach to ftruncate, etc)
> > >
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> Thanks for the link, I'll review the panel.
>
> > Currently the only way to get this to work is to use ODP capable
> > hardware, or Device-DAX. Device-DAX is a facility to map persistent
> > memory statically through a device file. It's great for statically
> > allocated use cases, but loses all the nice things (provisioning,
> > permissions, naming) that a filesystem gives you. This debate is about
> > what to do for non-ODP capable hardware and the Filesystem-DAX
> > facility. The current answer is "no RDMA for you".
> >
> > > Are DAX users demanding xfs, or is it just the
> > > filesystem of convenience?
> >
> > xfs is the only Linux filesystem that supports DAX and reflink.
>
> Is it going to be clear from the link above why reflink + DAX + RDMA is
> a good/desirable thing?
>
No, unfortunately it will only clarify the DAX + RDMA use case, but
you don't need to look very far to see that the trend for storage
management is toward more COW / reflink / thin-provisioning in more
places. Users want the flexibility to be able to delay, change, and
consolidate physical storage allocation decisions, otherwise
device-dax would have solved all these problems and we would not be
having this conversation.
> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question?
>
> I did a little googling and research before I asked that question.
> According to the documentation, other FSes can work with DAX too (namely
> ext2 and ext4). The question was more or less pondering whether or not
> ext2 or ext4 + RDMA + DAX would solve people's problems without the
> issues that xfs brings.
No, ext4 also supports hole punch, and the ext2 support is a toy. We
went through quite a bit of work to solve this problem for the
O_DIRECT pinned page case.
6b2bb7265f0b sched/wait: Introduce wait_var_event()
d6dc57e251a4 xfs, dax: introduce xfs_break_dax_layouts()
69eb5fa10eb2 xfs: prepare xfs_break_layouts() for another layout type
c63a8eae63d3 xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
5fac7408d828 mm, fs, dax: handle layout changes to pinned dax mappings
b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
430657b6be89 ext4: handle layout changes to pinned DAX mappings
cdbf8897cb09 dax: dax_layout_busy_page() warn on !exceptional
So the fs is prepared to notify RDMA applications of the need to
evacuate a mapping (layout change), and the timeout to respond to that
notification can be configured by the administrator. The debate is
about what to do when the platform owner needs to get a mapping out of
the way in bounded time.
> > This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data. I don't see it as an xfs specific problem. Rather,
> > xfs is taking the lead in this space because it has already deployed
> > and demonstrated that leases work for the pnfs4 block-server case, so
> > it seems logical to attempt to extend that case for non-ODP-RDMA.
> >
> > > Are they
> > > really trying to do COW backed mappings for the RDMA targets? Or do
> > > they want a COW backed FS but are perfectly happy if the specific RDMA
> > > targets are *not* COW and are statically allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
>
> If that's the case, then we are back to EBUSY *could* work (despite the
> objections made so far).
I linked it in my response to Jason [1], but the entire reason ext2,
ext4, and xfs scream "experimental" when DAX is enabled is because DAX
makes typical flows fail that used to work in the page-cache backed
mmap case. The failure of a data space management command like
fallocate(punch_hole) is more risky than just not allowing the memory
registration to happen in the first place. Leases result in a system
that has a chance at making forward progress.
The current state of disallowing RDMA for FS-DAX is one of the "if
(dax) goto fail;" conditions that needs to be solved before filesystem
developers graduate DAX from experimental status.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-February/019884.html
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-06 22:44 ` Dan Williams
@ 2019-02-07 2:42 ` Doug Ledford
-1 siblings, 0 replies; 155+ messages in thread
From: Doug Ledford @ 2019-02-07 2:42 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
I watched this, and I see that Oracle is all sorts of excited that their
storage machines can scale out, and they can access the storage and it
has basically no CPU load on the storage server while performing
millions of queries. What I didn't hear in there is why DAX has to be
in the picture, or why Oracle couldn't do the same thing with a simple
memory region exported directly to the RDMA subsystem, or why reflink or
any of the other features you talk about are needed. So, while these
things may legitimately be needed, this video did not tell me about
how/why they are needed, just that RDMA is really, *really* cool for
their use case and gets them 0% CPU utilization on their storage
servers. I didn't watch the whole thing though. Do they get into that
later on? Do they get to that level of technical discussion, or is this
all higher level?
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 2:42 ` Doug Ledford
@ 2019-02-07 3:13 ` Dan Williams
-1 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-07 3:13 UTC (permalink / raw)
To: Doug Ledford
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 6, 2019 at 6:42 PM Doug Ledford <dledford@redhat.com> wrote:
>
> On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> > On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > > Can someone give me a real world scenario that someone is *actually*
> > > asking for with this?
> >
> > I'll point to this example. At the 6:35 mark Kodi talks about the
> > Oracle use case for DAX + RDMA.
> >
> > https://youtu.be/ywKPPIE8JfQ?t=395
>
> I watched this, and I see that Oracle is all sorts of excited that their
> storage machines can scale out, and they can access the storage and it
> has basically no CPU load on the storage server while performing
> millions of queries. What I didn't hear in there is why DAX has to be
> in the picture, or why Oracle couldn't do the same thing with a simple
> memory region exported directly to the RDMA subsystem, or why reflink or
> any of the other features you talk about are needed. So, while these
> things may legitimately be needed, this video did not tell me about
> how/why they are needed, just that RDMA is really, *really* cool for
> their use case and gets them 0% CPU utilization on their storage
> servers. I didn't watch the whole thing though. Do they get into that
> later on? Do they get to that level of technical discussion, or is this
> all higher level?
They don't. The point of sharing that video was to illustrate the
RDMA-to-persistent-memory use case. That 0% CPU utilization is possible
because the RDMA target is not page cache or anonymous memory on the storage
box; it is a direct mapping of a file offset in DAX / persistent memory. A
solution to truncate lets that use case run on more than just Device-DAX or
ODP-capable adapters. That said, I need to let Ira jump in here, because
saying layout leases solve the problem is not true; they are just the start
of potentially solving it. It's not clear to me what the long tail of work
looks like once the filesystem raises a notification to the RDMA target
process.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 3:13 ` Dan Williams
(?)
@ 2019-02-07 17:23 ` Ira Weiny
-1 siblings, 0 replies; 155+ messages in thread
From: Ira Weiny @ 2019-02-07 17:23 UTC (permalink / raw)
To: Dan Williams
Cc: Doug Ledford, Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Wed, Feb 06, 2019 at 07:13:16PM -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 6:42 PM Doug Ledford <dledford@redhat.com> wrote:
> >
> > On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> > > On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > > > Can someone give me a real world scenario that someone is *actually*
> > > > asking for with this?
> > >
> > > I'll point to this example. At the 6:35 mark Kodi talks about the
> > > Oracle use case for DAX + RDMA.
> > >
> > > https://youtu.be/ywKPPIE8JfQ?t=395
> >
> > I watched this, and I see that Oracle is all sorts of excited that their
> > storage machines can scale out, and they can access the storage and it
> > has basically no CPU load on the storage server while performing
> > millions of queries. What I didn't hear in there is why DAX has to be
> > in the picture, or why Oracle couldn't do the same thing with a simple
> > memory region exported directly to the RDMA subsystem, or why reflink or
> > any of the other features you talk about are needed. So, while these
> > things may legitimately be needed, this video did not tell me about
> > how/why they are needed, just that RDMA is really, *really* cool for
> > their use case and gets them 0% CPU utilization on their storage
> > servers. I didn't watch the whole thing though. Do they get into that
> > later on? Do they get to that level of technical discussion, or is this
> > all higher level?
>
> They don't. The point of sharing that video was illustrating that RDMA
> to persistent memory use case. That 0% cpu utilization is because the
> RDMA target is not page-cache / anonymous on the storage box it's
> directly to a file offset in DAX / persistent memory. A solution to
> truncate lets that use case use more than just Device-DAX or ODP
> capable adapters. That said, I need to let Ira jump in here because
> saying layout leases solves the problem is not true, it's just the
> start of potentially solving the problem. It's not clear to me what
> the long tail of work looks like once the filesystem raises a
> notification to the RDMA target process.
This is exactly the problem that has been touched on by others throughout this
thread.
1) To fully support leases on all hardware, we will have to allow RDMA
   processes to be killed when they don't respond to the lease break.
   a) If the process has done something bad (like truncate or hole punch),
      then the idea that "they get what they deserve" may be OK.
   b) However, if the break is due to some underlying filesystem maintenance,
      this is, as Jason says, unreasonable. It would be much better to tell
      the application "you can't do this".
2) Fully responding to a lease revocation involves a number of kernel changes
   in the RDMA stack, but more importantly it means modifying every user-space
   RDMA application to respond to a message on a channel it may not even be
   listening to.
I think this is where Jason is getting very concerned. When you combine 1b
and 2 you end up with a solution that is not production-worthy.
NOTE: This is somewhat true of ODP hardware as well, since applications
register each individual RDMA memory region as either ODP or not, so out of
the box not all applications would work automatically.
Ira
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-06 22:44 ` Dan Williams
@ 2019-02-07 16:25 ` Doug Ledford
-1 siblings, 0 replies; 155+ messages in thread
From: Doug Ledford @ 2019-02-07 16:25 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
[-- Attachment #1: Type: text/plain, Size: 10169 bytes --]
I think I've finally wrapped my head around all of this. Let's see if I
have this right:
* People are using filesystem DAX to expose byte addressable persistent
memory because putting a filesystem on the memory makes an easy way to
organize the data in the memory and share it between various processes.
It's worth noting that this is not the only way to share this memory,
and arguably not even the best way, but it's what people are doing.
However, to get byte level addressability on the remote side, we must
create files on the server side, mmap those files, then give a handle to
the memory region to the client side that the client then addresses on a
byte by byte basis. This is because all of the normal kernel based
device sharing mechanisms are block based and don't provide byte level
addressability.
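The flow this bullet describes (create a file, mmap it, hand the region to the RDMA stack) can be sketched with ordinary mmap; only the final registration step, shown as a comment, needs DAX and an RDMA-capable NIC, so the path and names here are illustrative only:

```python
# Sketch of the server-side flow described above: create a file (on a
# DAX filesystem in the real use case; /tmp here for illustration),
# mmap it, and treat the mapping as byte-addressable storage.
import mmap
import os

path = "/tmp/pmem-demo.dat"   # would be a file on a DAX mount
size = 4096

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, size)

buf = mmap.mmap(fd, size)     # with DAX, these bytes are the storage media
buf[128:133] = b"hello"       # byte-granular store, no block read-modify-write

# A real server would now pin and register the region and hand the
# resulting rkey to clients, e.g. with libibverbs:
#   mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_REMOTE_READ | ...)
# That long-lived pin of file-backed pages is exactly what the
# truncate/punch-hole debate in this thread is about.

buf.flush()
buf.close()
os.close(fd)
```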
* People are asking for thin allocations, reflinks, deduplication,
whatever else because persistent memory is still small in terms of size
compared to the amount of data people want to put on it, so these
techniques stretch its usefulness.
* Because there is no kernel level mechanism for sharing byte
addressable memory, this only works with specific applications that have
been written to create files on byte addressable memory, mmap them, then
share them out via RDMA. I bring this up because in the video linked in
this email, Oracle is gushing about how great this feature is. But it's
important to understand that this only works because the Oracle
processes themselves are the filesystem sharing entity. That means at
other points in this conversation where we've talked about the need for
forward progress, and non-ODP hardware, and the talk has come down to
sending SIGKILL to a process in order to free memory reservations, I
feel confident in saying that Oracle would *never* agree to this. If
you kill an Oracle process to make forward progress, you are probably
also killing the very process that needed you to make progress in the
first place. I'm pretty confident that Oracle will have no problem
what-so-ever saying that ODP capable hardware is a hard requirement for
using their software with DAX.
* So if Oracle is likely to demand ODP hardware, period, are there other
scenarios that might be more accepting of a more limited FS on top of
DAX that doesn't support reflinks and deduplication? I can think of a
possible yes to that answer rather easily. Message brokerage servers
(amqp, qpid) have strict requirements about receiving a message and then
making sure that it makes it once, and only once, to all subscribed
receivers. A natural way of organizing this sort of thing is to create
a persistent ring buffer for incoming messages, one for each connecting
client that is sending messages. Then a log file for each client you
are sending messages back out to. Putting these files on persistent
memory and then mapping the ring buffer to the clients, and writing your
own transmission journals to the persistent memory, would allow the
program to be very robust in the face of a program or system crash.
This sort of usage would not require any thin allocations, reflinks, or
other such features, and yet would still find the filesystem
organization useful. Therefore I think the answer is yes, there are at
least some use cases that would find a less featureful filesystem that
works with persistent memory and RDMA but without ODP to be of value.
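The broker pattern above (a crash-robust per-client ring buffer in a file) is easy to sketch with an mmap'ed file. The layout and names below are invented for illustration, and on a DAX mount the msync-style flush would instead be a CPU cache flush plus fence:

```python
# Minimal persistent ring buffer in an mmap'ed file, sketching the
# message-broker use case above. On-disk layout (invented for this
# example): bytes 0-7 head index, bytes 8-15 tail index, then
# fixed-size 64-byte record slots.
import mmap
import os
import struct

SLOT = 64
NSLOTS = 16
HDR = 16

class PersistentRing:
    def __init__(self, path):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(fd, HDR + SLOT * NSLOTS)
        self.buf = mmap.mmap(fd, HDR + SLOT * NSLOTS)
        os.close(fd)

    def _idx(self, off):
        return struct.unpack_from("<Q", self.buf, off)[0]

    def push(self, msg: bytes):
        head, tail = self._idx(0), self._idx(8)
        if head - tail >= NSLOTS:
            raise BufferError("ring full")
        slot = HDR + (head % NSLOTS) * SLOT
        self.buf[slot:slot + SLOT] = msg.ljust(SLOT, b"\0")[:SLOT]
        # Persist the record before publishing it by bumping head; on
        # DAX this flush would be a cache-line writeback + fence.
        self.buf.flush()
        struct.pack_into("<Q", self.buf, 0, head + 1)
        self.buf.flush()

    def pop(self) -> bytes:
        head, tail = self._idx(0), self._idx(8)
        if head == tail:
            raise BufferError("ring empty")
        slot = HDR + (tail % NSLOTS) * SLOT
        msg = bytes(self.buf[slot:slot + SLOT]).rstrip(b"\0")
        struct.pack_into("<Q", self.buf, 8, tail + 1)
        self.buf.flush()
        return msg

ring = PersistentRing("/tmp/ring-demo.dat")
ring.push(b"commit 1")
ring.push(b"commit 2")
first = ring.pop()
```

Note that nothing here needs reflink, deduplication, or hole punch: the file is allocated once and rewritten in place, which is why this workload tolerates a less featureful DAX filesystem.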
* Really though, as I said in my email to Tom Talpey, this entire
situation is simply screaming that we are doing DAX networking wrong.
We shouldn't be rewriting the networking code in every single
application that wants to do this. If we had a memory segment that we
shared from server to client(s), and in that memory segment we
implemented a clustered filesystem, then applications would simply mmap
local files and be done with it. If the file needed to move, the kernel
would update the mmap in the application, done. If you ask me, it is
the attempt to do this the wrong way that is resulting in all this
heartache. That said, for today, my recommendation would be to require
ODP hardware for XFS filesystems with the DAX option, but allow ext2
filesystems to be mounted with DAX on non-ODP hardware, and go in and
modify the ext2 filesystem so that on DAX mounts, it disables hole punch
and ftruncate any time they would result in the forced removal of an
established mmap.
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > Coming in late here too, but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
>
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through device-file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is what
> to do about non-ODP capable hardware and Filesystem-DAX facility. The
> current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
>
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-07 16:25 ` Doug Ledford
0 siblings, 0 replies; 155+ messages in thread
From: Doug Ledford @ 2019-02-07 16:25 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
[-- Attachment #1: Type: text/plain, Size: 10169 bytes --]
I think I've finally wrapped my head around all of this. Let's see if I
have this right:
* People are using filesystem DAX to expose byte addressable persistent
memory because putting a filesystem on the memory makes an easy way to
organize the data in the memory and share it between various processes.
It's worth noting that this is not the only way to share this memory,
and arguably not even the best way, but it's what people are doing.
However, to get byte level addressability on the remote side, we must
create files on the server side, mmap those files, then give a handle to
the memory region to the client side that the client then addresses on a
byte by byte basis. This is because all of the normal kernel based
device sharing mechanisms are block based and don't provide byte level
addressability.
* People are asking for thin allocations, reflinks, deduplication,
whatever else because persistent memory is still small in terms of size
compared to the amount of data people want to put on it, so these
techniques stretch its usefulness.
* Because there is no kernel level mechanism for sharing byte
addressable memory, this only works with specific applications that have
been written to create files on byte addressable memory, mmap them, then
share them out via RDMA. I bring this up because in the video linked in
this email, Oracle is gushing about how great this feature is. But it's
important to understand that this only works because the Oracle
processes themselves are the filesystem sharing entity. That means at
other points in this conversation where we've talked about the need for
forward progress, and non-ODP hardware, and the talk has come down to
sending SIGKILL to a process in order to free memory reservations, I
feel confident in saying that Oracle would *never* agree to this. If
you kill an Oracle process to make forward progress, you are probably
also killing the very process that needed you to make progress in the
first place. I'm pretty confident that Oracle will have no problem
what-so-ever saying that ODP capable hardware is a hard requirement for
using their software with DAX.
* So if Oracle is likely to demand ODP hardware, period, are there other
scenarios that might be more accepting of a more limited FS on top of
DAX that doesn't support reflinks and deduplication? I can think of a
possible yes to that answer rather easily. Message brokerage servers
(amqp, qpid) have strict requirements about receiving a message and then
making sure that it makes it once, and only once, to all subscribed
receivers. A natural way of organizing this sort of thing is to create
a persistent ring buffer for incoming messages, one per each connecting
client that is sending messages. Then a log file for each client you
are sending messages back out to. Putting these files on persistent
memory and then mapping the ring buffer to the clients, and writing your
own transmission journals to the persistent memory, would allow the
program to be very robust in the face of a program or system crash.
This sort of usage would not require any thin allocations, reflinks, or
other such features, and yet would still find the filesystem
organization useful. Therefore I think the answer is yes, there are at
least some use cases that would find a less featureful filesystem that
works with persistent memory and RDMA but without ODP to be of value.
* Really though, as I said in my email to Tom Talpey, this entire
situation is simply screaming that we are doing DAX networking wrong.
We shouldn't be writing the networking code once in every single
application that wants to do this. If we had a memory segment that we
shared from server to client(s), and in that memory segment we
implemented a clustered filesystem, then applications would simply mmap
local files and be done with it. If the file needed to move, the kernel
would update the mmap in the application, done. If you ask me, it is
the attempt to do this the wrong way that is resulting in all this
heartache. That said, for today, my recommendation would be to require
ODP hardware for XFS filesystem with the DAX option, but allow ext2
filesystems to mount DAX filesystems on non-ODP hardware, and go in and
modify the ext2 filesystem so that on DAX mounts, it disables hole punch
and ftrunctate any time they would result in the forced removal of an
established mmap.
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@redhat.com> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > Coming in late here too but isnt the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiatiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
>
> Currently the only way to get this to work is to use ODP capable
hardware, or Device-DAX. Device-DAX is a facility to map persistent
memory statically through a device file. It's great for statically
allocated use cases, but loses all the nice things (provisioning,
permissions, naming) that a filesystem gives you. This debate is about
what to do about non-ODP capable hardware and the Filesystem-DAX
facility. The current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
>
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 16:25 ` Doug Ledford
@ 2019-02-07 16:55 ` Christopher Lameter
-1 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-07 16:55 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Jason Gunthorpe, Dave Chinner, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
One approach that may be a clean way to solve this:
1. Long term GUP usage requires the virtual mapping to the pages be fixed
for the duration of the GUP mapping. There has never been a way to break
the pinning and thus this needs to be preserved.
2. Page Cache Long term pins are not allowed since regular filesystems
depend on COW and other tricks which are incompatible with a long term
pin.
3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
provide the virtual mapping when the PIN is done and DO NO OPERATIONS
on the longterm pinned range until the long term pin is removed.
Hardware may do its job (like for persistent memory) but no data
consistency on the NVDIMM medium is guaranteed until the long term pin
is removed and the filesystem regains control over the area.
4. Long term pin means that the mapped sections are an actively used part
of the file (like a filesystem write) and it cannot be truncated for
the duration of the pin. It can be thought of as if the truncate is
immediately followed by a write extending the file again. The mapping
by RDMA implies, after all, that remote writes can occur at any time
within the area pinned long term.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 16:55 ` Christopher Lameter
@ 2019-02-07 17:35 ` Ira Weiny
2019-02-07 18:17 ` Christopher Lameter
-1 siblings, 1 reply; 155+ messages in thread
From: Ira Weiny @ 2019-02-07 17:35 UTC (permalink / raw)
To: Christopher Lameter
Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Dave Chinner,
Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> One approach that may be a clean way to solve this:
>
> 1. Long term GUP usage requires the virtual mapping to the pages be fixed
> for the duration of the GUP Map. There never has been a way to break
> the pinnning and thus this needs to be preserved.
How does this fit in with the changes John is making?
>
> 2. Page Cache Long term pins are not allowed since regular filesystems
> depend on COW and other tricks which are incompatible with a long term
> pin.
Unless the hardware supports ODP or equivalent functionality. Right?
>
> 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> on the longterm pinned range until the long term pin is removed.
> Hardware may do its job (like for persistent memory) but no data
> consistency on the NVDIMM medium is guaranteed until the long term pin
> is removed and the filesystems regains control over the area.
I believe Dan attempted something like this and it became pretty difficult.
>
> 4. Long term pin means that the mapped sections are an actively used part
> of the file (like a filesystem write) and it cannot be truncated for
> the duration of the pin. It can be thought of as if the truncate is
> immediate followed by a write extending the file again. The mapping
> by RDMA implies after all that remote writes can occur at anytime
> within the area pinned long term.
>
This is a very interesting idea. I've never quite thought of it that way.
That would be essentially like failing the truncate but without actually
failing it... sneaky. ;-)
What if user space then writes to the end of the file? Does that write end
up at the point they truncated to, or off the end of the mmapped area (old
length)?
I can see the behavior being defined either way. But one interferes with the
RDMA data and the other does not. Not sure which is easier for the FS to
handle either.
Ira
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 17:35 ` Ira Weiny
@ 2019-02-07 18:17 ` Christopher Lameter
0 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-07 18:17 UTC (permalink / raw)
To: Ira Weiny
Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Dave Chinner,
Matthew Wilcox, Jan Kara, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Thu, 7 Feb 2019, Ira Weiny wrote:
> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> >
> > 1. Long term GUP usage requires the virtual mapping to the pages be fixed
> > for the duration of the GUP Map. There never has been a way to break
> > the pinnning and thus this needs to be preserved.
>
> How does this fit in with the changes John is making?
>
> >
> > 2. Page Cache Long term pins are not allowed since regular filesystems
> > depend on COW and other tricks which are incompatible with a long term
> > pin.
>
> Unless the hardware supports ODP or equivalent functionality. Right?
Ok we could make an exception there. But that is not required as a first
step and only some hardware would support it.
> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > on the longterm pinned range until the long term pin is removed.
> > Hardware may do its job (like for persistent memory) but no data
> > consistency on the NVDIMM medium is guaranteed until the long term pin
> > is removed and the filesystems regains control over the area.
>
> I believe Dan attempted something like this and it became pretty difficult.
What is difficult about leaving things alone that are pinned? We already
have to do that currently because the refcount is elevated.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 16:55 ` Christopher Lameter
@ 2019-02-08 4:43 ` Dave Chinner
2019-02-08 11:10 ` Jan Kara
2019-02-08 15:33 ` Christopher Lameter
-1 siblings, 2 replies; 155+ messages in thread
From: Dave Chinner @ 2019-02-08 4:43 UTC (permalink / raw)
To: Christopher Lameter
Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> One approach that may be a clean way to solve this:
> 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> on the longterm pinned range until the long term pin is removed.
So, ummm, how do we do block allocation then, which is done on
demand during writes?
IOWs, this requires the application to set up the file in the
correct state for the filesystem to lock it down so somebody else
can write to it. That means the file can't be sparse, it can't be
> preallocated (i.e. can't contain unwritten extents), it must have zeroes
> written to its full size before being shared because otherwise it
> exposes stale data to the remote client (secure sites are going to
> love that!), it can't be extended, etc.
IOWs, once the file is prepped and leased out for RDMA, it becomes
> immutable for the purposes of local access.
Which, essentially we can already do. Prep the file, map it
read/write, mark it immutable, then pin it via the longterm gup
interface which can do the necessary checks.
Simple to implement, the reasons for errors trying to modify the
file are already documented and queriable, and it's hard for
applications to get wrong.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-08 4:43 ` Dave Chinner
@ 2019-02-08 11:10 ` Jan Kara
2019-02-08 20:50 ` Dan Williams
2019-02-08 21:20 ` Dave Chinner
2019-02-08 15:33 ` Christopher Lameter
1 sibling, 2 replies; 155+ messages in thread
From: Jan Kara @ 2019-02-08 11:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Christopher Lameter, Doug Ledford, Dan Williams, Jason Gunthorpe,
Matthew Wilcox, Jan Kara, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > on the longterm pinned range until the long term pin is removed.
>
> So, ummm, how do we do block allocation then, which is done on
> demand during writes?
>
> IOWs, this requires the application to set up the file in the
> correct state for the filesystem to lock it down so somebody else
> can write to it. That means the file can't be sparse, it can't be
> preallocated (i.e. can't contain unwritten extents), it must have zeroes
> written to it's full size before being shared because otherwise it
> exposes stale data to the remote client (secure sites are going to
> love that!), they can't be extended, etc.
>
> IOWs, once the file is prepped and leased out for RDMA, it becomes
> an immutable for the purposes of local access.
>
> Which, essentially we can already do. Prep the file, map it
> read/write, mark it immutable, then pin it via the longterm gup
> interface which can do the necessary checks.
Hum, and what will you do if the immutable file that is the target for RDMA
becomes the source of a reflink? That seems to be currently allowed for
immutable files but RDMA store would be effectively corrupting the data of
the target inode. But we could treat it similarly as swapfiles - those also
have to deal with writes to blocks beyond filesystem control. In fact the
similarity seems to be quite large there. What do you think?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-08 11:10 ` Jan Kara
@ 2019-02-08 20:50 ` Dan Williams
2019-02-08 21:20 ` Dave Chinner
1 sibling, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-08 20:50 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > One approach that may be a clean way to solve this:
> > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > on the longterm pinned range until the long term pin is removed.
> >
> > So, ummm, how do we do block allocation then, which is done on
> > demand during writes?
> >
> > IOWs, this requires the application to set up the file in the
> > correct state for the filesystem to lock it down so somebody else
> > can write to it. That means the file can't be sparse, it can't be
> > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > written to it's full size before being shared because otherwise it
> > exposes stale data to the remote client (secure sites are going to
> > love that!), they can't be extended, etc.
> >
> > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > an immutable for the purposes of local access.
> >
> > Which, essentially we can already do. Prep the file, map it
> > read/write, mark it immutable, then pin it via the longterm gup
> > interface which can do the necessary checks.
>
> Hum, and what will you do if the immutable file that is target for RDMA
> will be a source of reflink? That seems to be currently allowed for
> immutable files but RDMA store would be effectively corrupting the data of
> the target inode. But we could treat it similarly as swapfiles - those also
> have to deal with writes to blocks beyond filesystem control. In fact the
> similarity seems to be quite large there. What do you think?
This sounds so familiar...
https://lwn.net/Articles/726481/
I'm not opposed to trying again, but leases were what crawled out of the
smoking crater when this last proposal was nuked.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-08 20:50 ` Dan Williams
@ 2019-02-11 10:24 ` Jan Kara
2019-02-11 17:22 ` Dan Williams
-1 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2019-02-11 10:24 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
Jason Gunthorpe, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Fri 08-02-19 12:50:37, Dan Williams wrote:
> On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > One approach that may be a clean way to solve this:
> > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > on the longterm pinned range until the long term pin is removed.
> > >
> > > So, ummm, how do we do block allocation then, which is done on
> > > demand during writes?
> > >
> > > IOWs, this requires the application to set up the file in the
> > > correct state for the filesystem to lock it down so somebody else
> > > can write to it. That means the file can't be sparse, it can't be
> > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > written to it's full size before being shared because otherwise it
> > > exposes stale data to the remote client (secure sites are going to
> > > love that!), they can't be extended, etc.
> > >
> > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > an immutable for the purposes of local access.
> > >
> > > Which, essentially we can already do. Prep the file, map it
> > > read/write, mark it immutable, then pin it via the longterm gup
> > > interface which can do the necessary checks.
> >
> > Hum, and what will you do if the immutable file that is target for RDMA
> > will be a source of reflink? That seems to be currently allowed for
> > immutable files but RDMA store would be effectively corrupting the data of
> > the target inode. But we could treat it similarly as swapfiles - those also
> > have to deal with writes to blocks beyond filesystem control. In fact the
> > similarity seems to be quite large there. What do you think?
>
> This sounds so familiar...
>
> https://lwn.net/Articles/726481/
>
> I'm not opposed to trying again, but leases was what crawled out
> smoking crater when this last proposal was nuked.
Umm, I don't think this is that similar to the daxctl() discussion. We are
not speaking about providing any new userspace API for this. Also I think the
situation about leases has somewhat cleared up with this discussion - ODP
hardware does not need leases since it can use MMU notifiers, for non-ODP
hardware it is difficult to handle leases as such hardware has only one big
kill-everything call and using that would effectively mean lot of work on
the userspace side to resetup everything to make things useful if workable
at all.
So my proposal would be:
1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
its teardown when fs needs it.
2) Hardware not capable of tearing down pins from MMU notifiers will have
to use gup_longterm() (we may actually rename it to a more suitable name).
The FS may refuse such calls (for a normal page-cache backed file it will
just return success, but for a DAX file it will do sanity checks such as
whether the file is fully allocated, like we currently do for swapfiles),
but if
gup_longterm() returns success, it will provide the same guarantees as for
swapfiles. So the only thing that we need is some call from gup_longterm()
to a filesystem callback to tell it - this file is going to be used by a
third party as an IO buffer, don't touch it. And we can (and should)
probably refactor the handling to be shared between swapfiles and
gup_longterm().
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 10:24 ` Jan Kara
@ 2019-02-11 17:22 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-11 17:22 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > One approach that may be a clean way to solve this:
> > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > on the longterm pinned range until the long term pin is removed.
> > > >
> > > > So, ummm, how do we do block allocation then, which is done on
> > > > demand during writes?
> > > >
> > > > IOWs, this requires the application to set up the file in the
> > > > correct state for the filesystem to lock it down so somebody else
> > > > can write to it. That means the file can't be sparse, it can't be
> > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > written to it's full size before being shared because otherwise it
> > > > exposes stale data to the remote client (secure sites are going to
> > > > love that!), they can't be extended, etc.
> > > >
> > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > an immutable for the purposes of local access.
> > > >
> > > > Which, essentially we can already do. Prep the file, map it
> > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > interface which can do the necessary checks.
> > >
> > > Hum, and what will you do if the immutable file that is target for RDMA
> > > will be a source of reflink? That seems to be currently allowed for
> > > immutable files but RDMA store would be effectively corrupting the data of
> > > the target inode. But we could treat it similarly as swapfiles - those also
> > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > similarity seems to be quite large there. What do you think?
> >
> > This sounds so familiar...
> >
> > https://lwn.net/Articles/726481/
> >
> > I'm not opposed to trying again, but leases was what crawled out
> > smoking crater when this last proposal was nuked.
>
> Umm, don't think this is that similar to daxctl() discussion. We are not
> speaking about providing any new userspace API for this.
I thought an explicit userspace API was one of the outcomes, i.e. that we
can't depend on this behavior being an implicit side effect of a page
pin?
> Also I think the
> situation about leases has somewhat cleared up with this discussion - ODP
> hardware does not need leases since it can use MMU notifiers, for non-ODP
> hardware it is difficult to handle leases as such hardware has only one big
> kill-everything call and using that would effectively mean lot of work on
> the userspace side to resetup everything to make things useful if workable
> at all.
>
> So my proposal would be:
>
> 1) ODP hardward uses gup_fast() like direct IO and uses MMU notifiers to do
> its teardown when fs needs it.
>
> 2) Hardware not capable of tearing down pins from MMU notifiers will have
> to use gup_longterm() (we may actually rename it to a more suitable name).
> FS may just refuse such calls (for normal page cache backed file, it will
> just return success but for DAX file it will do sanity checks whether the
> file is fully allocated etc. like we currently do for swapfiles) but if
> gup_longterm() returns success, it will provide the same guarantees as for
> swapfiles. So the only thing that we need is some call from gup_longterm()
> to a filesystem callback to tell it - this file is going to be used by a
> third party as an IO buffer, don't touch it. And we can (and should)
> probably refactor the handling to be shared between swapfiles and
> gup_longterm().
Yes, lets pursue this. At the risk of "arguing past 'yes'" this is a
solution I thought we dax folks walked away from in the original
MAP_DIRECT discussion [1]. Here is where leases were the response to
MAP_DIRECT [2]. ...and here is where we had tame discussions about
implications of notifying memory-registrations of lease break events
[3].
I honestly don't like the idea that random subsystems can pin down
file blocks as a side effect of gup on the result of mmap. Recall that
it's not just RDMA that wants this guarantee. It seems safer to have
the file be in an explicit block-allocation-immutable-mode so that the
fallocate man page can describe this error case. Otherwise how would
you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
[1]: https://lwn.net/Articles/736333/
[2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
[3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-11 17:22 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-11 17:22 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > One approach that may be a clean way to solve this:
> > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > on the longterm pinned range until the long term pin is removed.
> > > >
> > > > So, ummm, how do we do block allocation then, which is done on
> > > > demand during writes?
> > > >
> > > > IOWs, this requires the application to set up the file in the
> > > > correct state for the filesystem to lock it down so somebody else
> > > > can write to it. That means the file can't be sparse, it can't be
> > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > written to it's full size before being shared because otherwise it
> > > > exposes stale data to the remote client (secure sites are going to
> > > > love that!), they can't be extended, etc.
> > > >
> > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > an immutable for the purposes of local access.
> > > >
> > > > Which, essentially we can already do. Prep the file, map it
> > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > interface which can do the necessary checks.
> > >
> > > Hum, and what will you do if the immutable file that is target for RDMA
> > > will be a source of reflink? That seems to be currently allowed for
> > > immutable files but RDMA store would be effectively corrupting the data of
> > > the target inode. But we could treat it similarly as swapfiles - those also
> > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > similarity seems to be quite large there. What do you think?
> >
> > This sounds so familiar...
> >
> > https://lwn.net/Articles/726481/
> >
> > I'm not opposed to trying again, but leases was what crawled out
> > smoking crater when this last proposal was nuked.
>
> Umm, don't think this is that similar to daxctl() discussion. We are not
> speaking about providing any new userspace API for this.
I thought an explicit userspace API was one of the outcomes, i.e. that we
can't depend on this behavior being an implicit side effect of a page
pin?
> Also I think the
> situation about leases has somewhat cleared up with this discussion - ODP
> hardware does not need leases since it can use MMU notifiers, for non-ODP
> hardware it is difficult to handle leases as such hardware has only one big
> kill-everything call and using that would effectively mean lot of work on
> the userspace side to resetup everything to make things useful if workable
> at all.
>
> So my proposal would be:
>
> 1) ODP hardward uses gup_fast() like direct IO and uses MMU notifiers to do
> its teardown when fs needs it.
>
> 2) Hardware not capable of tearing down pins from MMU notifiers will have
> to use gup_longterm() (we may actually rename it to a more suitable name).
> FS may just refuse such calls (for normal page cache backed file, it will
> just return success but for DAX file it will do sanity checks whether the
> file is fully allocated etc. like we currently do for swapfiles) but if
> gup_longterm() returns success, it will provide the same guarantees as for
> swapfiles. So the only thing that we need is some call from gup_longterm()
> to a filesystem callback to tell it - this file is going to be used by a
> third party as an IO buffer, don't touch it. And we can (and should)
> probably refactor the handling to be shared between swapfiles and
> gup_longterm().
Yes, let's pursue this. At the risk of "arguing past 'yes'", this is a
solution I thought we dax folks walked away from in the original
MAP_DIRECT discussion [1]. Here is where leases were the response to
MAP_DIRECT [2]. ...and here is where we had tame discussions about
implications of notifying memory-registrations of lease break events
[3].
I honestly don't like the idea that random subsystems can pin down
file blocks as a side effect of gup on the result of mmap. Recall that
it's not just RDMA that wants this guarantee. It seems safer to have
the file be in an explicit block-allocation-immutable-mode so that the
fallocate man page can describe this error case. Otherwise how would
you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
[1]: https://lwn.net/Articles/736333/
[2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
[3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 17:22 ` Dan Williams
@ 2019-02-11 18:06 ` Jason Gunthorpe
2019-02-11 18:15 ` Dan Williams
` (3 more replies)
-1 siblings, 4 replies; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 18:06 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> I honestly don't like the idea that random subsystems can pin down
> file blocks as a side effect of gup on the result of mmap. Recall that
> it's not just RDMA that wants this guarantee. It seems safer to have
> the file be in an explicit block-allocation-immutable-mode so that the
> fallocate man page can describe this error case. Otherwise how would
> you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
I rather liked CL's version of this - ftruncate/etc is simply racing
with a parallel pwrite - and it doesn't fail.
But it also doesn't truncate/create a hole. Another thread wrote to it
right away and the 'hole' was essentially instantly reallocated. This
is an inherent, pre-existing race in the ftruncate/etc APIs.
Jason
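The race Jason describes can be illustrated from userspace: a truncate followed by a pwrite() at the old offset (standing in for the pinned-page writer) leaves the file as if the truncate had raced with a write. A minimal sketch of this behavior, assuming ordinary POSIX file semantics and a throwaway temp file (the helper name is invented for the demo):

```python
import os
import tempfile

def truncate_vs_pwrite() -> tuple[int, bytes]:
    """Truncate a 3-page file to 1 page, then pwrite() at the old offset,
    standing in for the pinned-page (RDMA) writer.  Returns the resulting
    file size and the contents of the formerly-truncated middle page."""
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, b"x" * 12288, 0)     # three 4 KiB pages of data
        os.ftruncate(fd, 4096)             # "free" the last two pages
        assert os.fstat(fd).st_size == 4096
        os.pwrite(fd, b"y" * 4096, 8192)   # the "device" writes at the old offset
        # The file is silently re-extended: the truncate did not fail, but
        # the hole it created was instantly reallocated by the racing writer.
        return os.fstat(fd).st_size, os.pread(fd, 4096, 4096)
    finally:
        os.close(fd)
        os.unlink(path)

size, middle = truncate_vs_pwrite()
print(size)                     # 12288: back to the pre-truncate size
print(middle == b"\0" * 4096)   # True: the unwritten gap reads as zeros
```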
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:06 ` Jason Gunthorpe
@ 2019-02-11 18:15 ` Dan Williams
2019-02-11 18:19 ` Ira Weiny
` (2 subsequent siblings)
3 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-11 18:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 10:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
>
> But it also doesn't truncate/create a hole. Another thread wrote to it
> right away and the 'hole' was essentially instantly reallocated. This
> is an inherent, pre-existing race in the ftruncate/etc APIs.
If the options are telling the truth with a potentially unexpected
error, or lying that the operation succeeded when it will be
immediately undone, I'd choose the former.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:06 ` Jason Gunthorpe
2019-02-11 18:15 ` Dan Williams
@ 2019-02-11 18:19 ` Ira Weiny
2019-02-11 18:26 ` Jason Gunthorpe
` (2 more replies)
2019-02-12 16:28 ` Jan Kara
2019-02-14 20:26 ` Jerome Glisse
3 siblings, 3 replies; 155+ messages in thread
From: Ira Weiny @ 2019-02-11 18:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
>
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing race in the ftruncate/etc APIs.
I kind of like it as well, except Christopher did not answer my question:
What if user space then writes to the end of the file with a regular write?
Does that write end up at the point they truncated to or off the end of the
mmaped area (old length)?
To make this work I think it has to be the latter. And as you say the semantic
is as if another thread wrote to the file first (but in this case the other
thread is the RDMA device).
In addition I'm not sure what the overall work is for this case?
John's patches will indicate to the FS that the page is gup pinned. But they
will not indicate longterm vs "shortterm". A shortterm pin could be handled
as a "real truncate". So, are we back to needing a longterm "bit" in struct
page to indicate a longterm pin and allow the FS to perform this "virtual
write" after truncate?
Or is it safe to consider all gup pinned pages this way?
Ira
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:19 ` Ira Weiny
@ 2019-02-11 18:26 ` Jason Gunthorpe
2019-02-11 18:40 ` Matthew Wilcox
2019-02-11 21:08 ` Jerome Glisse
2019-02-11 21:22 ` John Hubbard
2 siblings, 1 reply; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 18:26 UTC (permalink / raw)
To: Ira Weiny
Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> >
> > > I honestly don't like the idea that random subsystems can pin down
> > > file blocks as a side effect of gup on the result of mmap. Recall that
> > > it's not just RDMA that wants this guarantee. It seems safer to have
> > > the file be in an explicit block-allocation-immutable-mode so that the
> > > fallocate man page can describe this error case. Otherwise how would
> > > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> >
> > I rather liked CL's version of this - ftruncate/etc is simply racing
> > with a parallel pwrite - and it doesn't fail.
> >
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing race in the ftruncate/etc APIs.
>
> I kind of like it as well, except Christopher did not answer my question:
>
> What if user space then writes to the end of the file with a regular write?
> Does that write end up at the point they truncated to or off the end of the
> mmaped area (old length)?
IIRC it depends how the user does the write..
pwrite() with a given offset will write to that offset, re-extending
the file if needed
A file opened with O_APPEND and a write done with write() should
append to the new end
A normal file with a normal write should write to the FD's current
seek pointer.
I'm not sure what happens if you write via mmap/msync.
RDMA is similar to pwrite() and mmap.
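The placement rules above can be checked with a short userspace sketch (an illustration of POSIX write semantics, not of RDMA itself; the helper name is invented for the demo):

```python
import os
import tempfile

def write_placement() -> tuple[int, int, int]:
    """After an ftruncate() that shrinks the file, compare where pwrite(),
    a plain write() (FD offset unchanged), and an O_APPEND write() land.
    Returns the file size after each of the three writes."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * 100)             # FD offset is now 100 (the old EOF)
        os.ftruncate(fd, 50)                 # shrink; the FD offset stays at 100

        os.pwrite(fd, b"B", 80)              # pwrite: explicit offset, re-extends
        after_pwrite = os.fstat(fd).st_size  # -> 81

        os.write(fd, b"C")                   # plain write: lands at offset 100
        after_write = os.fstat(fd).st_size   # -> 101 (zero-filled gap at 81..100)

        afd = os.open(path, os.O_WRONLY | os.O_APPEND)
        os.write(afd, b"D")                  # O_APPEND: always the current EOF
        after_append = os.fstat(afd).st_size # -> 102
        os.close(afd)
        return after_pwrite, after_write, after_append
    finally:
        os.close(fd)
        os.unlink(path)

print(write_placement())  # (81, 101, 102)
```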
> Or is it safe to consider all gup pinned pages this way?
O_DIRECT still has to work sensibly, and if you ftruncate something
that is currently being written with O_DIRECT it should behave the
same as if the CPU touched the mmap'd memory, IMHO.
The only real change here is that if there is a GUP then ftruncate/etc
races are always resolved as 'GUP user goes last' instead of randomly.
ftruncate/etc already only work as you'd expect if the operator has
excluded writes. Otherwise blocks are instantly reallocated by another
racing thread.
I'm not sure why RDMA should be so special to earn an error code ..
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:26 ` Jason Gunthorpe
@ 2019-02-11 18:40 ` Matthew Wilcox
2019-02-11 19:58 ` Dan Williams
0 siblings, 1 reply; 155+ messages in thread
From: Matthew Wilcox @ 2019-02-11 18:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > What if user space then writes to the end of the file with a regular write?
> > Does that write end up at the point they truncated to or off the end of the
> > mmaped area (old length)?
>
> IIRC it depends how the user does the write..
>
> pwrite() with a given offset will write to that offset, re-extending
> the file if needed
>
> A file opened with O_APPEND and a write done with write() should
> append to the new end
>
> A normal file with a normal write should write to the FD's current
> seek pointer.
>
> I'm not sure what happens if you write via mmap/msync.
>
> RDMA is similar to pwrite() and mmap.
A pertinent point that you didn't mention is that ftruncate() does not change
the file offset. So there's no user-visible change in behaviour.
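Matthew's point is easy to verify from userspace; a minimal sketch (the helper name is invented for the demo):

```python
import os
import tempfile

def offset_after_truncate() -> tuple[int, int]:
    """Shrink a file with ftruncate() and report (file size, FD offset):
    the size changes, the seek pointer does not."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 1000)   # FD offset is now 1000
        os.ftruncate(fd, 10)        # shrink the file...
        return os.fstat(fd).st_size, os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
        os.unlink(path)

print(offset_after_truncate())  # (10, 1000): offset untouched by ftruncate
```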
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:40 ` Matthew Wilcox
@ 2019-02-11 19:58 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-11 19:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jason Gunthorpe, Ira Weiny, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > What if user space then writes to the end of the file with a regular write?
> > > Does that write end up at the point they truncated to or off the end of the
> > > mmaped area (old length)?
> >
> > IIRC it depends how the user does the write..
> >
> > pwrite() with a given offset will write to that offset, re-extending
> > the file if needed
> >
> > A file opened with O_APPEND and a write done with write() should
> > append to the new end
> >
> > A normal file with a normal write should write to the FD's current
> > seek pointer.
> >
> > I'm not sure what happens if you write via mmap/msync.
> >
> > RDMA is similar to pwrite() and mmap.
>
> A pertinent point that you didn't mention is that ftruncate() does not change
> the file offset. So there's no user-visible change in behaviour.
...but there is. The blocks you thought you freed, especially if the
system was under -ENOSPC pressure, won't actually be free after the
successful ftruncate().
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 19:58 ` Dan Williams
@ 2019-02-11 20:49 ` Jason Gunthorpe
2019-02-11 21:02 ` Dan Williams
-1 siblings, 1 reply; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 20:49 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > What if user space then writes to the end of the file with a regular write?
> > > > Does that write end up at the point they truncated to or off the end of the
> > > > mmaped area (old length)?
> > >
> > > IIRC it depends how the user does the write..
> > >
> > > pwrite() with a given offset will write to that offset, re-extending
> > > the file if needed
> > >
> > > A file opened with O_APPEND and a write done with write() should
> > > append to the new end
> > >
> > > A normal file with a normal write should write to the FD's current
> > > seek pointer.
> > >
> > > I'm not sure what happens if you write via mmap/msync.
> > >
> > > RDMA is similar to pwrite() and mmap.
> >
> > A pertinent point that you didn't mention is that ftruncate() does not change
> > the file offset. So there's no user-visible change in behaviour.
>
> ...but there is. The blocks you thought you freed, especially if the
> system was under -ENOSPC pressure, won't actually be free after the
> successful ftruncate().
They won't be free after something dirties the existing mmap either.
Blocks also won't be free if you unlink a file that is currently still
open.
This isn't really new behavior for a FS.
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 20:49 ` Jason Gunthorpe
@ 2019-02-11 21:02 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-11 21:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > What if user space then writes to the end of the file with a regular write?
> > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > mmaped area (old length)?
> > > >
> > > > IIRC it depends how the user does the write..
> > > >
> > > > pwrite() with a given offset will write to that offset, re-extending
> > > > the file if needed
> > > >
> > > > A file opened with O_APPEND and a write done with write() should
> > > > append to the new end
> > > >
> > > > A normal file with a normal write should write to the FD's current
> > > > seek pointer.
> > > >
> > > > I'm not sure what happens if you write via mmap/msync.
> > > >
> > > > RDMA is similar to pwrite() and mmap.
> > >
> > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > the file offset. So there's no user-visible change in behaviour.
> >
> > ...but there is. The blocks you thought you freed, especially if the
> > system was under -ENOSPC pressure, won't actually be free after the
> > successful ftruncate().
>
> They won't be free after something dirties the existing mmap either.
>
> Blocks also won't be free if you unlink a file that is currently still
> open.
>
> This isn't really new behavior for a FS.
An mmap write after a fault due to a hole punch is free to trigger
SIGBUS if the subsequent page allocation fails. So no, I don't see
them as the same unless you're allowing for the holder of the MR to
receive a re-fault failure.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 21:02 ` Dan Williams
@ 2019-02-11 21:09 ` Jason Gunthorpe
2019-02-12 16:34 ` Jan Kara
-1 siblings, 1 reply; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 21:09 UTC (permalink / raw)
To: Dan Williams
Cc: Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 01:02:37PM -0800, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > > What if user space then writes to the end of the file with a regular write?
> > > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > > mmaped area (old length)?
> > > > >
> > > > > IIRC it depends how the user does the write..
> > > > >
> > > > > pwrite() with a given offset will write to that offset, re-extending
> > > > > the file if needed
> > > > >
> > > > > A file opened with O_APPEND and a write done with write() should
> > > > > append to the new end
> > > > >
> > > > > A normal file with a normal write should write to the FD's current
> > > > > seek pointer.
> > > > >
> > > > > I'm not sure what happens if you write via mmap/msync.
> > > > >
> > > > > RDMA is similar to pwrite() and mmap.
> > > >
> > > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > > the file offset. So there's no user-visible change in behaviour.
> > >
> > > ...but there is. The blocks you thought you freed, especially if the
> > > system was under -ENOSPC pressure, won't actually be free after the
> > > successful ftruncate().
> >
> > They won't be free after something dirties the existing mmap either.
> >
> > Blocks also won't be free if you unlink a file that is currently still
> > open.
> >
> > This isn't really new behavior for a FS.
>
> An mmap write after a fault due to a hole punch is free to trigger
> SIGBUS if the subsequent page allocation fails.
Isn't that already racy? If the mmap user is fast enough can't it
prevent the page from becoming freed in the first place today?
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 21:09 ` Jason Gunthorpe
@ 2019-02-12 16:34 ` Jan Kara
2019-02-12 16:55 ` Christopher Lameter
0 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2019-02-12 16:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Dan Williams, Matthew Wilcox, Ira Weiny, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon 11-02-19 14:09:56, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 01:02:37PM -0800, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 12:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Mon, Feb 11, 2019 at 11:58:47AM -0800, Dan Williams wrote:
> > > > On Mon, Feb 11, 2019 at 10:40 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Feb 11, 2019 at 11:26:49AM -0700, Jason Gunthorpe wrote:
> > > > > > On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> > > > > > > What if user space then writes to the end of the file with a regular write?
> > > > > > > Does that write end up at the point they truncated to or off the end of the
> > > > > > > mmaped area (old length)?
> > > > > >
> > > > > > IIRC it depends how the user does the write..
> > > > > >
> > > > > > pwrite() with a given offset will write to that offset, re-extending
> > > > > > the file if needed
> > > > > >
> > > > > > A file opened with O_APPEND and a write done with write() should
> > > > > > append to the new end
> > > > > >
> > > > > > A normal file with a normal write should write to the FD's current
> > > > > > seek pointer.
> > > > > >
> > > > > > I'm not sure what happens if you write via mmap/msync.
> > > > > >
> > > > > > RDMA is similar to pwrite() and mmap.
> > > > >
> > > > > A pertinent point that you didn't mention is that ftruncate() does not change
> > > > > the file offset. So there's no user-visible change in behaviour.
> > > >
> > > > ...but there is. The blocks you thought you freed, especially if the
> > > > system was under -ENOSPC pressure, won't actually be free after the
> > > > successful ftruncate().
> > >
> > > They won't be free after something dirties the existing mmap either.
> > >
> > > Blocks also won't be free if you unlink a file that is currently still
> > > open.
> > >
> > > This isn't really new behavior for a FS.
> >
> > An mmap write after a fault due to a hole punch is free to trigger
> > SIGBUS if the subsequent page allocation fails.
>
> Isn't that already racy? If the mmap user is fast enough can't it
> prevent the page from becoming freed in the first place today?
No, it cannot. We block page faulting for the file (via a lock), tear down
page tables, free pages and blocks. Then we resume faults and return
SIGBUS (if the page ends up being after the new end of file in case of
truncate) or do new page fault and fresh block allocation (which can end
with SIGBUS if the filesystem cannot allocate new block to back the page).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
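The teardown-then-refault sequence Jan describes can be observed from userspace; a hedged, Linux-specific sketch (the helper name is invented, and it assumes a filesystem with ordinary mmap semantics):

```python
import mmap
import os
import signal
import tempfile

def fault_after_truncate() -> int:
    """Map one page of a file, truncate the file to zero, then touch the
    mapped page in a child process.  The re-fault finds the page beyond
    EOF, so the child dies with SIGBUS.  Returns the terminating signal."""
    fd, path = tempfile.mkstemp()
    os.pwrite(fd, b"x" * 4096, 0)
    mm = mmap.mmap(fd, 4096)
    os.ftruncate(fd, 0)                 # page tables torn down; page beyond EOF

    pid = os.fork()
    if pid == 0:
        _ = mm[0]                       # fault on the truncated page -> SIGBUS
        os._exit(0)                     # never reached
    _, status = os.waitpid(pid, 0)
    mm.close()
    os.close(fd)
    os.unlink(path)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0

print(fault_after_truncate() == signal.SIGBUS)  # True on Linux
```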
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-12 16:34 ` Jan Kara
@ 2019-02-12 16:55 ` Christopher Lameter
0 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:55 UTC (permalink / raw)
To: Jan Kara
Cc: Jason Gunthorpe, Dan Williams, Matthew Wilcox, Ira Weiny,
Dave Chinner, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Tue, 12 Feb 2019, Jan Kara wrote:
> > Isn't that already racy? If the mmap user is fast enough can't it
> > prevent the page from becoming freed in the first place today?
>
> No, it cannot. We block page faulting for the file (via a lock), tear down
> page tables, free pages and blocks. Then we resume faults and return
> SIGBUS (if the page ends up being after the new end of file in case of
> truncate) or do new page fault and fresh block allocation (which can end
> with SIGBUS if the filesystem cannot allocate new block to back the page).
Well that is already pretty inconsistent behavior. Under what conditions
is the SIGBUS occurring without the new fault attempt?
If a new fault is attempted then we have resource constraints that could
have caused a SIGBUS independently of the truncate. So that case is not
really something special to be considered for truncation.
So the only concern left is to figure out under what conditions SIGBUS
occurs with a racing truncate (if at all) if there are sufficient
resources to complete the page fault.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-12 16:55 ` Christopher Lameter
@ 2019-02-13 15:06 ` Jan Kara
-1 siblings, 0 replies; 155+ messages in thread
From: Jan Kara @ 2019-02-13 15:06 UTC (permalink / raw)
To: Christopher Lameter
Cc: Jan Kara, Jason Gunthorpe, Dan Williams, Matthew Wilcox,
Ira Weiny, Dave Chinner, Doug Ledford, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Tue 12-02-19 16:55:21, Christopher Lameter wrote:
> On Tue, 12 Feb 2019, Jan Kara wrote:
>
> > > Isn't that already racy? If the mmap user is fast enough can't it
> > > prevent the page from becoming freed in the first place today?
> >
> > No, it cannot. We block page faulting for the file (via a lock), tear down
> > page tables, free pages and blocks. Then we resume faults and return
> > SIGBUS (if the page ends up being after the new end of file in case of
> > truncate) or do new page fault and fresh block allocation (which can end
> > with SIGBUS if the filesystem cannot allocate new block to back the page).
>
> Well that is already pretty inconsistent behavior. Under what conditions
> is the SIGBUS occurring without the new fault attempt?
I probably didn't express myself clearly enough. I didn't say that SIGBUS
can occur without a page fault. The evaluation of whether a page would be
beyond EOF, page allocation, and block allocation happen only in response
to a page fault...
> If a new fault is attempted then we have resource constraints that could
> have caused a SIGBUS independently of the truncate. So that case is not
> really something special to be considered for truncation.
Agreed. I was just reacting to Jason's question whether an application
cannot prevent page freeing by being aggressive enough.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 21:02 ` Dan Williams
@ 2019-02-12 16:36 ` Christopher Lameter
-1 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:36 UTC (permalink / raw)
To: Dan Williams
Cc: Jason Gunthorpe, Matthew Wilcox, Ira Weiny, Jan Kara,
Dave Chinner, Doug Ledford, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon, 11 Feb 2019, Dan Williams wrote:
> An mmap write after a fault due to a hole punch is free to trigger
> SIGBUS if the subsequent page allocation fails. So no, I don't see
> them as the same unless you're allowing for the holder of the MR to
> receive a re-fault failure.
Order 0 page allocation failures are generally not possible in that path.
The system will reclaim and OOM before that happens.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-12 16:36 ` Christopher Lameter
@ 2019-02-12 16:44 ` Jan Kara
-1 siblings, 0 replies; 155+ messages in thread
From: Jan Kara @ 2019-02-12 16:44 UTC (permalink / raw)
To: Christopher Lameter
Cc: Dan Williams, Jason Gunthorpe, Matthew Wilcox, Ira Weiny,
Jan Kara, Dave Chinner, Doug Ledford, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Tue 12-02-19 16:36:36, Christopher Lameter wrote:
> On Mon, 11 Feb 2019, Dan Williams wrote:
>
> > An mmap write after a fault due to a hole punch is free to trigger
> > SIGBUS if the subsequent page allocation fails. So no, I don't see
> > them as the same unless you're allowing for the holder of the MR to
> > receive a re-fault failure.
>
> Order 0 page allocation failures are generally not possible in that path.
> System will reclaim and OOM before that happens.
But block allocation can also fail in the filesystem, or you can have memcgs
set up that make the page allocation fail, can't you? So in principle Dan
is right. Page faults can and do fail...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:19 ` Ira Weiny
2019-02-11 18:26 ` Jason Gunthorpe
@ 2019-02-11 21:08 ` Jerome Glisse
2019-02-11 21:22 ` John Hubbard
2 siblings, 0 replies; 155+ messages in thread
From: Jerome Glisse @ 2019-02-11 21:08 UTC (permalink / raw)
To: Ira Weiny
Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
Michal Hocko
On Mon, Feb 11, 2019 at 10:19:22AM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
> >
> > > I honestly don't like the idea that random subsystems can pin down
> > > file blocks as a side effect of gup on the result of mmap. Recall that
> > > it's not just RDMA that wants this guarantee. It seems safer to have
> > > the file be in an explicit block-allocation-immutable-mode so that the
> > > fallocate man page can describe this error case. Otherwise how would
> > > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
> >
> > I rather liked CL's version of this - ftruncate/etc is simply racing
> > with a parallel pwrite - and it doesn't fail.
> >
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing, race in the ftruncate/etc APIs.
>
> I kind of like it as well, except Christopher did not answer my question:
>
> What if user space then writes to the end of the file with a regular write?
> Does that write end up at the point they truncated to or off the end of the
> mmaped area (old length)?
>
> To make this work I think it has to be the latter. And as you say the semantics
> are as if another thread wrote to the file first (but in this case the other
> thread is the RDMA device).
>
> In addition I'm not sure what the overall work is for this case?
>
> John's patches will indicate to the FS that the page is gup pinned. But they
> will not indicate longterm vs not "shortterm". A shortterm pin could be handled
> as a "real truncate". So, are we back to needing a longterm "bit" in struct
> page to indicate a longterm pin and allow the FS to perform this "virtual
> write" after truncate?
>
> Or is it safe to consider all gup pinned pages this way?
So I have been working on several patchsets to convert all users that can
abide by MMU notifiers to HMM mirror, which does not pin pages, i.e. does
not take a reference on the page. So all the leftover GUP users would be
the long term problematic ones, with a few exceptions: direct I/O, KVM (I
think Xen too but I am less familiar with that), and virtio.
For direct I/O I believe the ignore-the-truncate solution would work too.
For KVM and virtio I think they only do GUP on anonymous memory.
So the answer would be that it is safe to consider all pinned pages as
being long term pins.
Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:19 ` Ira Weiny
@ 2019-02-11 21:22 ` John Hubbard
2019-02-11 21:08 ` Jerome Glisse
2019-02-11 21:22 ` John Hubbard
2 siblings, 0 replies; 155+ messages in thread
From: John Hubbard @ 2019-02-11 21:22 UTC (permalink / raw)
To: Ira Weiny, Jason Gunthorpe
Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, Jerome Glisse, Michal Hocko
On 2/11/19 10:19 AM, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
>> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
[...]
> John's patches will indicate to the FS that the page is gup pinned. But they
> will not indicate longterm vs not "shortterm". A shortterm pin could be handled
> as a "real truncate". So, are we back to needing a longterm "bit" in struct
> page to indicate a longterm pin and allow the FS to perform this "virtual
> write" after truncate?
>
> Or is it safe to consider all gup pinned pages this way?
>
> Ira
>
I mentioned this in another thread, but I'm not great at email threading. :)
Anyway, it seems better to just drop the entire "longterm" concept from the
internal APIs, and just deal in "it's either gup-pinned *at the moment*, or
it's not". And let the filesystem respond appropriately. So for a pinned page
that hits clear_page_dirty_for_io or whatever else cares about pinned pages:
-- fire mmu notifiers, revoke leases, generally do everything as if it were a
long term gup pin
-- if it's long term, then you've taken the right actions.
-- if the pin really is short term, everything works great anyway.
The only way that breaks is if longterm pins imply an irreversible action, such
as blocking and waiting in a way that you can't back out of or get interrupted
out of. And the design doesn't seem to be going in that direction, right?
thanks,
--
John Hubbard
NVIDIA
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 21:22 ` John Hubbard
@ 2019-02-11 22:12 ` Jason Gunthorpe
2019-02-11 22:33 ` John Hubbard
-1 siblings, 1 reply; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-11 22:12 UTC (permalink / raw)
To: John Hubbard
Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
linux-rdma, Linux MM, Linux Kernel Mailing List, Jerome Glisse,
Michal Hocko
On Mon, Feb 11, 2019 at 01:22:11PM -0800, John Hubbard wrote:
> The only way that breaks is if longterm pins imply an irreversible action, such
> as blocking and waiting in a way that you can't back out of or get interrupted
> out of. And the design doesn't seem to be going in that direction, right?
RDMA, vfio, etc will always have 'long term' pins that are
irreversible on demand. It is part of the HW capability.
I think the flag is badly named, it is really more of a
GUP_LOCK_PHYSICAL_ADDRESSES flag.
i.e. indicate to the FS that it should not attempt to remap the physical
memory addresses backing this VMA. If the FS can't do that it must
fail.
Short term GUP doesn't need that kind of lock.
Jason
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 22:12 ` Jason Gunthorpe
@ 2019-02-11 22:33 ` John Hubbard
0 siblings, 0 replies; 155+ messages in thread
From: John Hubbard @ 2019-02-11 22:33 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, Matthew Wilcox, lsf-pc,
linux-rdma, Linux MM, Linux Kernel Mailing List, Jerome Glisse,
Michal Hocko
On 2/11/19 2:12 PM, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 01:22:11PM -0800, John Hubbard wrote:
>
>> The only way that breaks is if longterm pins imply an irreversible action, such
>> as blocking and waiting in a way that you can't back out of or get interrupted
>> out of. And the design doesn't seem to be going in that direction, right?
>
> RDMA, vfio, etc will always have 'long term' pins that are
> irreversible on demand. It is part of the HW capability.
>
Yes, I get that about the HW. But I didn't quite phrase it accurately. What I
meant was, irreversible from the kernel code's point of view; specifically,
the filesystem while in various writeback paths.
But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
my head--if we actually go that way, and prevent the caller from setting up
a problematic gup pin in the first place, then that may make this point sort
of moot.
> I think the flag is badly named, it is really more of a
> GUP_LOCK_PHYSICAL_ADDRESSES flag.
>
> i.e. indicate to the FS that it should not attempt to remap the physical
> memory addresses backing this VMA. If the FS can't do that it must
> fail.
>
Yes. Duration is probably less important than the fact that the page
is specially treated.
[1] https://lore.kernel.org/r/20190211102402.GF19029@quack2.suse.cz
thanks,
--
John Hubbard
NVIDIA
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 22:33 ` John Hubbard
@ 2019-02-12 16:39 ` Christopher Lameter
-1 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-12 16:39 UTC (permalink / raw)
To: John Hubbard
Cc: Jason Gunthorpe, Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, Jerome Glisse, Michal Hocko
On Mon, 11 Feb 2019, John Hubbard wrote:
> But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
> my head--if we actually go that way, and prevent the caller from setting up
> a problematic gup pin in the first place, then that may make this point sort
> of moot.
Ok, well, can we document how we think it would work somewhere? Long term
mapping a page cache page could be a problem and we need to explain that
somewhere.
> > i.e. indicate to the FS that it should not attempt to remap the physical
> > memory addresses backing this VMA. If the FS can't do that it must
> > fail.
> >
>
> Yes. Duration is probably less important than the fact that the page
> is specially treated.
Yup.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-12 16:39 ` Christopher Lameter
@ 2019-02-13 2:58 ` John Hubbard
-1 siblings, 0 replies; 155+ messages in thread
From: John Hubbard @ 2019-02-13 2:58 UTC (permalink / raw)
To: Christopher Lameter
Cc: Jason Gunthorpe, Ira Weiny, Dan Williams, Jan Kara, Dave Chinner,
Doug Ledford, Matthew Wilcox, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, Jerome Glisse, Michal Hocko
On 2/12/19 8:39 AM, Christopher Lameter wrote:
> On Mon, 11 Feb 2019, John Hubbard wrote:
>
>> But anyway, Jan's proposal a bit earlier today [1] is finally sinking into
>> my head--if we actually go that way, and prevent the caller from setting up
>> a problematic gup pin in the first place, then that may make this point sort
>> of moot.
>
> Ok well can be document how we think it would work somewhere? Long term
> mapping a page cache page could a problem and we need to explain that
> somewhere.
>
Yes, once the dust settles, I think Documentation/vm/get_user_pages.rst is the
right place. I started to create that file, but someone observed that my initial
content was entirely backward-looking (described the original problem, instead
of describing how the new system would work). So I'll use this opportunity for
a do-over. :)
thanks,
--
John Hubbard
NVIDIA
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:06 ` Jason Gunthorpe
2019-02-11 18:15 ` Dan Williams
2019-02-11 18:19 ` Ira Weiny
@ 2019-02-12 16:28 ` Jan Kara
2019-02-14 20:26 ` Jerome Glisse
3 siblings, 0 replies; 155+ messages in thread
From: Jan Kara @ 2019-02-12 16:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
Doug Ledford, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon 11-02-19 11:06:54, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
The problem is that page pins are not really like pwrite(). They are more
like mmap access, and that will just SIGBUS after truncate. So from the
user's point of view I agree the result may not be that surprising (it
would seem just as if somebody did an additional pwrite) but from the
filesystem's point of view it is very different and it would mean special
handling in lots of places. So I think that locking down the file before
allowing gup_longterm() looks like a more viable alternative.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 18:06 ` Jason Gunthorpe
` (2 preceding siblings ...)
2019-02-12 16:28 ` Jan Kara
@ 2019-02-14 20:26 ` Jerome Glisse
2019-02-14 20:50 ` Matthew Wilcox
3 siblings, 1 reply; 155+ messages in thread
From: Jerome Glisse @ 2019-02-14 20:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Dan Williams, Jan Kara, Dave Chinner, Christopher Lameter,
Doug Ledford, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 09:22:58AM -0800, Dan Williams wrote:
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> I rather liked CL's version of this - ftruncate/etc is simply racing
> with a parallel pwrite - and it doesn't fail.
>
> But it also doesn't truncate/create a hole. Another thread wrote to it
> right away and the 'hole' was essentially instantly reallocated. This
> is an inherent, pre-existing, race in the ftruncate/etc APIs.
So it is kind of a parallel point to this, but direct I/O does "truncate"
pages, or more exactly, after a direct I/O write
invalidate_inode_pages2_range() is called and it will try to unmap and
remove from the page cache all pages that have been written to.
So we probably want to think about what we want to do here if a device
like RDMA has also pinned those pages. Do we want to abort the invalidation?
That would mean the direct I/O write was just a pointless exercise. Do we
want to skip direct I/O and instead memcpy into page cache memory? Then we
are just ignoring the direct I/O property of the write. Or do we want to
both do direct I/O to the block and also memcpy to the page so that we
preserve the direct I/O semantics? I would probably go with the last one.
In any case we will need to update the direct I/O code to handle GUPed
page cache pages.
Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-14 20:26 ` Jerome Glisse
@ 2019-02-14 20:50 ` Matthew Wilcox
2019-02-14 21:39 ` Jerome Glisse
0 siblings, 1 reply; 155+ messages in thread
From: Matthew Wilcox @ 2019-02-14 20:50 UTC (permalink / raw)
To: Jerome Glisse
Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > But it also doesn't truncate/create a hole. Another thread wrote to it
> > right away and the 'hole' was essentially instantly reallocated. This
> > is an inherent, pre-existing, race in the ftruncate/etc APIs.
>
> So it is kind of a // point to this, but direct I/O do "truncate" pages
> or more exactly after a write direct I/O invalidate_inode_pages2_range()
> is call and it will try to unmap and remove from page cache all pages
> that have been written too.
Hang on. Pages are tossed out of the page cache _before_ an O_DIRECT
write starts. The only way what you're describing can happen is if
there's a race between an O_DIRECT writer and an mmap. Which is either
an incredibly badly written application or someone trying an exploit.
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-14 20:50 ` Matthew Wilcox
@ 2019-02-14 21:39 ` Jerome Glisse
2019-02-15 1:19 ` Dave Chinner
0 siblings, 1 reply; 155+ messages in thread
From: Jerome Glisse @ 2019-02-14 21:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jason Gunthorpe, Dan Williams, Jan Kara, Dave Chinner,
Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Thu, Feb 14, 2019 at 12:50:49PM -0800, Matthew Wilcox wrote:
> On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> > On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > > But it also doesn't truncate/create a hole. Another thread wrote to it
> > > right away and the 'hole' was essentially instantly reallocated. This
> > > is an inherent, pre-existing, race in the ftruncate/etc APIs.
> >
> > So it is kind of a parallel point to this, but direct I/O does "truncate"
> > pages, or more exactly, after a direct I/O write
> > invalidate_inode_pages2_range() is called and it will try to unmap and
> > remove from the page cache all pages that have been written to.
>
> Hang on. Pages are tossed out of the page cache _before_ an O_DIRECT
> write starts. The only way what you're describing can happen is if
> there's a race between an O_DIRECT writer and an mmap. Which is either
> an incredibly badly written application or someone trying an exploit.
I believe they are tossed after O_DIRECT starts (dio_complete). But
regardless, the issue is that RDMA can have pinned the page long
before the DIO, in which case the page cannot be tossed from the page
cache and whatever is written to the block device will be discarded
once RDMA unpins the pages. So we would end up in the code path
that spits out a big error message in the kernel log.
Cheers,
Jérôme
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-14 21:39 ` Jerome Glisse
@ 2019-02-15 1:19 ` Dave Chinner
2019-02-15 15:42 ` Christopher Lameter
0 siblings, 1 reply; 155+ messages in thread
From: Dave Chinner @ 2019-02-15 1:19 UTC (permalink / raw)
To: Jerome Glisse
Cc: Matthew Wilcox, Jason Gunthorpe, Dan Williams, Jan Kara,
Christopher Lameter, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Thu, Feb 14, 2019 at 04:39:22PM -0500, Jerome Glisse wrote:
> On Thu, Feb 14, 2019 at 12:50:49PM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 14, 2019 at 03:26:22PM -0500, Jerome Glisse wrote:
> > > On Mon, Feb 11, 2019 at 11:06:54AM -0700, Jason Gunthorpe wrote:
> > > > But it also doesn't truncate/create a hole. Another thread wrote to it
> > > > right away and the 'hole' was essentially instantly reallocated. This
> > > > is an inherent, pre-existing, race in the ftruncate/etc APIs.
> > >
> > > So it is kind of a parallel point to this, but direct I/O does "truncate" pages,
> > > or more exactly after a direct I/O write invalidate_inode_pages2_range()
> > > is called and it will try to unmap and remove from the page cache all pages
> > > that have been written to.
> >
> > Hang on. Pages are tossed out of the page cache _before_ an O_DIRECT
> > write starts. The only way what you're describing can happen is if
> > there's a race between an O_DIRECT writer and an mmap. Which is either
> > an incredibly badly written application or someone trying an exploit.
>
> I believe they are tossed after O_DIRECT starts (dio_complete). But
Yes, but also before. See iomap_dio_rw() and
generic_file_direct_write().
> regardless, the issue is that RDMA can have pinned the page long
> before the DIO, in which case the page cannot be tossed from the page
> cache and whatever is written to the block device will be discarded
> once RDMA unpins the pages. So we would end up in the code path
> that spits out a big error message in the kernel log.
Which tells us filesystem people that the applications are doing
something that _will_ cause data corruption and hence not to spend
any time triaging data corruption reports because it's not a
filesystem bug that caused it.
See open(2):
Applications should avoid mixing O_DIRECT and normal I/O to
the same file, and especially to overlapping byte regions in
the same file. Even when the filesystem correctly handles
the coherency issues in this situation, overall I/O
throughput is likely to be slower than using either mode
alone. Likewise, applications should avoid mixing mmap(2)
of files with direct I/O to the same files.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 1:19 ` Dave Chinner
@ 2019-02-15 15:42 ` Christopher Lameter
0 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-15 15:42 UTC (permalink / raw)
To: Dave Chinner
Cc: Jerome Glisse, Matthew Wilcox, Jason Gunthorpe, Dan Williams,
Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, 15 Feb 2019, Dave Chinner wrote:
> Which tells us filesystem people that the applications are doing
> something that _will_ cause data corruption and hence not to spend
> any time triaging data corruption reports because it's not a
> filesystem bug that caused it.
>
> See open(2):
>
> Applications should avoid mixing O_DIRECT and normal I/O to
> the same file, and especially to overlapping byte regions in
> the same file. Even when the filesystem correctly handles
> the coherency issues in this situation, overall I/O
> throughput is likely to be slower than using either mode
> alone. Likewise, applications should avoid mixing mmap(2)
> of files with direct I/O to the same files.
Since RDMA is something similar: Can we say that a file that is used for
RDMA should not use the page cache?
And can we enforce this in the future? I.e. have some file state that says
that this file is direct/RDMA or contains long-term pinning and thus
allows only certain types of operations to ensure data consistency?
If we cannot enforce it then we may want to spit out a warning?
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 15:42 ` Christopher Lameter
@ 2019-02-15 18:08 ` Matthew Wilcox
2019-02-15 18:31 ` Christopher Lameter
0 siblings, 1 reply; 155+ messages in thread
From: Matthew Wilcox @ 2019-02-15 18:08 UTC (permalink / raw)
To: Christopher Lameter
Cc: Dave Chinner, Jerome Glisse, Jason Gunthorpe, Dan Williams,
Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, Feb 15, 2019 at 03:42:02PM +0000, Christopher Lameter wrote:
> On Fri, 15 Feb 2019, Dave Chinner wrote:
>
> > Which tells us filesystem people that the applications are doing
> > something that _will_ cause data corruption and hence not to spend
> > any time triaging data corruption reports because it's not a
> > filesystem bug that caused it.
> >
> > See open(2):
> >
> > Applications should avoid mixing O_DIRECT and normal I/O to
> > the same file, and especially to overlapping byte regions in
> > the same file. Even when the filesystem correctly handles
> > the coherency issues in this situation, overall I/O
> > throughput is likely to be slower than using either mode
> > alone. Likewise, applications should avoid mixing mmap(2)
> > of files with direct I/O to the same files.
>
> Since RDMA is something similar: Can we say that a file that is used for
> RDMA should not use the page cache?
That makes no sense. The page cache is the standard synchronisation point
for filesystems and processes. The only problems come in for the things
which bypass the page cache like O_DIRECT and DAX.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 18:08 ` Matthew Wilcox
@ 2019-02-15 18:31 ` Christopher Lameter
0 siblings, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-15 18:31 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dave Chinner, Jerome Glisse, Jason Gunthorpe, Dan Williams,
Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> > Since RDMA is something similar: Can we say that a file that is used for
> > RDMA should not use the page cache?
>
> That makes no sense. The page cache is the standard synchronisation point
> for filesystems and processes. The only problems come in for the things
> which bypass the page cache like O_DIRECT and DAX.
It makes a lot of sense since the filesystems play COW etc. games with the
pages and RDMA is very much like O_DIRECT in that the pages are modified
directly under I/O. It also bypasses the page cache, in case you have
not noticed yet.
Both filesystems and RDMA acting on the page cache at
the same time lead to the mess that we are trying to solve.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 18:31 ` Christopher Lameter
(?)
@ 2019-02-15 22:00 ` Jason Gunthorpe
2019-02-15 23:38 ` Ira Weiny
0 siblings, 1 reply; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-15 22:00 UTC (permalink / raw)
To: Christopher Lameter
Cc: Matthew Wilcox, Dave Chinner, Jerome Glisse, Dan Williams,
Jan Kara, Doug Ledford, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> On Fri, 15 Feb 2019, Matthew Wilcox wrote:
>
> > > Since RDMA is something similar: Can we say that a file that is used for
> > > RDMA should not use the page cache?
> >
> > That makes no sense. The page cache is the standard synchronisation point
> > for filesystems and processes. The only problems come in for the things
> > which bypass the page cache like O_DIRECT and DAX.
>
> It makes a lot of sense since the filesystems play COW etc games with the
> pages and RDMA is very much like O_DIRECT in that the pages are modified
> directly under I/O. It also bypasses the page cache in case you have
> not noticed yet.
It is quite different, O_DIRECT modifies the physical blocks on the
storage, bypassing the memory copy. RDMA modifies the memory copy.
Pages are necessary to do RDMA, and those pages have to be flushed to
disk... So I'm not seeing how it can be disconnected from the page
cache?
Jason
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 22:00 ` Jason Gunthorpe
@ 2019-02-15 23:38 ` Ira Weiny
2019-02-16 22:42 ` Dave Chinner
2019-02-17 2:54 ` Christopher Lameter
0 siblings, 2 replies; 155+ messages in thread
From: Ira Weiny @ 2019-02-15 23:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christopher Lameter, Matthew Wilcox, Dave Chinner, Jerome Glisse,
Dan Williams, Jan Kara, Doug Ledford, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, Feb 15, 2019 at 03:00:31PM -0700, Jason Gunthorpe wrote:
> On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> > On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> >
> > > > Since RDMA is something similar: Can we say that a file that is used for
> > > > RDMA should not use the page cache?
> > >
> > > That makes no sense. The page cache is the standard synchronisation point
> > > for filesystems and processes. The only problems come in for the things
> > > which bypass the page cache like O_DIRECT and DAX.
> >
> > It makes a lot of sense since the filesystems play COW etc games with the
> > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > directly under I/O. It also bypasses the page cache in case you have
> > not noticed yet.
>
> It is quite different, O_DIRECT modifies the physical blocks on the
> storage, bypassing the memory copy.
>
Really? I thought O_DIRECT allowed the block drivers to write to/from user
space buffers. But the _storage_ was still under the control of the block
drivers?
>
> RDMA modifies the memory copy.
>
> pages are necessary to do RDMA, and those pages have to be flushed to
> disk.. So I'm not seeing how it can be disconnected from the page
> cache?
I don't disagree with this.
Ira
>
> Jason
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 23:38 ` Ira Weiny
@ 2019-02-16 22:42 ` Dave Chinner
2019-02-17 2:54 ` Christopher Lameter
1 sibling, 0 replies; 155+ messages in thread
From: Dave Chinner @ 2019-02-16 22:42 UTC (permalink / raw)
To: Ira Weiny
Cc: Jason Gunthorpe, Christopher Lameter, Matthew Wilcox,
Jerome Glisse, Dan Williams, Jan Kara, Doug Ledford, lsf-pc,
linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard,
Michal Hocko
On Fri, Feb 15, 2019 at 03:38:29PM -0800, Ira Weiny wrote:
> On Fri, Feb 15, 2019 at 03:00:31PM -0700, Jason Gunthorpe wrote:
> > On Fri, Feb 15, 2019 at 06:31:36PM +0000, Christopher Lameter wrote:
> > > On Fri, 15 Feb 2019, Matthew Wilcox wrote:
> > >
> > > > > Since RDMA is something similar: Can we say that a file that is used for
> > > > > RDMA should not use the page cache?
> > > >
> > > > That makes no sense. The page cache is the standard synchronisation point
> > > > for filesystems and processes. The only problems come in for the things
> > > > which bypass the page cache like O_DIRECT and DAX.
> > >
> > > It makes a lot of sense since the filesystems play COW etc games with the
> > > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > > directly under I/O. It also bypasses the page cache in case you have
> > > not noticed yet.
> >
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
> >
>
> Really? I thought O_DIRECT allowed the block drivers to write to/from user
> space buffers. But the _storage_ was still under the control of the block
> drivers?
Yup, in a nutshell. Even O_DIRECT on DAX doesn't modify the physical
storage directly - it ends up in the pmem driver and it does a
memcpy() to move the data to/from the physical storage and the user
space buffer. It's exactly the same IO path as moving data to/from
the physical storage into the page cache pages....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-15 23:38 ` Ira Weiny
@ 2019-02-17 2:54 ` Christopher Lameter
2019-02-17 2:54 ` Christopher Lameter
1 sibling, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-17 2:54 UTC (permalink / raw)
To: Ira Weiny
Cc: Jason Gunthorpe, Matthew Wilcox, Dave Chinner, Jerome Glisse,
Dan Williams, Jan Kara, Doug Ledford, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Michal Hocko
On Fri, 15 Feb 2019, Ira Weiny wrote:
> > > > for filesystems and processes. The only problems come in for the things
> > > > which bypass the page cache like O_DIRECT and DAX.
> > >
> > > It makes a lot of sense since the filesystems play COW etc games with the
> > > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > > directly under I/O. It also bypasses the page cache in case you have
> > > not noticed yet.
> >
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
> >
>
> Really? I thought O_DIRECT allowed the block drivers to write to/from user
> space buffers. But the _storage_ was still under the control of the block
> drivers?
It depends on what you see as the modification target. O_DIRECT uses
memory as a target and source like RDMA. The block device is at the other
end of the handling.
> > RDMA modifies the memory copy.
> >
> > pages are necessary to do RDMA, and those pages have to be flushed to
> > disk.. So I'm not seeing how it can be disconnected from the page
> > cache?
>
> I don't disagree with this.
RDMA does direct access to memory. If that memory is an mmap of a regular
block device then we have a problem (this has not been a standard use case to my
knowledge). The semantics are simply different. RDMA expects memory to be
pinned and always to be able to read and write from it. The block
device/filesystem expects memory access to be controllable via the page
permissions. In particular, access to a page needs to be able to be stopped.
This is fundamentally incompatible. RDMA access to such an mmapped section
must preserve the RDMA semantics while the pinning is done and can only
provide the access control after RDMA is finished. Pages in the RDMA range
cannot be handled like normal page cache pages.
This is particularly evident in the DAX case, in which we have direct
pass-through even to the storage medium. And in this case write-through can
replace the page cache.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-11 17:22 ` Dan Williams
@ 2019-02-12 16:07 ` Jan Kara
2019-02-12 21:53 ` Dan Williams
0 siblings, 1 reply; 155+ messages in thread
From: Jan Kara @ 2019-02-12 16:07 UTC (permalink / raw)
To: Dan Williams
Cc: Jan Kara, Dave Chinner, Christopher Lameter, Doug Ledford,
Jason Gunthorpe, Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma,
Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Mon 11-02-19 09:22:58, Dan Williams wrote:
> On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > One approach that may be a clean way to solve this:
> > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > > on the longterm pinned range until the long term pin is removed.
> > > > >
> > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > demand during writes?
> > > > >
> > > > > IOWs, this requires the application to set up the file in the
> > > > > correct state for the filesystem to lock it down so somebody else
> > > > > can write to it. That means the file can't be sparse, it can't be
> > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > written to it's full size before being shared because otherwise it
> > > > > exposes stale data to the remote client (secure sites are going to
> > > > > love that!), they can't be extended, etc.
> > > > >
> > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > an immutable for the purposes of local access.
> > > > >
> > > > > Which, essentially we can already do. Prep the file, map it
> > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > interface which can do the necessary checks.
> > > >
> > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > will be a source of reflink? That seems to be currently allowed for
> > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > similarity seems to be quite large there. What do you think?
> > >
> > > This sounds so familiar...
> > >
> > > https://lwn.net/Articles/726481/
> > >
> > > I'm not opposed to trying again, but leases was what crawled out
> > > smoking crater when this last proposal was nuked.
> >
> > Umm, don't think this is that similar to daxctl() discussion. We are not
> > speaking about providing any new userspace API for this.
>
> I thought explicit userspace API was one of the outcomes, i.e. that we
> can't depend on this behavior being an implicit side effect of a page
> pin?
I was thinking of an implicit side effect of the gup_longterm() call, similarly
to how swapon(2) does not require the file to be marked in any special way. But
OTOH I agree that RDMA is a less controlled usage than swapon so it is
questionable. I'd still require something like CAP_LINUX_IMMUTABLE at least
for gup_longterm() calls that end up pinning the file.
Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
will succeed only if there is FL_LAYOUT lease for the range being pinned
and we don't allow the lease to be released until there's a pinned page in
the range. And we make the file protected (i.e. treat it like swapfile) if
there's any such lease in it. But this is just a rough sketch and needs more
thinking.
> > Also I think the
> > situation about leases has somewhat cleared up with this discussion - ODP
> > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > hardware it is difficult to handle leases as such hardware has only one big
> > kill-everything call and using that would effectively mean lot of work on
> > the userspace side to resetup everything to make things useful if workable
> > at all.
> >
> > So my proposal would be:
> >
> > 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> > its teardown when fs needs it.
> >
> > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > to use gup_longterm() (we may actually rename it to a more suitable name).
> > FS may just refuse such calls (for normal page cache backed file, it will
> > just return success but for DAX file it will do sanity checks whether the
> > file is fully allocated etc. like we currently do for swapfiles) but if
> > gup_longterm() returns success, it will provide the same guarantees as for
> > swapfiles. So the only thing that we need is some call from gup_longterm()
> > to a filesystem callback to tell it - this file is going to be used by a
> > third party as an IO buffer, don't touch it. And we can (and should)
> > probably refactor the handling to be shared between swapfiles and
> > gup_longterm().
>
> Yes, lets pursue this. At the risk of "arguing past 'yes'" this is a
> solution I thought we dax folks walked away from in the original
> MAP_DIRECT discussion [1]. Here is where leases were the response to
> MAP_DIRECT [2]. ...and here is where we had tame discussions about
> implications of notifying memory-registrations of lease break events
> [3].
Yeah, thanks for the references.
> I honestly don't like the idea that random subsystems can pin down
> file blocks as a side effect of gup on the result of mmap. Recall that
> it's not just RDMA that wants this guarantee. It seems safer to have
> the file be in an explicit block-allocation-immutable-mode so that the
> fallocate man page can describe this error case. Otherwise how would
> you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
So with requiring lease for gup_longterm() to succeed (and the
FALLOC_FL_PUNCH_HOLE failure being keyed from the existence of such lease),
does it look more reasonable to you?
> [1]: https://lwn.net/Articles/736333/
> [2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
> [3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-12 16:07 ` Jan Kara
@ 2019-02-12 21:53 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-12 21:53 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Tue, Feb 12, 2019 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 11-02-19 09:22:58, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > > One approach that may be a clean way to solve this:
> > > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > > > on the longterm pinned range until the long term pin is removed.
> > > > > >
> > > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > > demand during writes?
> > > > > >
> > > > > > IOWs, this requires the application to set up the file in the
> > > > > > correct state for the filesystem to lock it down so somebody else
> > > > > > can write to it. That means the file can't be sparse, it can't be
> > > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > > written to it's full size before being shared because otherwise it
> > > > > > exposes stale data to the remote client (secure sites are going to
> > > > > > love that!), they can't be extended, etc.
> > > > > >
> > > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > > an immutable for the purposes of local access.
> > > > > >
> > > > > > Which, essentially we can already do. Prep the file, map it
> > > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > > interface which can do the necessary checks.
> > > > >
> > > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > > will be a source of reflink? That seems to be currently allowed for
> > > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > > similarity seems to be quite large there. What do you think?
> > > >
> > > > This sounds so familiar...
> > > >
> > > > https://lwn.net/Articles/726481/
> > > >
> > > > I'm not opposed to trying again, but leases was what crawled out
> > > > smoking crater when this last proposal was nuked.
> > >
> > > Umm, don't think this is that similar to daxctl() discussion. We are not
> > > speaking about providing any new userspace API for this.
> >
> > I thought explicit userspace API was one of the outcomes, i.e. that we
> > can't depend on this behavior being an implicit side effect of a page
> > pin?
>
> I was thinking of an implicit side effect of the gup_longterm() call, similarly
> to how swapon(2) does not require the file to be marked in any special way. But
> OTOH I agree that RDMA is a less controlled usage than swapon so it is
> questionable. I'd still require something like CAP_LINUX_IMMUTABLE at least
> for gup_longterm() calls that end up pinning the file.
>
> Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
> will succeed only if there is FL_LAYOUT lease for the range being pinned
> and we don't allow the lease to be released until there's a pinned page in
> the range. And we make the file protected (i.e. treat it like swapfile) if
> there's any such lease in it. But this is just a rough sketch and needs more
> thinking.
>
> > > Also I think the
> > > situation about leases has somewhat cleared up with this discussion - ODP
> > > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > > hardware it is difficult to handle leases as such hardware has only one big
> > > kill-everything call and using that would effectively mean lot of work on
> > > the userspace side to resetup everything to make things useful if workable
> > > at all.
> > >
> > > So my proposal would be:
> > >
> > > 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> > > its teardown when fs needs it.
> > >
> > > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > > to use gup_longterm() (we may actually rename it to a more suitable name).
> > > FS may just refuse such calls (for normal page cache backed file, it will
> > > just return success but for DAX file it will do sanity checks whether the
> > > file is fully allocated etc. like we currently do for swapfiles) but if
> > > gup_longterm() returns success, it will provide the same guarantees as for
> > > swapfiles. So the only thing that we need is some call from gup_longterm()
> > > to a filesystem callback to tell it - this file is going to be used by a
> > > third party as an IO buffer, don't touch it. And we can (and should)
> > > probably refactor the handling to be shared between swapfiles and
> > > gup_longterm().
> >
> > Yes, lets pursue this. At the risk of "arguing past 'yes'" this is a
> > solution I thought we dax folks walked away from in the original
> > MAP_DIRECT discussion [1]. Here is where leases were the response to
> > MAP_DIRECT [2]. ...and here is where we had tame discussions about
> > implications of notifying memory-registrations of lease break events
> > [3].
>
> Yeah, thanks for the references.
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> So with requiring lease for gup_longterm() to succeed (and the
> FALLOC_FL_PUNCH_HOLE failure being keyed from the existence of such lease),
> does it look more reasonable to you?
That sounds reasonable to me, just the small matter of teaching the
non-ODP RDMA ecosystem to take out FL_LAYOUT leases and do something
reasonable when the lease needs to be recalled.
I would hope that RDMA-to-FSDAX-PMEM support is enough motivation to
either make the necessary application changes, or switch to an
ODP-capable adapter.
Note that I think we need FL_LAYOUT regardless of whether the
legacy-RDMA stack ever takes advantage of it. VFIO device passthrough
to a guest that has a host DAX file mapped as physical PMEM in the
guest needs guarantees that the guest will be killed and DMA force
blocked by the IOMMU if someone punches a hole in memory in use by a
guest, or otherwise have a paravirtualized driver in the guest to
coordinate what effectively looks like a physical memory unplug event.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
@ 2019-02-12 21:53 ` Dan Williams
0 siblings, 0 replies; 155+ messages in thread
From: Dan Williams @ 2019-02-12 21:53 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Tue, Feb 12, 2019 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 11-02-19 09:22:58, Dan Williams wrote:
> > On Mon, Feb 11, 2019 at 2:24 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > > > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > > > One approach that may be a clean way to solve this:
> > > > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > > > on the longterm pinned range until the long term pin is removed.
> > > > > >
> > > > > > So, ummm, how do we do block allocation then, which is done on
> > > > > > demand during writes?
> > > > > >
> > > > > > IOWs, this requires the application to set up the file in the
> > > > > > correct state for the filesystem to lock it down so somebody else
> > > > > > can write to it. That means the file can't be sparse, it can't be
> > > > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > > > written to its full size before being shared because otherwise it
> > > > > > exposes stale data to the remote client (secure sites are going to
> > > > > > love that!), they can't be extended, etc.
> > > > > >
> > > > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > > > an immutable for the purposes of local access.
> > > > > >
> > > > > > Which, essentially we can already do. Prep the file, map it
> > > > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > > > interface which can do the necessary checks.
> > > > >
> > > > > Hum, and what will you do if the immutable file that is target for RDMA
> > > > > will be a source of reflink? That seems to be currently allowed for
> > > > > immutable files but RDMA store would be effectively corrupting the data of
> > > > > the target inode. But we could treat it similarly as swapfiles - those also
> > > > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > > > similarity seems to be quite large there. What do you think?
> > > >
> > > > This sounds so familiar...
> > > >
> > > > https://lwn.net/Articles/726481/
> > > >
> > > > I'm not opposed to trying again, but leases was what crawled out
> > > > smoking crater when this last proposal was nuked.
> > >
> > > Umm, don't think this is that similar to daxctl() discussion. We are not
> > > speaking about providing any new userspace API for this.
> >
> > I thought explicit userspace API was one of the outcomes, i.e. that we
> > can't depend on this behavior being an implicit side effect of a page
> > pin?
>
> I was thinking an implicit side effect of the gup_longterm() call, just as
> swapon(2) does not require the file to be marked in any special way. But
> OTOH I agree that RDMA is a less controlled usage than swapon so it is
> questionable. I'd still require something like CAP_LINUX_IMMUTABLE at least
> for gup_longterm() calls that end up pinning the file.
>
> Inspired by Christoph's idea you reference in [2], maybe gup_longterm()
> will succeed only if there is FL_LAYOUT lease for the range being pinned
> and we don't allow the lease to be released until there's a pinned page in
> the range. And we make the file protected (i.e. treat it like swapfile) if
> there's any such lease in it. But this is just a rough sketch and needs more
> thinking.
>
> > > Also I think the
> > > situation about leases has somewhat cleared up with this discussion - ODP
> > > hardware does not need leases since it can use MMU notifiers, for non-ODP
> > > hardware it is difficult to handle leases as such hardware has only one big
> > > kill-everything call, and using that would effectively mean a lot of work
> > > on the userspace side to set everything up again, if that is workable
> > > at all.
> > >
> > > So my proposal would be:
> > >
> > > 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> > > its teardown when fs needs it.
> > >
> > > 2) Hardware not capable of tearing down pins from MMU notifiers will have
> > > to use gup_longterm() (we may actually rename it to a more suitable name).
> > > FS may just refuse such calls (for normal page cache backed file, it will
> > > just return success but for DAX file it will do sanity checks whether the
> > > file is fully allocated etc. like we currently do for swapfiles) but if
> > > gup_longterm() returns success, it will provide the same guarantees as for
> > > swapfiles. So the only thing that we need is some call from gup_longterm()
> > > to a filesystem callback to tell it - this file is going to be used by a
> > > third party as an IO buffer, don't touch it. And we can (and should)
> > > probably refactor the handling to be shared between swapfiles and
> > > gup_longterm().
> >
> > Yes, let's pursue this. At the risk of "arguing past 'yes'" this is a
> > solution I thought we dax folks walked away from in the original
> > MAP_DIRECT discussion [1]. Here is where leases were the response to
> > MAP_DIRECT [2]. ...and here is where we had tame discussions about
> > implications of notifying memory-registrations of lease break events
> > [3].
>
> Yeah, thanks for the references.
>
> > I honestly don't like the idea that random subsystems can pin down
> > file blocks as a side effect of gup on the result of mmap. Recall that
> > it's not just RDMA that wants this guarantee. It seems safer to have
> > the file be in an explicit block-allocation-immutable-mode so that the
> > fallocate man page can describe this error case. Otherwise how would
> > you describe the scenarios under which FALLOC_FL_PUNCH_HOLE fails?
>
> So with requiring lease for gup_longterm() to succeed (and the
> FALLOC_FL_PUNCH_HOLE failure being keyed from the existence of such lease),
> does it look more reasonable to you?
That sounds reasonable to me, just the small matter of teaching the
non-ODP RDMA ecosystem to take out FL_LAYOUT leases and do something
reasonable when the lease needs to be recalled.
I would hope that RDMA-to-FSDAX-PMEM support is enough motivation to
either make the necessary application changes, or switch to an
ODP-capable adapter.
Note that I think we need FL_LAYOUT regardless of whether the
legacy-RDMA stack ever takes advantage of it. VFIO device passthrough
to a guest that has a host DAX file mapped as physical PMEM in the
guest needs guarantees that the guest will be killed and DMA forcibly
blocked by the IOMMU if someone punches a hole in memory in use by a
guest, or otherwise have a paravirtualized driver in the guest to
coordinate what effectively looks like a physical memory unplug event.
^ permalink raw reply [flat|nested] 155+ messages in thread
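Jan's rough sketch in the exchange above (gup_longterm() gated on an FL_LAYOUT lease, the lease unreleasable while pins remain, FALLOC_FL_PUNCH_HOLE failing while the lease is held) can be expressed as a small userspace toy model. Everything here is illustrative: the class, helper names, and error-code choices are invented for the sketch and are not kernel code.

```python
import errno

class PinnedDaxFileModel:
    """Toy model of the lease-gated long-term-pin policy (not kernel code)."""

    def __init__(self, size):
        self.size = size
        self.layout_leases = []   # list of (start, end) half-open ranges
        self.pins = {}            # (start, end) -> pin count

    def take_layout_lease(self, start, end):
        self.layout_leases.append((start, end))

    def release_layout_lease(self, start, end):
        # A lease covering a long-term pin may not be released.
        for (ps, pe), count in self.pins.items():
            if count and ps < end and start < pe:
                return -errno.EBUSY
        self.layout_leases.remove((start, end))
        return 0

    def gup_longterm(self, start, end, has_cap_linux_immutable=True):
        # Privilege check from Jan's sketch (error codes illustrative).
        if not has_cap_linux_immutable:
            return -errno.EPERM
        # Succeed only if an FL_LAYOUT lease covers the pinned range.
        if not any(ls <= start and end <= le
                   for (ls, le) in self.layout_leases):
            return -errno.EAGAIN
        self.pins[(start, end)] = self.pins.get((start, end), 0) + 1
        return 0

    def put_longterm(self, start, end):
        self.pins[(start, end)] -= 1

    def punch_hole(self, start, end):
        # Keyed off the existence of a lease over the range, as in the
        # FALLOC_FL_PUNCH_HOLE discussion above.
        for (ls, le) in self.layout_leases:
            if ls < end and start < le:
                return -errno.EBUSY
        return 0
```

In this model a pin without a lease fails, punching a hole in a leased range returns -EBUSY, and the lease can only be dropped once the last pin is gone, which matches the ordering Jan proposes.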
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-08 11:10 ` Jan Kara
2019-02-08 20:50 ` Dan Williams
@ 2019-02-08 21:20 ` Dave Chinner
1 sibling, 0 replies; 155+ messages in thread
From: Dave Chinner @ 2019-02-08 21:20 UTC (permalink / raw)
To: Jan Kara
Cc: Christopher Lameter, Doug Ledford, Dan Williams, Jason Gunthorpe,
Matthew Wilcox, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Fri, Feb 08, 2019 at 12:10:28PM +0100, Jan Kara wrote:
> On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > One approach that may be a clean way to solve this:
> > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > on the longterm pinned range until the long term pin is removed.
> >
> > So, ummm, how do we do block allocation then, which is done on
> > demand during writes?
> >
> > IOWs, this requires the application to set up the file in the
> > correct state for the filesystem to lock it down so somebody else
> > can write to it. That means the file can't be sparse, it can't be
> > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > written to its full size before being shared because otherwise it
> > exposes stale data to the remote client (secure sites are going to
> > love that!), they can't be extended, etc.
> >
> > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > an immutable for the purposes of local access.
> >
> > Which, essentially we can already do. Prep the file, map it
> > read/write, mark it immutable, then pin it via the longterm gup
> > interface which can do the necessary checks.
>
> Hum, and what will you do if the immutable file that is target for RDMA
> will be a source of reflink?
I think we'd have to disallow it. reflink does actually change the
source inode on XFS (adds an inode flag to say it has shared
extents)...
Similarly, we'd have to make sure the inode is pinned in memory
by the gup_longterm operation, not just have its pages pinned...
> That seems to be currently allowed for
> immutable files but RDMA store would be effectively corrupting the data of
> the target inode. But we could treat it similarly as swapfiles - those also
> have to deal with writes to blocks beyond filesystem control. In fact the
> similarity seems to be quite large there. What do you think?
Yes, swapfiles are probably a better analogy as the mm subsystem
pins them, maps them checking the layout is appropriate (i.e. no
holes) and then writes straight through them without the filesystem
being aware of the IO....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 155+ messages in thread
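The swapfile precondition Dave mentions, that the layout must be appropriate, i.e. no holes, can be probed from userspace with lseek(2)'s SEEK_HOLE flag, which is a real Linux interface. The helper below is only an illustration of that check, not what the kernel's swapfile setup does verbatim:

```python
import os

def file_fully_allocated(path):
    """Return True if the file contains no holes (every byte is backed).

    Uses SEEK_HOLE: in a file with no holes, the first hole found from
    offset 0 is the implicit hole at end-of-file, so the returned
    offset equals the file size.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return False
        return os.lseek(fd, 0, os.SEEK_HOLE) == size
    finally:
        os.close(fd)
```

A sparse file created with truncate() reports a hole before EOF and fails the check, for the same reason a sparse file cannot be used as a swapfile.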
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-08 4:43 ` Dave Chinner
@ 2019-02-08 15:33 ` Christopher Lameter
1 sibling, 0 replies; 155+ messages in thread
From: Christopher Lameter @ 2019-02-08 15:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Doug Ledford, Dan Williams, Jason Gunthorpe, Matthew Wilcox,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Fri, 8 Feb 2019, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > One approach that may be a clean way to solve this:
> > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > on the longterm pinned range until the long term pin is removed.
>
> So, ummm, how do we do block allocation then, which is done on
> demand during writes?
If a memory region is mapped by RDMA then this is essentially a long-lived
write. The allocation needs to happen at map time.
> IOWs, this requires the application to set up the file in the
> correct state for the filesystem to lock it down so somebody else
> can write to it. That means the file can't be sparse, it can't be
> preallocated (i.e. can't contain unwritten extents), it must have zeroes
> written to its full size before being shared because otherwise it
> exposes stale data to the remote client (secure sites are going to
> love that!), they can't be extended, etc.
Yes. That is required.
> IOWs, once the file is prepped and leased out for RDMA, it becomes
> an immutable for the purposes of local access.
The contents are mutable but the mapping to the physical medium is
immutable.
> Which, essentially we can already do. Prep the file, map it
> read/write, mark it immutable, then pin it via the longterm gup
> interface which can do the necessary checks.
>
> Simple to implement, the reasons for errors trying to modify the
> file are already documented and queriable, and it's hard for
> applications to get wrong.
Yup. Why not do it this way? Just make the sections that are long-term GUP
mapped immutable and not subject to the other page cache things.
This is basically a straight through bypass of the page cache for a file.
HEY! It may be used to map huge pages in the future too!!
^ permalink raw reply [flat|nested] 155+ messages in thread
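The preparation both sides of this exchange converge on, allocating every block up front and writing real zeroes so no stale data leaks, before marking the file immutable and pinning it, might look like this from userspace. The function name is invented for the sketch, and the privileged immutable-flag step is shown only as a comment because it requires CAP_LINUX_IMMUTABLE:

```python
import os

def prep_file_for_longterm_pin(path, size):
    """Fully allocate and zero-fill a file before it is leased out.

    Illustrative sketch of the flow discussed above, not a real API.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve all blocks up front so no allocation has to happen
        # later, under the RDMA mapping.
        os.posix_fallocate(fd, 0, size)
        # fallocate alone leaves extents marked unwritten; write real
        # zeroes so nothing stale can be exposed to the remote client.
        chunk = b"\0" * 65536
        written = 0
        while written < size:
            written += os.write(fd, chunk[: size - written])
        os.fsync(fd)
        # Remaining (privileged) steps in the proposed flow:
        #   ioctl(fd, FS_IOC_SETFLAGS, flags | FS_IMMUTABLE_FL)
        #   then mmap and register the region for the long-term pin.
    finally:
        os.close(fd)
```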
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 16:25 ` Doug Ledford
@ 2019-02-07 17:24 ` Matthew Wilcox
2019-02-07 17:26 ` Jason Gunthorpe
-1 siblings, 1 reply; 155+ messages in thread
From: Matthew Wilcox @ 2019-02-07 17:24 UTC (permalink / raw)
To: Doug Ledford
Cc: Dan Williams, Jason Gunthorpe, Dave Chinner, Christopher Lameter,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Thu, Feb 07, 2019 at 11:25:35AM -0500, Doug Ledford wrote:
> * Really though, as I said in my email to Tom Talpey, this entire
> situation is simply screaming that we are doing DAX networking wrong.
> We shouldn't be writing the networking code once in every single
> application that wants to do this. If we had a memory segment that we
> shared from server to client(s), and in that memory segment we
> implemented a clustered filesystem, then applications would simply mmap
> local files and be done with it. If the file needed to move, the kernel
> would update the mmap in the application, done. If you ask me, it is
> the attempt to do this the wrong way that is resulting in all this
> heartache. That said, for today, my recommendation would be to require
> ODP hardware for XFS filesystem with the DAX option, but allow ext2
> filesystems to mount DAX filesystems on non-ODP hardware, and go in and
> modify the ext2 filesystem so that on DAX mounts, it disables hole punch
> and ftruncate any time they would result in the forced removal of an
> established mmap.
I agree that something's wrong, but I think the fundamental problem is
that there's no concept in RDMA of having an STag for storage rather
than for memory.
Imagine if we could associate an STag with a file descriptor on the
server. The client could then perform an RDMA to that STag. On the
server, we'd need lots of smarts in the card and in the OS to know how
to treat that packet on arrival -- depending on what the file descriptor
referred to, it might only have to write into the page cache, or it
might set up an NVMe DMA, or it might resolve the underlying physical
address and DMA directly to an NV-DIMM.
^ permalink raw reply [flat|nested] 155+ messages in thread
* Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
2019-02-07 17:24 ` Matthew Wilcox
@ 2019-02-07 17:26 ` Jason Gunthorpe
0 siblings, 0 replies; 155+ messages in thread
From: Jason Gunthorpe @ 2019-02-07 17:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Doug Ledford, Dan Williams, Dave Chinner, Christopher Lameter,
Jan Kara, Ira Weiny, lsf-pc, linux-rdma, Linux MM,
Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
Michal Hocko
On Thu, Feb 07, 2019 at 09:24:05AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 07, 2019 at 11:25:35AM -0500, Doug Ledford wrote:
> > * Really though, as I said in my email to Tom Talpey, this entire
> > situation is simply screaming that we are doing DAX networking wrong.
> > We shouldn't be writing the networking code once in every single
> > application that wants to do this. If we had a memory segment that we
> > shared from server to client(s), and in that memory segment we
> > implemented a clustered filesystem, then applications would simply mmap
> > local files and be done with it. If the file needed to move, the kernel
> > would update the mmap in the application, done. If you ask me, it is
> > the attempt to do this the wrong way that is resulting in all this
> > heartache. That said, for today, my recommendation would be to require
> > ODP hardware for XFS filesystem with the DAX option, but allow ext2
> > filesystems to mount DAX filesystems on non-ODP hardware, and go in and
> > modify the ext2 filesystem so that on DAX mounts, it disables hole punch
> > and ftruncate any time they would result in the forced removal of an
> > established mmap.
>
> I agree that something's wrong, but I think the fundamental problem is
> that there's no concept in RDMA of having an STag for storage rather
> than for memory.
>
> Imagine if we could associate an STag with a file descriptor on the
> server. The client could then perform an RDMA to that STag. On the
> server, we'd need lots of smarts in the card and in the OS to know how
> to treat that packet on arrival -- depending on what the file descriptor
> referred to, it might only have to write into the page cache, or it
> might set up an NVMe DMA, or it might resolve the underlying physical
> address and DMA directly to an NV-DIMM.
I think you just described ODP MRs.
Jason
^ permalink raw reply [flat|nested] 155+ messages in thread