* [LSFMM] RDMA data corruption potential during FS writeback
@ 2018-05-18 14:37 Christopher Lameter
  2018-05-18 15:49 ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Christopher Lameter @ 2018-05-18 14:37 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-mm, Michal Hocko, Jason Gunthorpe

There was a session at the Linux Filesystem and Memory Management summit
on the issues caused by devices that use get_user_pages() or elevated
refcounts to pin pages and then do I/O on them.

See https://lwn.net/Articles/753027/

Basically, filesystems need to mark pages read-only during writeback.
Concurrent DMA into a page while the filesystem is writing it out can
result in corrupted data on disk, incorrect checksums, and so on.

The solution proposed at the meeting was to use mmu notifiers: callbacks
into the RDMA driver would ensure that the RDMA device is not writing to
a page while the filesystem is writing it back.
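
Roughly, on the driver side that could look like the sketch below (names
like my_mr and my_hca_invalidate_pages are made up, and the callback
signatures only approximate the current ones):

    #include <linux/kernel.h>
    #include <linux/mmu_notifier.h>

    struct my_mr {                      /* hypothetical per-MR driver state */
            struct mmu_notifier mn;
            /* ... HCA mapping state ... */
    };

    static void my_hca_invalidate_pages(struct my_mr *mr,
                                        unsigned long start, unsigned long end)
    {
            /* driver-specific: fence DMA and drop HW mappings for the range */
    }

    static void my_mr_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
    {
            struct my_mr *mr = container_of(mn, struct my_mr, mn);

            /* Stop the device touching [start, end) before the core mm or
             * the filesystem does anything with those pages. */
            my_hca_invalidate_pages(mr, start, end);
    }

    static const struct mmu_notifier_ops my_mr_mn_ops = {
            .invalidate_range_start = my_mr_invalidate_range_start,
    };

    static int my_mr_hook_notifier(struct my_mr *mr, struct mm_struct *mm)
    {
            mr->mn.ops = &my_mr_mn_ops;
            return mmu_notifier_register(&mr->mn, mm);
    }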

This issue has been around for a long time and, so far, it does not seem
to have caused too much grief. Doing I/O to two devices from the same
memory location is inherently a bit inconsistent in itself.

But could we do more to prevent issues here? One thing that may be useful
is to disallow memory registrations of writable file-backed mappings
unless the device driver provides mmu notifier callbacks, or something
along those lines.

There is also the longstanding issue of refcounts being held for long
periods of time. If we require mmu notifier callbacks then we may as well
go to on-demand paging mode for RDMA memory registrations. That avoids
elevating the refcounts long term and gives memory management easy access
control and page removal.

There may be even more issues if DAX is being used, but the FS writeback
problem has the potential to bite anyone at this point, it seems.

I think we need to put some thought into these issues, and we need some
coordination between the RDMA developers and the memory management folks.
RDMA is becoming more and more important, so issues like this are only
going to become more pressing.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-18 14:37 [LSFMM] RDMA data corruption potential during FS writeback Christopher Lameter
@ 2018-05-18 15:49 ` Jason Gunthorpe
  2018-05-18 16:47   ` Christopher Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2018-05-18 15:49 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: linux-rdma, linux-mm, Michal Hocko

On Fri, May 18, 2018 at 02:37:52PM +0000, Christopher Lameter wrote:
> There was a session at the Linux Filesystem and Memory Management summit
> on issues that are caused by devices using get_user_pages() or elevated
> refcounts to pin pages and then do I/O on them.
> 
> See https://lwn.net/Articles/753027/
> 
> Basically filesystems need to mark the pages readonly during writeback.
> Concurrent DMA into the page while it is written by a filesystem can cause
> corrupted data being written to the disk, cause incorrect checksums etc
> etc.
> 
> The solution that was proposed at the meeting was that mmu notifiers can
> remedy that situation by allowing callbacks to the RDMA device to ensure
> that the RDMA device and the filesystem do not do concurrent writeback.

This keeps coming up, and I understand why it seems appealing from the
MM side, but the reality is that very little RDMA hardware supports
this, and it carries a fairly big performance penalty, so many users
don't like using it.

> This issue has been around for a long time and so far not caused too much
> grief it seems. Doing I/O to two devices from the same memory location is
> naturally a bit inconsistent in itself.

Well, I've seen various reports of filesystems oopsing and whatnot, as
the LWN article points out. So it is breaking stuff for some users.

> But could we do more to prevent issues here? I think what may be useful is
> to not allow the memory registrations of file back writable mappings
> unless the device driver provides mmu callbacks or something like that.

Why does every proposed solution to this involve crippling RDMA? Are
there really no ideas for letting the FS side accommodate this use
case??

> There may even be more issues if DAX is being used but the FS writeback
> has the potential of biting anyone at this point it seems.

I think Dan already 'solved' this via get_user_pages_longterm(), which
simply fails for DAX-backed pages.
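
Roughly, the driver-side usage is the sketch below (not taken from any
real driver; the exact error you get back for a DAX mapping is from
memory, so check the gup code):

    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Pin user pages for a long-lived registration via the DAX-aware
     * helper.  Caller provides addr/npages/pages. */
    static long pin_user_buffer(unsigned long addr, unsigned long npages,
                                bool writable, struct page **pages)
    {
            long ret;

            down_read(&current->mm->mmap_sem);
            ret = get_user_pages_longterm(addr, npages,
                                          writable ? FOLL_WRITE : 0,
                                          pages, NULL);
            up_read(&current->mm->mmap_sem);

            /* For DAX/file-backed mappings that cannot safely be pinned
             * long term, this returns an error instead of pages. */
            return ret;
    }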

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-18 15:49 ` Jason Gunthorpe
@ 2018-05-18 16:47   ` Christopher Lameter
  2018-05-18 17:36     ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Christopher Lameter @ 2018-05-18 16:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Michal Hocko, Dan Williams

On Fri, 18 May 2018, Jason Gunthorpe wrote:

> > The solution that was proposed at the meeting was that mmu notifiers can
> > remedy that situation by allowing callbacks to the RDMA device to ensure
> > that the RDMA device and the filesystem do not do concurrent writeback.
>
> This keeps coming up, and I understand why it seems appealing from the
> MM side, but the reality is that very little RDMA hardware supports
> this, and it carries with it a fairly big performance penalty so many
> users don't like using it.

Ok so we have a latent data corruption issue that is not being addressed.

> > But could we do more to prevent issues here? I think what may be useful is
> > to not allow the memory registrations of file back writable mappings
> > unless the device driver provides mmu callbacks or something like that.
>
> Why does every proposed solution to this involve crippling RDMA? Are
> there really no ideas no ideas to allow the FS side to accommodate
> this use case??

The newcomer here is RDMA. The FS side is the mainstream use case and has
been there since Unix learned to do paging.

> > There may even be more issues if DAX is being used but the FS writeback
> > has the potential of biting anyone at this point it seems.
>
> I think Dan already 'solved' this via get_user_pages_longterm which
> just fails for DAX backed pages.

That is indeed crippling, and it would kill the ideas we have had around
here for using DAX. There needs to be a way to shove large amounts of
data into memory via RDMA and from there onto disk without too much fuss
and without copying. In the DAX case this should trivially avoid the copy
to disk, since the data is already in memory. If this does not work then
the whole thing is no longer all that performant, since it requires a
copy operation.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-18 16:47   ` Christopher Lameter
@ 2018-05-18 17:36     ` Jason Gunthorpe
  2018-05-18 20:23       ` Dan Williams
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2018-05-18 17:36 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: linux-rdma, linux-mm, Michal Hocko, Dan Williams

On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
> On Fri, 18 May 2018, Jason Gunthorpe wrote:
> 
> > > The solution that was proposed at the meeting was that mmu notifiers can
> > > remedy that situation by allowing callbacks to the RDMA device to ensure
> > > that the RDMA device and the filesystem do not do concurrent writeback.
> >
> > This keeps coming up, and I understand why it seems appealing from the
> > MM side, but the reality is that very little RDMA hardware supports
> > this, and it carries with it a fairly big performance penalty so many
> > users don't like using it.
> 
> Ok so we have a latent data corruption issue that is not being addressed.
> 
> > > But could we do more to prevent issues here? I think what may be useful is
> > > to not allow the memory registrations of file back writable mappings
> > > unless the device driver provides mmu callbacks or something like that.
> >
> > Why does every proposed solution to this involve crippling RDMA? Are
> > there really no ideas no ideas to allow the FS side to accommodate
> > this use case??
> 
> The newcomer here is RDMA. The FS side is the mainstream use case and has
> been there since Unix learned to do paging.

Well, it has been this way for 12 years, so it isn't that new.

Honestly it sounds like get_user_pages is just a broken Linux
API??

Nothing can use it to write to pages because the FS could explode.
RDMA makes it particularly easy to trigger this due to its longer time
windows, but presumably any get_user_pages() user could generate the race
and hit this? Is that right?

I am left with the impression that solving it in the FS is too costly
performance-wise, so the FS side doesn't want that overhead? Was that
also the conclusion?

Could we take another crack at this during Linux Plumbers? Will the MM
parties be there too? I'm sorry I wasn't able to attend LSFMM this
year!

> > > There may even be more issues if DAX is being used but the FS writeback
> > > has the potential of biting anyone at this point it seems.
> >
> > I think Dan already 'solved' this via get_user_pages_longterm which
> > just fails for DAX backed pages.
> 
> That is indeed crippling and would be killing the ideas that we had around
> here for using DAX. There needs to be an ability to shove large amounts of
> data into memory via RDMA and from there onto a disk without too much of a
> fuss and without copying. In the case of DAX this trivially should avoid
> the copying to disk since its already in memory. If this does not work
> then the whole thing is really not that high performant anymore since it
> requires a copy operation.

AFAIK, if you enable ODP on your MR then DAX will work as you want,
but you accept lower network performance to get it. You might be the
first person to test this, though ;)
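
For reference, the userspace side of that is roughly the sketch below
against libibverbs (ctx/pd/buf/len are whatever your app already set up;
the capability check is from memory, so verify it against your verbs
headers):

    #include <infiniband/verbs.h>

    /* Register an on-demand-paging (ODP) MR, so the HCA faults pages in
     * rather than requiring them to be pinned up front. */
    static struct ibv_mr *reg_odp_mr(struct ibv_context *ctx, struct ibv_pd *pd,
                                     void *buf, size_t len)
    {
            struct ibv_device_attr_ex attr = {};

            /* Bail out if the HCA/driver does not advertise ODP at all. */
            if (ibv_query_device_ex(ctx, NULL, &attr) ||
                !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
                    return NULL;

            /* IBV_ACCESS_ON_DEMAND asks for a faultable MR instead of a
             * pinned one -- this is the performance/safety trade-off. */
            return ibv_reg_mr(pd, buf, len,
                              IBV_ACCESS_LOCAL_WRITE |
                              IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_ON_DEMAND);
    }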

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-18 17:36     ` Jason Gunthorpe
@ 2018-05-18 20:23       ` Dan Williams
  2018-05-19  2:33         ` John Hubbard
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Williams @ 2018-05-18 20:23 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
>>
>> > > The solution that was proposed at the meeting was that mmu notifiers can
>> > > remedy that situation by allowing callbacks to the RDMA device to ensure
>> > > that the RDMA device and the filesystem do not do concurrent writeback.
>> >
>> > This keeps coming up, and I understand why it seems appealing from the
>> > MM side, but the reality is that very little RDMA hardware supports
>> > this, and it carries with it a fairly big performance penalty so many
>> > users don't like using it.
>>
>> Ok so we have a latent data corruption issue that is not being addressed.
>>
>> > > But could we do more to prevent issues here? I think what may be useful is
>> > > to not allow the memory registrations of file back writable mappings
>> > > unless the device driver provides mmu callbacks or something like that.
>> >
>> > Why does every proposed solution to this involve crippling RDMA? Are
>> > there really no ideas no ideas to allow the FS side to accommodate
>> > this use case??
>>
>> The newcomer here is RDMA. The FS side is the mainstream use case and has
>> been there since Unix learned to do paging.
>
> Well, it has been this way for 12 years, so it isn't that new.
>
> Honestly it sounds like get_user_pages is just a broken Linux
> API??
>
> Nothing can use it to write to pages because the FS could explode -
> RDMA makes it particularly easy to trigger this due to the longer time
> windows, but presumably any get_user_pages could generate a race and
> hit this? Is that right?
>
> I am left with the impression that solving it in the FS is too
> performance costly so FS doesn't want that overheard? Was that also
> the conclusion?
>
> Could we take another crack at this during Linux Plumbers? Will the MM
> parties be there too? I'm sorry I wasn't able to attend LSFMM this
> year!

Yes, you and hch were missed, and I had to skip the last day due to a
family emergency.

Plumbers sounds good for resyncing on this topic, but we already have a
plan: use "break_layouts()" to coordinate a filesystem's need to move
DAX blocks around relative to an active RDMA memory registration. If
you never punch a hole in the middle of your RDMA registration then
you never incur any performance penalty. Otherwise the layout-break
notification is just there to tell the application "hey man, talk to
your friend who punched a hole in the middle of your mapping; the
filesystem wants this block back now. Sorry, I'm kicking you out. Ok,
bye.".

In other words, get_user_pages_longterm() is just a short-term
band-aid for RDMA until we can get that infrastructure built. We don't
need to go down any mmu-notifier rabbit holes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-18 20:23       ` Dan Williams
@ 2018-05-19  2:33         ` John Hubbard
  2018-05-19  3:24           ` Jason Gunthorpe
  2018-05-21 13:59           ` Christopher Lameter
  0 siblings, 2 replies; 13+ messages in thread
From: John Hubbard @ 2018-05-19  2:33 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/18/2018 01:23 PM, Dan Williams wrote:
> On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
>>>
---8<---------------------------------
>>>
>>> The newcomer here is RDMA. The FS side is the mainstream use case and has
>>> been there since Unix learned to do paging.
>>
>> Well, it has been this way for 12 years, so it isn't that new.
>>
>> Honestly it sounds like get_user_pages is just a broken Linux
>> API??
>>
>> Nothing can use it to write to pages because the FS could explode -
>> RDMA makes it particularly easy to trigger this due to the longer time
>> windows, but presumably any get_user_pages could generate a race and
>> hit this? Is that right?

+1, and I am now super-interested in this conversation, because
after tracking down a kernel BUG to this classic mistaken pattern:

    get_user_pages (on file-backed memory from ext4)
    ...do some DMA
    set_pages_dirty
    put_page(s)

...there is (rarely!) a backtrace from ext4 that disavows ownership of
any such pages. It happens rarely enough that, from what I can tell,
people have come to believe the pattern is OK. But some new, cutting-edge
systems with zillions of threads and lots of memory are able to expose
the problem.
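
Spelled out with the actual kernel calls, the pattern is roughly the
sketch below (not code from the driver in question; error handling
trimmed):

    #include <linux/mm.h>
    #include <linux/sched.h>

    static long pin_for_dma(unsigned long addr, unsigned long npages,
                            struct page **pages)
    {
            long ret;

            down_read(&current->mm->mmap_sem);
            ret = get_user_pages(addr, npages, FOLL_WRITE, pages, NULL);
            up_read(&current->mm->mmap_sem);
            return ret;
    }

    static void release_after_dma(struct page **pages, unsigned long npages)
    {
            unsigned long i;

            /* Between pin_for_dma() and here, the filesystem is free to
             * pick these same pages for writeback, write-protect the PTEs
             * and start cleaning them -- it has no idea the device still
             * owns the pages, which is where the corruption and the ext4
             * complaints come from. */
            for (i = 0; i < npages; i++) {
                    set_page_dirty_lock(pages[i]);
                    put_page(pages[i]);
            }
    }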

Anyway, I've been dividing my time between trying to prove exactly 
which FS action is disconnecting the page from ext4 in this particular
bug (even though it's lately becoming well-documented that the design itself
is not correct), and casting about for the most proper place to fix this. 

Because the obvious "fix" in device driver land is to use a dedicated
buffer for DMA, and copy to the filesystem buffer, and of course I will
get *killed* if I propose such a performance-killing approach. But a
core kernel fix really is starting to sound attractive.

>>
>> I am left with the impression that solving it in the FS is too
>> performance costly so FS doesn't want that overheard? Was that also
>> the conclusion?
>>
>> Could we take another crack at this during Linux Plumbers? Will the MM
>> parties be there too? I'm sorry I wasn't able to attend LSFMM this
>> year!
> 
> Yes, you and hch were missed, and I had to skip the last day due to a
> family emergency.
> 
> Plumbers sounds good to resync on this topic, but we already have a
> plan, use "break_layouts()" to coordinate a filesystem's need to move
> dax blocks around relative to an active RDMA memory registration. If
> you never punch a hole in the middle of your RDMA registration then
> you never incur any performance penalty. Otherwise the layout break
> notification is just there to tell the application "hey man, talk to
> your friend that punched a hole in the middle of your mapping, but the
> filesystem wants this block back now. Sorry, I'm kicking you out. Ok,
> bye.".
> 
> In other words, get_user_pages_longterm() is just a short term
> band-aid for RDMA until we can get that infrastructure built. We don't
> need to go down any mmu-notifier rabbit holes.
> 

git grep claims that break_layouts is so far an XFS-only feature, though. 
Were there plans to fix this for all filesystems?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  2:33         ` John Hubbard
@ 2018-05-19  3:24           ` Jason Gunthorpe
  2018-05-19  3:51             ` Dan Williams
  2018-05-21 13:37             ` Christopher Lameter
  2018-05-21 13:59           ` Christopher Lameter
  1 sibling, 2 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2018-05-19  3:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 07:33:41PM -0700, John Hubbard wrote:
> On 05/18/2018 01:23 PM, Dan Williams wrote:
> > On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
> >>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
> >>>
> >>>
> >>> The newcomer here is RDMA. The FS side is the mainstream use case and has
> >>> been there since Unix learned to do paging.
> >>
> >> Well, it has been this way for 12 years, so it isn't that new.
> >>
> >> Honestly it sounds like get_user_pages is just a broken Linux
> >> API??
> >>
> >> Nothing can use it to write to pages because the FS could explode -
> >> RDMA makes it particularly easy to trigger this due to the longer time
> >> windows, but presumably any get_user_pages could generate a race and
> >> hit this? Is that right?
> 
> +1, and I am now super-interested in this conversation, because
> after tracking down a kernel BUG to this classic mistaken pattern:
> 
>     get_user_pages (on file-backed memory from ext4)
>     ...do some DMA
>     set_pages_dirty
>     put_page(s)

Ummm, RDMA has done essentially that since 2005; since when did it
become wrong? Do you have some references? Is there some alternative?

See __ib_umem_release
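
Its release path boils down to roughly the loop below (paraphrased from
memory, not a verbatim copy of drivers/infiniband/core/umem.c; the
wrapper name is mine):

    #include <rdma/ib_umem.h>

    static void umem_release_pages(struct ib_umem *umem, int dirty)
    {
            struct scatterlist *sg;
            int i;

            for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
                    struct page *page = sg_page(sg);

                    /* Dirty the page if it was registered writable, then
                     * drop the reference taken by get_user_pages(). */
                    if (umem->writable && dirty)
                            set_page_dirty_lock(page);
                    put_page(page);
            }
    }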

> ...there is (rarely!) a backtrace from ext4, that disavows ownership of
> any such pages.

Yes, I've seen that oops with RDMA; apparently it isn't actually that
rare if you tweak things just right.

I thought it was an obscure ext4 bug :(

> Because the obvious "fix" in device driver land is to use a dedicated
> buffer for DMA, and copy to the filesystem buffer, and of course I will
> get *killed* if I propose such a performance-killing approach. But a
> core kernel fix really is starting to sound attractive.

Yeah, killed is right. That idea totally cripples RDMA.

What is the point of get_user_pages FOLL_WRITE if you can't write to
and dirty the pages!?!

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  3:24           ` Jason Gunthorpe
@ 2018-05-19  3:51             ` Dan Williams
  2018-05-19  5:38               ` John Hubbard
  2018-05-21 14:38               ` Matthew Wilcox
  2018-05-21 13:37             ` Christopher Lameter
  1 sibling, 2 replies; 13+ messages in thread
From: Dan Williams @ 2018-05-19  3:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 8:24 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Fri, May 18, 2018 at 07:33:41PM -0700, John Hubbard wrote:
>> On 05/18/2018 01:23 PM, Dan Williams wrote:
>> > On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> >> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>> >>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
>> >>>
>> >>>
>> >>> The newcomer here is RDMA. The FS side is the mainstream use case and has
>> >>> been there since Unix learned to do paging.
>> >>
>> >> Well, it has been this way for 12 years, so it isn't that new.
>> >>
>> >> Honestly it sounds like get_user_pages is just a broken Linux
>> >> API??
>> >>
>> >> Nothing can use it to write to pages because the FS could explode -
>> >> RDMA makes it particularly easy to trigger this due to the longer time
>> >> windows, but presumably any get_user_pages could generate a race and
>> >> hit this? Is that right?
>>
>> +1, and I am now super-interested in this conversation, because
>> after tracking down a kernel BUG to this classic mistaken pattern:
>>
>>     get_user_pages (on file-backed memory from ext4)
>>     ...do some DMA
>>     set_pages_dirty
>>     put_page(s)
>
> Ummm, RDMA has done essentially that since 2005, since when did it
> become wrong? Do you have some references? Is there some alternative?
>
> See __ib_umem_release
>
>> ...there is (rarely!) a backtrace from ext4, that disavows ownership of
>> any such pages.
>
> Yes, I've seen that oops with RDMA, apparently isn't actually that
> rare if you tweak things just right.
>
> I thought it was an obscure ext4 bug :(
>
>> Because the obvious "fix" in device driver land is to use a dedicated
>> buffer for DMA, and copy to the filesystem buffer, and of course I will
>> get *killed* if I propose such a performance-killing approach. But a
>> core kernel fix really is starting to sound attractive.
>
> Yeah, killed is right. That idea totally cripples RDMA.
>
> What is the point of get_user_pages FOLL_WRITE if you can't write to
> and dirty the pages!?!
>

You're oversimplifying the problem; here are the details:

https://www.spinics.net/lists/linux-mm/msg142700.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  3:51             ` Dan Williams
@ 2018-05-19  5:38               ` John Hubbard
  2018-05-21 14:38               ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: John Hubbard @ 2018-05-19  5:38 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/18/2018 08:51 PM, Dan Williams wrote:
> On Fri, May 18, 2018 at 8:24 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Fri, May 18, 2018 at 07:33:41PM -0700, John Hubbard wrote:
>>> On 05/18/2018 01:23 PM, Dan Williams wrote:
>>>> On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>>>>>> On Fri, 18 May 2018, Jason Gunthorpe wrote:
>>>>>>
>>>>>>
>>>>>> The newcomer here is RDMA. The FS side is the mainstream use case and has
>>>>>> been there since Unix learned to do paging.
>>>>>
>>>>> Well, it has been this way for 12 years, so it isn't that new.
>>>>>
>>>>> Honestly it sounds like get_user_pages is just a broken Linux
>>>>> API??
>>>>>
>>>>> Nothing can use it to write to pages because the FS could explode -
>>>>> RDMA makes it particularly easy to trigger this due to the longer time
>>>>> windows, but presumably any get_user_pages could generate a race and
>>>>> hit this? Is that right?
>>>
>>> +1, and I am now super-interested in this conversation, because
>>> after tracking down a kernel BUG to this classic mistaken pattern:
>>>
>>>     get_user_pages (on file-backed memory from ext4)
>>>     ...do some DMA
>>>     set_pages_dirty
>>>     put_page(s)
>>
>> Ummm, RDMA has done essentially that since 2005, since when did it
>> become wrong? Do you have some references? Is there some alternative?
>>
>> See __ib_umem_release
>>
>>> ...there is (rarely!) a backtrace from ext4, that disavows ownership of
>>> any such pages.
>>
>> Yes, I've seen that oops with RDMA, apparently isn't actually that
>> rare if you tweak things just right.
>>
>> I thought it was an obscure ext4 bug :(
>>
>>> Because the obvious "fix" in device driver land is to use a dedicated
>>> buffer for DMA, and copy to the filesystem buffer, and of course I will
>>> get *killed* if I propose such a performance-killing approach. But a
>>> core kernel fix really is starting to sound attractive.
>>
>> Yeah, killed is right. That idea totally cripples RDMA.
>>
>> What is the point of get_user_pages FOLL_WRITE if you can't write to
>> and dirty the pages!?!
>>
> 
> You're oversimplifying the problem, here are the details:
> 
> https://www.spinics.net/lists/linux-mm/msg142700.html
> 

Hi Dan,

The thing is, the above still leads to a fairly simple conclusion:
unless something is changed, then no, you cannot do the standard, simple
RDMA thing, because someone can hit the case that Jan Kara describes in
the link above.

So I think we're all saying the same thing, right? We need to fix
something so that this pattern actually works in all cases.

I just want to be sure that this doesn't get characterized as "oh this is 
just a special case that you might need to avoid". From my view (which at
the moment is sort of RDMA-centric, for this bug), I don't see a way to 
avoid the problem, other than avoiding calling get_user_pages on file-backed
memory--and even *that* is darned difficult.

Please tell me I'm wildly wrong, though--I'd really love to be wildly wrong
here. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  3:24           ` Jason Gunthorpe
  2018-05-19  3:51             ` Dan Williams
@ 2018-05-21 13:37             ` Christopher Lameter
  1 sibling, 0 replies; 13+ messages in thread
From: Christopher Lameter @ 2018-05-21 13:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Dan Williams, linux-rdma, Linux MM, Michal Hocko

On Fri, 18 May 2018, Jason Gunthorpe wrote:

> Ummm, RDMA has done essentially that since 2005, since when did it
> become wrong? Do you have some references? Is there some alternative?

It was wrong from the start; it has just become much more evident with
the widespread use of RDMA. The inability to keep scaling processor
performance, combined with the huge increase in available network
bandwidth, forces users toward RDMA solutions. So they will try to do
RDMA to file-backed mappings, whereas in the past we only used anonymous
memory for RDMA.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  2:33         ` John Hubbard
  2018-05-19  3:24           ` Jason Gunthorpe
@ 2018-05-21 13:59           ` Christopher Lameter
  1 sibling, 0 replies; 13+ messages in thread
From: Christopher Lameter @ 2018-05-21 13:59 UTC (permalink / raw)
  To: John Hubbard
  Cc: Dan Williams, Jason Gunthorpe, linux-rdma, Linux MM, Michal Hocko

On Fri, 18 May 2018, John Hubbard wrote:

> > In other words, get_user_pages_longterm() is just a short term
> > band-aid for RDMA until we can get that infrastructure built. We don't
> > need to go down any mmu-notifier rabbit holes.
> >
>
> git grep claims that break_layouts is so far an XFS-only feature, though.
> Were there plans to fix this for all filesystems?

break_layouts()? From my perspective this sounds like an mmu notifier
callback with slightly different semantics. Maybe add another callback to
generalize that functionality?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-19  3:51             ` Dan Williams
  2018-05-19  5:38               ` John Hubbard
@ 2018-05-21 14:38               ` Matthew Wilcox
  2018-05-23 23:03                 ` John Hubbard
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2018-05-21 14:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, John Hubbard, Christopher Lameter, linux-rdma,
	Linux MM, Michal Hocko

On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
> >> +1, and I am now super-interested in this conversation, because
> >> after tracking down a kernel BUG to this classic mistaken pattern:
> >>
> >>     get_user_pages (on file-backed memory from ext4)
> >>     ...do some DMA
> >>     set_pages_dirty
> >>     put_page(s)
> >
> > Ummm, RDMA has done essentially that since 2005, since when did it
> > become wrong? Do you have some references? Is there some alternative?
> >
> > See __ib_umem_release
> >
> >> ...there is (rarely!) a backtrace from ext4, that disavows ownership of
> >> any such pages.
> >
> > Yes, I've seen that oops with RDMA, apparently isn't actually that
> > rare if you tweak things just right.
> >
> > I thought it was an obscure ext4 bug :(
> >
> >> Because the obvious "fix" in device driver land is to use a dedicated
> >> buffer for DMA, and copy to the filesystem buffer, and of course I will
> >> get *killed* if I propose such a performance-killing approach. But a
> >> core kernel fix really is starting to sound attractive.
> >
> > Yeah, killed is right. That idea totally cripples RDMA.
> >
> > What is the point of get_user_pages FOLL_WRITE if you can't write to
> > and dirty the pages!?!
> 
> You're oversimplifying the problem, here are the details:
> 
> https://www.spinics.net/lists/linux-mm/msg142700.html

Suggestion 1:

in get_user_pages_fast(), mark the page as dirty, but don't tag the radix
tree entry as dirty.  Then vmscan() won't find it when it's looking to
write out dirty pages.  Only mark it as dirty in the radix tree once we
call set_page_dirty_lock().

Suggestion 2:

in get_user_pages_fast(), replace the page in the radix tree with a special
entry that means "page under io".  In set_page_dirty_lock(), replace the
"page under io" entry with the struct page pointer.

Both of these suggestions have trouble with simultaneous sub-page IOs to the
same page.  Do we care?  I suspect we might as pages get larger (see also:
supporting THP pages in the page cache).
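
To make suggestion 1 a bit more concrete, something like the sketch below
(names invented; the real change would live in mm/gup.c and
mm/page-writeback.c, dirty accounting is glossed over, and the field/lock
names are the ~4.17 ones):

    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/radix-tree.h>

    /* In the gup fast path, per FOLL_WRITE page: set PG_dirty on the page
     * itself but skip the radix tree tag, so the writeback scan (which
     * walks PAGECACHE_TAG_DIRTY) leaves the page alone while it is pinned. */
    static void gup_pin_mark_dirty(struct page *page)
    {
            SetPageDirty(page);
    }

    /* When the pin is dropped (set_page_dirty_lock time): now tag the tree
     * entry so writeback can see the page again. */
    static void gup_unpin_mark_dirty(struct page *page)
    {
            struct address_space *mapping = page_mapping(page);

            if (!mapping)
                    return;

            xa_lock_irq(&mapping->i_pages);
            radix_tree_tag_set(&mapping->i_pages, page_index(page),
                               PAGECACHE_TAG_DIRTY);
            xa_unlock_irq(&mapping->i_pages);
    }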

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSFMM] RDMA data corruption potential during FS writeback
  2018-05-21 14:38               ` Matthew Wilcox
@ 2018-05-23 23:03                 ` John Hubbard
  0 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2018-05-23 23:03 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: Jason Gunthorpe, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/21/2018 07:38 AM, Matthew Wilcox wrote:
> On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
-------------8<------------------------------------------------
> 
> Suggestion 1:
> 
> in get_user_pages_fast(), mark the page as dirty, but don't tag the radix
> tree entry as dirty.  Then vmscan() won't find it when it's looking to
> write out dirty pages.  Only mark it as dirty in the radix tree once we
> call set_page_dirty_lock().
> 
> Suggestion 2:
> 
> in get_user_pages_fast(), replace the page in the radix tree with a special
> entry that means "page under io".  In set_page_dirty_lock(), replace the
> "page under io" entry with the struct page pointer.

This second one feels simpler to me. If no one sees huge problems with
it, I can put it together and try it out, because I have a few nicely
reproducible bugs that I can test it on.

But with either approach, a quick question first: will this do the right thing
for the other two use cases below?

    a) ftruncate

    b) deleting the inode and dropping all references to it (only the 
       get_user_pages reference remains)

...or is there some other way to sneak in and call try_to_free_buffers()
on a page in this state?

Also, just to be sure I'm on the same page, is it accurate to claim that we
would then have the following updated guidelines for device drivers and
user space?

1. You can safely DMA to file-backed memory that you've pinned via
get_user_pages (with the usual caveats about getting the pages you think
you're getting), if you are careful to avoid truncating or deleting the 
file out from under get_user_pages.

In other words, this pattern is supported:

    get_user_pages (on file-backed memory from a persistent storage filesystem)
    ...do some DMA
    set_page_dirty_lock
    put_page

2. Furthermore, even if you are less careful, you still won't crash the
kernel. The worst that can happen is that you corrupt your data, by
interrupting the writeback.

The possibility of data corruption is bad, but it's also arguably both
self-inflicted and avoidable. Anyway, even so, it's an improvement: the bugs
I'm seeing would definitely get fixed with this.
    
> 
> Both of these suggestions have trouble with simultaneous sub-page IOs to the
> same page.  Do we care?  I suspect we might as pages get larger (see also:
> supporting THP pages in the page cache).
> 

I don't *think* we care. At least, no examples occur to me where this would
cause a problem.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-05-23 23:04 UTC | newest]

Thread overview: 13+ messages
2018-05-18 14:37 [LSFMM] RDMA data corruption potential during FS writeback Christopher Lameter
2018-05-18 15:49 ` Jason Gunthorpe
2018-05-18 16:47   ` Christopher Lameter
2018-05-18 17:36     ` Jason Gunthorpe
2018-05-18 20:23       ` Dan Williams
2018-05-19  2:33         ` John Hubbard
2018-05-19  3:24           ` Jason Gunthorpe
2018-05-19  3:51             ` Dan Williams
2018-05-19  5:38               ` John Hubbard
2018-05-21 14:38               ` Matthew Wilcox
2018-05-23 23:03                 ` John Hubbard
2018-05-21 13:37             ` Christopher Lameter
2018-05-21 13:59           ` Christopher Lameter
