* [LSFMM] RDMA data corruption potential during FS writeback

From: Christopher Lameter
Date: 2018-05-18 14:37 UTC
To: linux-rdma
Cc: linux-mm, Michal Hocko, Jason Gunthorpe

There was a session at the Linux Filesystem and Memory Management summit
on issues that are caused by devices using get_user_pages() or elevated
refcounts to pin pages and then do I/O on them.

See https://lwn.net/Articles/753027/

Basically, filesystems need to mark the pages read-only during writeback.
Concurrent DMA into a page while it is being written out by a filesystem
can cause corrupted data to be written to the disk, incorrect checksums,
etc.

The solution that was proposed at the meeting was that mmu notifiers can
remedy the situation by allowing callbacks to the RDMA device to ensure
that the RDMA device and the filesystem do not do concurrent writeback.

This issue has been around for a long time and so far has not caused too
much grief, it seems. Doing I/O to two devices from the same memory
location is naturally a bit inconsistent in itself. But could we do more
to prevent issues here? I think what may be useful is to not allow memory
registrations of file-backed writable mappings unless the device driver
provides mmu callbacks, or something like that.

There is also the longstanding issue of refcounts that are held over long
time periods. If we require mmu notifier callbacks then we may as well go
to on-demand paging (ODP) mode for RDMA memory registrations. This avoids
elevating the refcounts long term and allows easy access control / page
removal for memory management.

There may even be more issues if DAX is being used, but FS writeback has
the potential of biting anyone at this point, it seems. I think we need
to put some thought into these issues, and we need some coordination
between the RDMA developers and memory management.
RDMA seems to be more and more important, and thus it's likely that
issues like this will become more pressing.
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Jason Gunthorpe
Date: 2018-05-18 15:49 UTC
To: Christopher Lameter
Cc: linux-rdma, linux-mm, Michal Hocko

On Fri, May 18, 2018 at 02:37:52PM +0000, Christopher Lameter wrote:
> There was a session at the Linux Filesystem and Memory Management summit
> on issues that are caused by devices using get_user_pages() or elevated
> refcounts to pin pages and then do I/O on them.
>
> See https://lwn.net/Articles/753027/
>
> Basically, filesystems need to mark the pages read-only during writeback.
> Concurrent DMA into a page while it is being written out by a filesystem
> can cause corrupted data to be written to the disk, incorrect checksums,
> etc.
>
> The solution that was proposed at the meeting was that mmu notifiers can
> remedy the situation by allowing callbacks to the RDMA device to ensure
> that the RDMA device and the filesystem do not do concurrent writeback.

This keeps coming up, and I understand why it seems appealing from the
MM side, but the reality is that very little RDMA hardware supports
this, and it carries with it a fairly big performance penalty, so many
users don't like using it.

> This issue has been around for a long time and so far has not caused too
> much grief, it seems. Doing I/O to two devices from the same memory
> location is naturally a bit inconsistent in itself.

Well, I've seen various reports of FSes oopsing and whatnot, as the LWN
article points out. So it is breaking stuff for some users.

> But could we do more to prevent issues here? I think what may be useful
> is to not allow memory registrations of file-backed writable mappings
> unless the device driver provides mmu callbacks, or something like that.
Why does every proposed solution to this involve crippling RDMA? Are
there really no ideas to allow the FS side to accommodate this use
case??

> There may even be more issues if DAX is being used, but FS writeback
> has the potential of biting anyone at this point, it seems.

I think Dan already 'solved' this via get_user_pages_longterm, which
just fails for DAX-backed pages.

Jason
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Christopher Lameter
Date: 2018-05-18 16:47 UTC
To: Jason Gunthorpe
Cc: linux-rdma, linux-mm, Michal Hocko, Dan Williams

On Fri, 18 May 2018, Jason Gunthorpe wrote:

> > The solution that was proposed at the meeting was that mmu notifiers can
> > remedy the situation by allowing callbacks to the RDMA device to ensure
> > that the RDMA device and the filesystem do not do concurrent writeback.
>
> This keeps coming up, and I understand why it seems appealing from the
> MM side, but the reality is that very little RDMA hardware supports
> this, and it carries with it a fairly big performance penalty, so many
> users don't like using it.

OK, so we have a latent data corruption issue that is not being
addressed.

> > But could we do more to prevent issues here? I think what may be useful
> > is to not allow memory registrations of file-backed writable mappings
> > unless the device driver provides mmu callbacks, or something like that.
>
> Why does every proposed solution to this involve crippling RDMA? Are
> there really no ideas to allow the FS side to accommodate this use
> case??

The newcomer here is RDMA. The FS side is the mainstream use case and
has been there since Unix learned to do paging.

> > There may even be more issues if DAX is being used, but FS writeback
> > has the potential of biting anyone at this point, it seems.
>
> I think Dan already 'solved' this via get_user_pages_longterm, which
> just fails for DAX-backed pages.

That is indeed crippling and would kill the ideas that we had around
here for using DAX. There needs to be an ability to shove large amounts
of data into memory via RDMA and from there onto a disk without too much
of a fuss and without copying.
In the case of DAX this should trivially avoid the copy to disk, since
the data is already in memory. If this does not work, then the whole
thing is not that performant anymore, since it requires a copy
operation.
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Jason Gunthorpe
Date: 2018-05-18 17:36 UTC
To: Christopher Lameter
Cc: linux-rdma, linux-mm, Michal Hocko, Dan Williams

On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
> On Fri, 18 May 2018, Jason Gunthorpe wrote:
> ---8<---------------------------------
>
> OK, so we have a latent data corruption issue that is not being
> addressed.
>
> > Why does every proposed solution to this involve crippling RDMA? Are
> > there really no ideas to allow the FS side to accommodate this use
> > case??
>
> The newcomer here is RDMA. The FS side is the mainstream use case and
> has been there since Unix learned to do paging.

Well, it has been this way for 12 years, so it isn't that new.

Honestly it sounds like get_user_pages is just a broken Linux API??

Nothing can use it to write to pages because the FS could explode -
RDMA makes it particularly easy to trigger this due to the longer time
windows, but presumably any get_user_pages could generate a race and
hit this? Is that right?
I am left with the impression that solving it in the FS is too
performance-costly, so the FS side doesn't want that overhead? Was that
also the conclusion?

Could we take another crack at this during Linux Plumbers? Will the MM
parties be there too? I'm sorry I wasn't able to attend LSFMM this year!

> > > There may even be more issues if DAX is being used, but FS writeback
> > > has the potential of biting anyone at this point, it seems.
> >
> > I think Dan already 'solved' this via get_user_pages_longterm, which
> > just fails for DAX-backed pages.
>
> That is indeed crippling and would kill the ideas that we had around
> here for using DAX. There needs to be an ability to shove large amounts
> of data into memory via RDMA and from there onto a disk without too
> much of a fuss and without copying. In the case of DAX this should
> trivially avoid the copy to disk, since the data is already in memory.

AFAIK, if you enable ODP on your MR then DAX will work as you want, but
you take lower network performance to get it. You might be the first
person to test this, though ;)

Jason
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Dan Williams
Date: 2018-05-18 20:23 UTC
To: Jason Gunthorpe
Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> ---8<---------------------------------
>
> Well, it has been this way for 12 years, so it isn't that new.
>
> Honestly it sounds like get_user_pages is just a broken Linux API??
> Nothing can use it to write to pages because the FS could explode -
> RDMA makes it particularly easy to trigger this due to the longer time
> windows, but presumably any get_user_pages could generate a race and
> hit this? Is that right?
>
> I am left with the impression that solving it in the FS is too
> performance-costly, so the FS side doesn't want that overhead? Was that
> also the conclusion?
>
> Could we take another crack at this during Linux Plumbers? Will the MM
> parties be there too? I'm sorry I wasn't able to attend LSFMM this
> year!

Yes, you and hch were missed, and I had to skip the last day due to a
family emergency.

Plumbers sounds good to resync on this topic, but we already have a
plan: use break_layouts() to coordinate a filesystem's need to move DAX
blocks around relative to an active RDMA memory registration. If you
never punch a hole in the middle of your RDMA registration then you
never incur any performance penalty. Otherwise the layout-break
notification is just there to tell the application "hey man, talk to
your friend that punched a hole in the middle of your mapping, but the
filesystem wants this block back now. Sorry, I'm kicking you out. Ok,
bye.".

In other words, get_user_pages_longterm() is just a short-term band-aid
for RDMA until we can get that infrastructure built. We don't need to go
down any mmu-notifier rabbit holes.
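[The lease-and-break coordination Dan describes can be sketched as a tiny
toy model. This is purely illustrative Python, not kernel code: names like
`LayoutLeases`, `register_mr`, and `punch_hole` are invented stand-ins for
the real XFS-style break_layouts()/lease machinery.]

```python
class LayoutLeases:
    """Toy model: RDMA registrations hold layout leases on file ranges;
    the filesystem breaks overlapping leases before reclaiming blocks."""

    def __init__(self):
        self.leases = []  # (start, end, notify_cb) per memory registration

    def register_mr(self, start, end, notify_cb):
        # An RDMA memory registration takes a layout lease on its range.
        self.leases.append((start, end, notify_cb))

    def punch_hole(self, start, end):
        # The filesystem wants blocks back: notify every overlapping
        # lease holder ("sorry, I'm kicking you out"), then reclaim.
        broken = [(s, e, cb) for (s, e, cb) in self.leases
                  if s < end and start < e]          # range overlap test
        for (s, e, cb) in broken:
            cb(start, end)                           # layout-break callback
            self.leases.remove((s, e, cb))
        return len(broken)


fs = LayoutLeases()
events = []
fs.register_mr(0, 100, lambda s, e: events.append((s, e)))
fs.register_mr(200, 300, lambda s, e: events.append((s, e)))

# No hole punched inside a registration -> no notification, no penalty.
assert fs.punch_hole(150, 160) == 0

# A hole inside the first registration breaks that layout only.
assert fs.punch_hole(40, 50) == 1
print(events)   # [(40, 50)]
```

The point of the model is the "no penalty in the common case" property:
an application that never overlaps a hole punch with an active
registration never sees a callback at all.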
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: John Hubbard
Date: 2018-05-19 2:33 UTC
To: Dan Williams, Jason Gunthorpe
Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/18/2018 01:23 PM, Dan Williams wrote:
> On Fri, May 18, 2018 at 10:36 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Fri, May 18, 2018 at 04:47:48PM +0000, Christopher Lameter wrote:
>>> ---8<---------------------------------
>>
>> Honestly it sounds like get_user_pages is just a broken Linux API??
>>
>> Nothing can use it to write to pages because the FS could explode -
>> RDMA makes it particularly easy to trigger this due to the longer time
>> windows, but presumably any get_user_pages could generate a race and
>> hit this? Is that right?

+1, and I am now super-interested in this conversation, because after
tracking down a kernel BUG to this classic mistaken pattern:

    get_user_pages (on file-backed memory from ext4)
    ...do some DMA
    set_pages_dirty
    put_page(s)

...there is (rarely!) a backtrace from ext4 that disavows ownership of
any such pages. It happens rarely enough that people have come to
believe that the pattern is OK, from what I can tell. But some new,
cutting-edge systems with zillions of threads and lots of memory are
able to expose the problem.
Anyway, I've been dividing my time between trying to prove exactly which
FS action is disconnecting the page from ext4 in this particular bug
(even though it's lately becoming well documented that the design itself
is not correct), and casting about for the most proper place to fix
this. Because the obvious "fix" in device-driver land is to use a
dedicated buffer for DMA and copy to the filesystem buffer, and of
course I will get *killed* if I propose such a performance-killing
approach. But a core kernel fix really is starting to sound attractive.

>> I am left with the impression that solving it in the FS is too
>> performance-costly, so the FS side doesn't want that overhead? Was
>> that also the conclusion?
>>
>> Could we take another crack at this during Linux Plumbers? Will the MM
>> parties be there too? I'm sorry I wasn't able to attend LSFMM this
>> year!
>
> Yes, you and hch were missed, and I had to skip the last day due to a
> family emergency.
>
> Plumbers sounds good to resync on this topic, but we already have a
> plan: use break_layouts() to coordinate a filesystem's need to move DAX
> blocks around relative to an active RDMA memory registration. If you
> never punch a hole in the middle of your RDMA registration then you
> never incur any performance penalty. [...]
>
> In other words, get_user_pages_longterm() is just a short-term band-aid
> for RDMA until we can get that infrastructure built. We don't need to
> go down any mmu-notifier rabbit holes.

git grep claims that break_layouts is so far an XFS-only feature,
though. Were there plans to fix this for all filesystems?

thanks,
--
John Hubbard
NVIDIA
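[The mistaken pattern John describes can be modeled in a few lines of
userspace code. This is a toy sketch only: `Page`,
`writeback_and_release`, and friends are invented stand-ins for the real
ext4/VM paths, not kernel API — the point is just the ordering of the
race.]

```python
class Page:
    """Toy stand-in for a file-backed struct page."""
    def __init__(self):
        self.refcount = 1        # page-cache reference
        self.dirty = True        # userspace wrote through the mapping
        self.buffers = object()  # stand-in for page->private buffer heads

def get_user_pages(page):
    # The "pin" is just an elevated refcount; the FS is not told anything.
    page.refcount += 1
    return page

def writeback_and_release(page):
    # Filesystem writeback: clean the page; with the page now clean,
    # try_to_free_buffers() succeeds and detaches the buffer heads.
    page.dirty = False
    page.buffers = None

def set_pages_dirty(page):
    # The dirtying path assumes buffer heads are still attached;
    # hitting a page without them is the (rare) backtrace in the thread.
    if page.buffers is None:
        raise RuntimeError("BUG: dirtying a page the FS no longer owns")
    page.dirty = True


page = get_user_pages(Page())   # driver pins the page for DMA
writeback_and_release(page)     # FS writeback runs while DMA is in flight
try:
    set_pages_dirty(page)       # DMA done; driver dirties the page
except RuntimeError as e:
    print(e)                    # BUG: dirtying a page the FS no longer owns
```

Nothing in the pattern itself is misordered; the bug is that the elevated
refcount does not stop writeback from cleaning the page and freeing its
buffers in the middle.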
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Jason Gunthorpe
Date: 2018-05-19 3:24 UTC
To: John Hubbard
Cc: Dan Williams, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 07:33:41PM -0700, John Hubbard wrote:
> ---8<---------------------------------
>
> +1, and I am now super-interested in this conversation, because after
> tracking down a kernel BUG to this classic mistaken pattern:
>
>     get_user_pages (on file-backed memory from ext4)
>     ...do some DMA
>     set_pages_dirty
>     put_page(s)

Ummm, RDMA has done essentially that since 2005, so since when did it
become wrong? Do you have some references? Is there some alternative?

See __ib_umem_release

> ...there is (rarely!) a backtrace from ext4 that disavows ownership of
> any such pages.

Yes, I've seen that oops with RDMA; apparently it isn't actually that
rare if you tweak things just right.
I thought it was an obscure ext4 bug :(

> Because the obvious "fix" in device-driver land is to use a dedicated
> buffer for DMA and copy to the filesystem buffer, and of course I will
> get *killed* if I propose such a performance-killing approach. But a
> core kernel fix really is starting to sound attractive.

Yeah, killed is right. That idea totally cripples RDMA.

What is the point of get_user_pages FOLL_WRITE if you can't write to and
dirty the pages!?!

Jason
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Dan Williams
Date: 2018-05-19 3:51 UTC
To: Jason Gunthorpe
Cc: John Hubbard, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 8:24 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Fri, May 18, 2018 at 07:33:41PM -0700, John Hubbard wrote:
>> ---8<---------------------------------
>>
>> +1, and I am now super-interested in this conversation, because after
>> tracking down a kernel BUG to this classic mistaken pattern:
>>
>>     get_user_pages (on file-backed memory from ext4)
>>     ...do some DMA
>>     set_pages_dirty
>>     put_page(s)
>
> Ummm, RDMA has done essentially that since 2005, so since when did it
> become wrong? Do you have some references? Is there some alternative?
>
> See __ib_umem_release
>
>> ...there is (rarely!) a backtrace from ext4 that disavows ownership of
>> any such pages.
>
> Yes, I've seen that oops with RDMA; apparently it isn't actually that
> rare if you tweak things just right.
>
> I thought it was an obscure ext4 bug :(
>
>> Because the obvious "fix" in device-driver land is to use a dedicated
>> buffer for DMA and copy to the filesystem buffer, and of course I will
>> get *killed* if I propose such a performance-killing approach. But a
>> core kernel fix really is starting to sound attractive.
>
> Yeah, killed is right. That idea totally cripples RDMA.
>
> What is the point of get_user_pages FOLL_WRITE if you can't write to
> and dirty the pages!?!

You're oversimplifying the problem; here are the details:

https://www.spinics.net/lists/linux-mm/msg142700.html
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: John Hubbard
Date: 2018-05-19 5:38 UTC
To: Dan Williams, Jason Gunthorpe
Cc: Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/18/2018 08:51 PM, Dan Williams wrote:
> On Fri, May 18, 2018 at 8:24 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> ---8<---------------------------------
>>
>> What is the point of get_user_pages FOLL_WRITE if you can't write to
>> and dirty the pages!?!
>
> You're oversimplifying the problem; here are the details:
>
> https://www.spinics.net/lists/linux-mm/msg142700.html

Hi Dan,

The thing is, the above still leads to a fairly simple conclusion, which
is: unless something is changed, then no, you cannot do the standard,
simple RDMA thing, because someone can hit the case that Jan Kara
describes in the link above.

So I think we're all saying the same thing, right? We need to fix
something so that this pattern actually works in all cases? I just want
to be sure that this doesn't get characterized as "oh, this is just a
special case that you might need to avoid". From my view (which at the
moment is sort of RDMA-centric, for this bug), I don't see a way to
avoid the problem, other than avoiding calling get_user_pages on
file-backed memory--and even *that* is darned difficult. Please tell me
I'm wildly wrong, though--I'd really love to be wildly wrong here. :)

thanks,
--
John Hubbard
NVIDIA
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Matthew Wilcox
Date: 2018-05-21 14:38 UTC
To: Dan Williams
Cc: Jason Gunthorpe, John Hubbard, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
> >> ---8<---------------------------------
> >>
> >> Because the obvious "fix" in device-driver land is to use a dedicated
> >> buffer for DMA and copy to the filesystem buffer, and of course I
> >> will get *killed* if I propose such a performance-killing approach.
> >> But a core kernel fix really is starting to sound attractive.
> >
> > Yeah, killed is right. That idea totally cripples RDMA.
> >
> > What is the point of get_user_pages FOLL_WRITE if you can't write to
> > and dirty the pages!?!
>
> You're oversimplifying the problem; here are the details:
>
> https://www.spinics.net/lists/linux-mm/msg142700.html

Suggestion 1:

In get_user_pages_fast(), mark the page as dirty, but don't tag the
radix-tree entry as dirty. Then vmscan() won't find it when it's looking
to write out dirty pages.
Only mark it as dirty in the radix tree once we call
set_page_dirty_lock().

Suggestion 2:

In get_user_pages_fast(), replace the page in the radix tree with a
special entry that means "page under io". In set_page_dirty_lock(),
replace the "page under io" entry with the struct page pointer.

Both of these suggestions have trouble with simultaneous sub-page I/Os
to the same page. Do we care? I suspect we might as pages get larger
(see also: supporting THP pages in the page cache).
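[Suggestion 2 can be sketched roughly as follows. This is a toy Python
model of the idea only — the real implementation would use the kernel's
radix-tree/XArray machinery, and the sub-page granularity problem Matthew
raises is exactly what a per-page sentinel like this glosses over. All
names are illustrative.]

```python
PAGE_UNDER_IO = object()   # special radix-tree entry: "page under io"

class ToyPageCache:
    """Toy model of suggestion 2: hide pinned pages behind a sentinel."""

    def __init__(self):
        self.tree = {}       # file index -> page (here: any object)
        self.dirty = set()   # indices tagged dirty for writeback

    def add_page(self, index, page):
        self.tree[index] = page

    def get_user_pages_fast(self, index):
        # Pin the page and replace its slot with the sentinel, so
        # writeback won't touch it while DMA may still be in flight.
        page = self.tree[index]
        self.tree[index] = PAGE_UNDER_IO
        return page

    def set_page_dirty_lock(self, index, page):
        # DMA complete: restore the real entry, and only now tag it dirty.
        self.tree[index] = page
        self.dirty.add(index)

    def writeback_candidates(self):
        # vmscan/writeback skips sentinel entries entirely.
        return [i for i in self.dirty if self.tree[i] is not PAGE_UNDER_IO]


cache = ToyPageCache()
cache.add_page(0, "page0")
p = cache.get_user_pages_fast(0)            # DMA in flight
assert cache.writeback_candidates() == []   # writeback leaves it alone
cache.set_page_dirty_lock(0, p)             # DMA done
assert cache.writeback_candidates() == [0]  # now safe to write back
```

The design choice being modeled: instead of teaching every writeback path
about pins, the pinned page simply stops being visible to writeback until
the driver hands it back via set_page_dirty_lock().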
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: John Hubbard
Date: 2018-05-23 23:03 UTC
To: Matthew Wilcox, Dan Williams
Cc: Jason Gunthorpe, Christopher Lameter, linux-rdma, Linux MM, Michal Hocko

On 05/21/2018 07:38 AM, Matthew Wilcox wrote:
> On Fri, May 18, 2018 at 08:51:38PM -0700, Dan Williams wrote:
> -------------8<------------------------------------------------
>
> Suggestion 1:
>
> In get_user_pages_fast(), mark the page as dirty, but don't tag the
> radix-tree entry as dirty. Then vmscan() won't find it when it's
> looking to write out dirty pages. Only mark it as dirty in the radix
> tree once we call set_page_dirty_lock().
>
> Suggestion 2:
>
> In get_user_pages_fast(), replace the page in the radix tree with a
> special entry that means "page under io". In set_page_dirty_lock(),
> replace the "page under io" entry with the struct page pointer.

This second one feels simpler to me. If no one sees huge problems with
it, I can put this together and try it out, because I have a few nicely
reproducible bugs that I can test it on.

But with either approach, a quick question first: will this do the right
thing for the other two use cases below?

a) ftruncate
b) deleting the inode and dropping all references to it (only the
   get_user_pages reference remains)

...or is there some other way to sneak in and try_to_free_buffers() on a
page in this state?

Also, just to be sure I'm on the same page, is it accurate to claim that
we would then have the following updated guidelines for device drivers
and user space?

1. You can safely DMA to file-backed memory that you've pinned via
get_user_pages (with the usual caveats about getting the pages you think
you're getting), if you are careful to avoid truncating or deleting the
file out from under get_user_pages.
In other words, this pattern is supported:

    get_user_pages (on file-backed memory from a persistent storage
                    filesystem)
    ...do some DMA
    set_page_dirty_lock
    put_page

2. Furthermore, even if you are less careful, you still won't crash the
kernel. The worst that could happen is corrupting your data by
interrupting the writeback. The possibility of data corruption is bad,
but it's also arguably both self-inflicted and avoidable.

Anyway, even so, it's an improvement: the bugs I'm seeing would
definitely get fixed with this.

> Both of these suggestions have trouble with simultaneous sub-page I/Os
> to the same page. Do we care? I suspect we might as pages get larger
> (see also: supporting THP pages in the page cache).

I don't *think* we care. At least, no examples occur to me where this
would cause a problem.

thanks,
--
John Hubbard
NVIDIA
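[The supported pattern in guideline 1 can be contrasted with the buggy
behavior in a self-contained toy model. This is illustrative Python, not
kernel code: the key assumed change is that writeback leaves pinned pages
alone, which is what makes set_page_dirty_lock() after DMA safe.]

```python
class Page:
    """Toy stand-in for a file-backed struct page."""
    def __init__(self):
        self.pinned = 0
        self.dirty = True
        self.buffers = object()   # stand-in for attached buffer heads

def get_user_pages(page):
    page.pinned += 1

def writeback(page):
    # The modeled fix: writeback skips pinned pages instead of cleaning
    # them and freeing their buffers while DMA may still be in flight.
    if page.pinned:
        return
    page.dirty = False
    page.buffers = None

def set_page_dirty_lock(page):
    # Safe only because the buffers are guaranteed to still be attached.
    assert page.buffers is not None, "FS no longer owns this page"
    page.dirty = True

def put_page(page):
    page.pinned -= 1


page = Page()
get_user_pages(page)        # pin file-backed page for DMA
writeback(page)             # concurrent writeback: skipped while pinned
# ... DMA completes ...
set_page_dirty_lock(page)   # safe: buffers still attached
put_page(page)              # unpin; writeback may now proceed normally
print(page.dirty, page.pinned)   # True 0
```

Guideline 2 falls out of the same model: a truncate or delete racing with
the pin can still corrupt the data being written, but the dirtying path no
longer dereferences freed buffer heads, so the kernel itself survives.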
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Christopher Lameter
Date: 2018-05-21 13:37 UTC
To: Jason Gunthorpe
Cc: John Hubbard, Dan Williams, linux-rdma, Linux MM, Michal Hocko

On Fri, 18 May 2018, Jason Gunthorpe wrote:

> Ummm, RDMA has done essentially that since 2005, so since when did it
> become wrong? Do you have some references? Is there some alternative?

It was wrong from the start. It became much more evident with the
widespread use of RDMA. The inability to scale processor performance
further, combined with the huge increase in available network bandwidth,
forces users toward RDMA solutions. Thus they will try to do RDMA to
file-backed mappings, where in the past we only used anonymous memory
for RDMA.
* Re: [LSFMM] RDMA data corruption potential during FS writeback

From: Christopher Lameter
Date: 2018-05-21 13:59 UTC
To: John Hubbard
Cc: Dan Williams, Jason Gunthorpe, linux-rdma, Linux MM, Michal Hocko

On Fri, 18 May 2018, John Hubbard wrote:

>> In other words, get_user_pages_longterm() is just a short-term
>> band-aid for RDMA until we can get that infrastructure built. We
>> don't need to go down any mmu-notifier rabbit holes.
>
> git grep claims that break_layouts is so far an XFS-only feature,
> though. Were there plans to fix this for all filesystems?

break_layouts? From my perspective this sounds like an mmu notifier
callback with slightly different semantics. Maybe add another callback
to generalize that functionality?