* [PATCH 0/4] Some more lock_page work.. @ 2020-10-13 19:59 Linus Torvalds 2020-10-13 20:03 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Linus Torvalds @ 2020-10-13 19:59 UTC (permalink / raw) To: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov Cc: Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein So this is all just preliminary, but I'd really like to have people think more about the page fault handling of the page lock, and I have a small experimental series of patches for people to look at and maybe get the discussion started. The first three patches are really just fairly trivial cleanups. They also likely don't really matter, because the *bulk* of all faults - particularly the ones that really shouldn't need any page locking games - should be all about just "filemap_map_pages()". Which is that optimistic "let's insert pages from the page cache as efficiently as possible" case. That's how all the normal private pages that don't actually get modified (so executables, but also any load that uses mmap as a zero-copy read) should generally get populated. That code doesn't actually do "lock_page()" itself (because it all runs under the RCU read lock), but it does do a trylock, and gives up if the page was locked. Which is fine as long as you don't get any contention, but any concurrent faults of the same page in different address spaces will then just mess with other faulters and cause them to fall out of the fast path. And looking at that code, I'm pretty sure it doesn't actually *need* the page lock.
It wants the page lock for two reasons:

 - the truncation worries (which may or may not be relevant - xfs wraps the map_pages with xfs_ilock)

 - compound page worries (sub-page mapcount updates and page splitting issues)

The compound page case I'm not sure about, but it's probably fine to just lock the page in that case - once we end up actually just mapping a hugepage, the number of page faults should be small enough that it probably doesn't matter. The truncation thing could be handled like xfs does, but honestly, I think it can equally well be handled by just doing some operations in the right order, and double-checking that we don't race with truncate. IOW, first increasing the page mapcount, and then re-checking that the page still isn't locked and the mapping is still valid, and reachable in the xarray. Because we can still just drop out of this loop and not populate the page table if we see anything odd going on, but if *this* code doesn't bother getting the page lock (and we make the COW code not do it either), then in all the normal cases you will never have that "fall out of the common case". IOW, I think right now the thing that makes us fall back to the actual page lock is this code itself: by doing the "trylock", it will make other users of the same page unable to do the fast case. And I think the trylock is unnecessary. ANYWAY. The patch I have here isn't actually that "just do the checks in the right order" patch. No, it's a dirty nasty "a private mapping doesn't need to be so careful" patch. Ugly, brutish and nasty. Not the right thing at all. But I'm doing it here simply because I wanted to test it out and get people to look at this. This code is "tested" in the sense that it builds for me, and I'm actually running it right now. But I haven't actually stress-tested it or tried to see if it gets rid of some page lock heavy cases. Comments? Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
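The "do some operations in the right order" idea above - increase the mapcount first, then re-check the lock and the mapping - can be sketched as a toy userspace model. Everything below (struct toy_page, try_map_fast()) is invented for illustration and is not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: fields loosely mirror PG_locked, _mapcount and ->mapping. */
struct toy_page {
	atomic_bool locked;
	atomic_int mapcount;
	void *mapping;
};

/*
 * Sketch of the proposed ordering: bump the mapcount first, then
 * re-check the page lock and the mapping.  A truncator must lock the
 * page and unmap it (via the rmap walk, which looks at the mapcount)
 * before tearing down the mapping, so if both re-checks pass, any
 * later truncation is guaranteed to see our mapping and undo it.
 */
static bool try_map_fast(struct toy_page *page, void *expected_mapping)
{
	if (atomic_load(&page->locked))
		return false;			/* contended: fall back to the slow path */

	atomic_fetch_add(&page->mapcount, 1);	/* make ourselves visible first */

	/* double-check that nothing odd is going on */
	if (atomic_load(&page->locked) || page->mapping != expected_mapping) {
		atomic_fetch_sub(&page->mapcount, 1);
		return false;			/* drop out of the loop, don't populate */
	}
	return true;				/* ok to insert the pte */
}
```

The point is the ordering: because the mapcount increment happens before the re-check, anyone who takes the page lock afterwards is guaranteed to see the mapping and remove it, and anyone who had it locked before forces this path to back off.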
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds @ 2020-10-13 20:03 ` Linus Torvalds 2020-10-14 13:05 ` Kirill A. Shutemov 2020-10-14 5:50 ` Hugh Dickins [not found] ` <4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw> 2 siblings, 1 reply; 25+ messages in thread From: Linus Torvalds @ 2020-10-13 20:03 UTC (permalink / raw) To: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov Cc: Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein [-- Attachment #1: Type: text/plain, Size: 362 bytes --] On Tue, Oct 13, 2020 at 12:59 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Comments? Gaah. My alpine gmail setup has gotten broken by more fancy gmail security features, so sending the actual patches that way broke down. So here they are as attachments instead. I'll fix my alpine configuration after the merge window. Linus [-- Attachment #2: 0001-mm-move-final-page-locking-out-of-__do_fault-helper-.patch --] [-- Type: text/x-patch, Size: 2744 bytes --] From c6f074f1758e233965495f589863b75ab0e1609d Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 13 Oct 2020 10:22:00 -0700 Subject: [PATCH 1/4] mm: move final page locking out of __do_fault() helper into callers The old semantics of our __do_fault() helper were that it always locked the page unless there was an error (or unless the fault handling had already handled a COW event). That turns out to be a mistake. Not all callers actually want the page locked at all, and they might as well check the same VM_FAULT_LOCKED bit that __do_fault() itself checked to see whether the page is already locked or not. This change only moves that final page locking out into the callers, but intentionally does not actually change any of the locking semantics: the callers will now just do that final page locking themselves instead.
That means that future patches may then decide to not lock the page after all, but this is just preparation for any such future change.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..b4a7b81dcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3616,11 +3616,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -4000,6 +3995,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -4031,6 +4031,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -4054,6 +4059,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
-- 
2.28.0.218.gc12ef3d349

[-- Attachment #3: 0002-mm-don-t-lock-the-page-only-to-immediately-unlock-it.patch --]
[-- Type: text/x-patch, Size: 2276 bytes --]

From dcc752881236fa914d0286da71e46b31956aa0dc
Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 10:43:05 -0700
Subject: [PATCH 2/4] mm: don't lock the page, only to immediately unlock it again for do_page_mkwrite()

Our page locking during fault handling is a bit messy, and the shared fault code in particular was locking the page only to immediately unlock it again because do_page_mkwrite() wanted it unlocked. We keep the "did we lock it" state around in the VM_FAULT_LOCKED bit, so let's just use that knowledge, and not first lock it if it wasn't locked, only to then unlock it again. It would be even better to transfer the "did we already lock this page" information into do_page_mkwrite(), because that function will actually want to lock it eventually anyway, but let's just clean up one thing at a time.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b4a7b81dcc7a..5c93b4bec063 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,25 +4059,33 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
 	if (vma->vm_ops->page_mkwrite) {
-		unlock_page(vmf->page);
+		/* do_page_mkwrite() wants the page unlocked */
+		if (ret & VM_FAULT_LOCKED) {
+			unlock_page(vmf->page);
+			ret &= ~VM_FAULT_LOCKED;
+		}
+
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
+
+		/* Did do_page_mkwrite() lock the page again?
+		 */
+		ret |= tmp & VM_FAULT_LOCKED;
 	}
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) {
-- 
2.28.0.218.gc12ef3d349

[-- Attachment #4: 0003-mm-do_cow_fault-does-not-need-the-source-page-to-be-.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]

From 969e62f9784dcd3083ba3fe32b87bdea8319aba9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 11:00:33 -0700
Subject: [PATCH 3/4] mm: do_cow_fault() does not need the source page to be locked

This removes the "lock if it wasn't locked" logic from do_cow_fault(), since we're not even going to install that page into the destination address space (finish_fault() will use ->cow_page rather than ->page), and copying the source page does not need the source to be locked. So instead of doing "lock if it wasn't locked" followed by an unconditional unlock of the page, just do "unlock if it was locked". Of course, since all the normal file mapping ->fault() handlers currently lock the page they return (see filemap_fault() for details), all of this is pretty much theoretical. But this is the right thing to do - holding the page lock when we really don't need to is just confusing and wrong. And this prepares the way for any future changes to filemap_fault() where we go "Oh, we actually _don't_ need to lock the page for this case at all".
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5c93b4bec063..d4d32d0c33c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4031,16 +4031,13 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
+	if (ret & VM_FAULT_LOCKED)
+		unlock_page(vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
-- 
2.28.0.218.gc12ef3d349

[-- Attachment #5: 0004-mm-make-filemap_map_pages-avoid-the-page-lock-if-pos.patch --]
[-- Type: text/x-patch, Size: 3529 bytes --]

From 107014f091622dcb411c0ae38c99a95704a62f3f Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 12:03:40 -0700
Subject: [PATCH 4/4] mm: make filemap_map_pages() avoid the page lock if possible

Private mappings don't need to be 100% serialized with file truncation etc, and act more like an optimized "read()" call. So we can avoid taking the page lock for them. NOTE! This is a trial patch. I'm not entirely happy about this, because I think we can avoid the page lock for shared mappings too, by just changing the order in which we do some of the checks. In particular, once we have the page table lock (which we need anyway), we could increment the page mapping count. That - together with re-checking that the page still isn't locked - should be a sufficient guarantee that nobody has finished truncating that page yet, and any future truncation will end up being serialized on the page table lock. The compound page case probably needs some thinking about too.
Needs-review-by: Matthew Wilcox <willy@infradead.org>
Needs-review-by: Kirill A. Shutemov <kirill@shutemov.name>
Needs-review-by: Hugh Dickins <hughd@google.com>
Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/filemap.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 748b7b1b4f6d..6accb7905a36 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2788,6 +2788,7 @@ EXPORT_SYMBOL(filemap_fault);
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
+	bool shared = vmf->vma->vm_flags & VM_SHARED;
 	struct file *file = vmf->vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
@@ -2798,6 +2799,8 @@ void filemap_map_pages(struct vm_fault *vmf,
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_pgoff) {
+		bool locked = false;
+
 		if (xas_retry(&xas, page))
 			continue;
 		if (xa_is_value(page))
@@ -2815,15 +2818,40 @@ void filemap_map_pages(struct vm_fault *vmf,
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto skip;
+		/*
+		 * also recheck the page lock after getting the reference,
+		 * so that any page lockers will have seen us incrementing
+		 * it or not see us at all.
+		 */
+		if (PageLocked(page))
+			goto skip;
+
 		page = find_subpage(page, xas.xa_index);
 
 		if (!PageUptodate(page) ||
 				PageReadahead(page) ||
 				PageHWPoison(page))
 			goto skip;
-		if (!trylock_page(page))
-			goto skip;
+		/*
+		 * We only need to be really careful about races
+		 * with truncate etc for shared mappings.
+		 *
+		 * But we also need to lock the page for compound
+		 * pages (see alloc_set_pte -> page_add_file_rmap).
+		 */
+		if (shared || PageTransCompound(page)) {
+			if (!trylock_page(page))
+				goto skip;
+			locked = true;
+		}
+
+		/*
+		 * Even if we don't get the page lock, we'll re-check
+		 * the page mapping and the mapping size.
+		 *
+		 * It won't hurt, even if it's racy.
+		 */
 		if (page->mapping != mapping || !PageUptodate(page))
 			goto unlock;
 
@@ -2840,10 +2868,12 @@
 		last_pgoff = xas.xa_index;
 		if (alloc_set_pte(vmf, page))
 			goto unlock;
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 		goto next;
 unlock:
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 skip:
 		put_page(page);
 next:
-- 
2.28.0.218.gc12ef3d349

^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-13 20:03 ` Linus Torvalds @ 2020-10-14 13:05 ` Kirill A. Shutemov 2020-10-14 16:53 ` Linus Torvalds 0 siblings, 1 reply; 25+ messages in thread From: Kirill A. Shutemov @ 2020-10-14 13:05 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Tue, Oct 13, 2020 at 01:03:30PM -0700, Linus Torvalds wrote: > On Tue, Oct 13, 2020 at 12:59 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Comments? So we remove strict serialization against truncation in filemap_map_pages() if the mapping is private. IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for such mappings. The "page->mapping != mapping" check makes the race window smaller, but doesn't remove it. I'm not sure all codepaths are fine with this. For instance, it looks like migration will back off such pages: __unmap_and_move() doesn't know how to deal with mapped pages with ->mapping == NULL. Yes, it is not a crash, but still... Am I missing something? -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-14 13:05 ` Kirill A. Shutemov @ 2020-10-14 16:53 ` Linus Torvalds 2020-10-14 18:15 ` Matthew Wilcox 2020-10-15 9:43 ` Kirill A. Shutemov 0 siblings, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2020-10-14 16:53 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > So we remove strict serialization against truncation in > filemap_map_pages() if the mapping is private. Well, that particular patch does. But as mentioned, that's the wrong way of doing it. It's just that the right way - which keeps the serialization - requires some changes to how we actually install the pte. Right now that goes through alloc_set_pte(), and that one is basically written with the page lock in mind. So the current logic is

 - lock page

 - check page->mapping etc that is now stable

 - lock page tables if required (alloc_set_pte -> pte_alloc_one_map -> pte_offset_map_lock)

 - insert pte

but that whole alloc_set_pte() thing is actually much too generic for what "map_pages()" really wants, and I think we could re-organize it. In particular, what I _think_ we could do is:

 - lock the page tables

 - check that the page isn't locked

 - increment the page mapcount (atomic and ordered)

 - check that the page still isn't locked

 - insert pte

without taking the page lock. And the reason that's safe is that *if* we're racing with something that is about to remove the page mapping, then *that* code will

 (a) hold the page lock

 (b) before it removes the mapping, it has to remove it from any existing mappings, which involves checking the mapcount and going off and getting the page table locks to remove any ptes.

but my patch did *not* do that, because you have to re-organize things a bit.
Note that the above still keeps the "optimistic" model - if the page is locked, we'll skip that pte, and we'll end up doing the heavy thing (which will then take the pte lock if required). Which is why we basically get away with that "only test page lock, don't take it" approach with some ordering. But my theory is that the actual truncate case is *so* rare that we basically don't need to even worry about it, and that once the (common) page fault case doesn't need the page lock, then that whole "fall back to slow case" simply doesn't happen in practice. > IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for > such mappings. The "page->mapping != mapping" check makes the race window > smaller, but doesn't remove it. Yes, that patch I posted as an RFC does exactly that. But I think we can keep the serialization if we just make slightly more involved changes. And they aren't necessarily a _lot_ more involved. In fact, I think we may already hold the page table lock due to doing that "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I think the only _real_ problem is that I think we increment the page_mapcount() too late in alloc_set_pte(). And yeah, I realize this is a bit handwavy. But that's why I sent these things out as an RFC. To see if people can shoot holes in this, and maybe do that proper patch instead of my "let's get discussion going" one. Also, I really do think that our current "private mappings are coherent with file changes" is a big QoI issue: it's the right thing to do. So I wouldn't actually really want to take advantage of "private mappings don't really _have_ to be coherent according to POSIX". That's a lazy cop-out. So I dislike my patch, and don't think it's really acceptable, but as a "let's try this" I think it's a good first step. > I'm not sure all codepaths are fine with this.
> For instance, looks like > migration will back off such pages: __unmap_and_move() doesn't know how to > deal with mapped pages with ->mapping == NULL. So as outlined above, I think we can avoid the ->mapping == NULL case, and we can remove all the races. But it would still do new things, like incrementing the page mapcount without holding the page lock. I think that's ok - we already _decrement_ it without holding the lock, so it's not that the count is stable without locking - but at a minimum we'd have that "need to make sure the memory ordering between testing the page lock initially, incrementing the count, and re-testing is ok". And maybe some code that expects things to be stable. So I'm not at all claiming that things are obviously fine. I'm happy to see that Hugh started some practical stress-tests on that RFC patch, and I think we can improve on the situation. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
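The (a)/(b) obligations on the unmapping side that Linus describes can be sketched the same way, as a toy userspace model. Again, toy_page and toy_remove_mapping() are invented names for illustration, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page, loosely mirroring PG_locked, _mapcount and ->mapping. */
struct toy_page {
	atomic_bool locked;
	atomic_int mapcount;
	void *mapping;
};

/*
 * Sketch of the removal side: (a) hold the page lock, and (b) before
 * removing the mapping, check the mapcount and unmap any ptes (the
 * rmap walk, which takes the page table locks).  Anyone who bumped
 * the mapcount before we took the lock is found here; anyone who
 * re-checks the lock after we took it backs off on its own.
 */
static void toy_remove_mapping(struct toy_page *page)
{
	bool expected = false;
	while (!atomic_compare_exchange_weak(&page->locked, &expected, true))
		expected = false;		/* lock_page() */

	if (atomic_load(&page->mapcount) > 0)
		atomic_store(&page->mapcount, 0);	/* rmap walk rips out the ptes */

	page->mapping = NULL;			/* the mapping is now gone */
	atomic_store(&page->locked, false);	/* unlock_page() */
}
```

Together with the fault-side ordering (increment mapcount, then re-check the lock), every interleaving either has the remover see the mapping and undo it, or the faulter see the lock and drop out.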
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-14 16:53 ` Linus Torvalds @ 2020-10-14 18:15 ` Matthew Wilcox 2020-10-15 10:41 ` Kirill A. Shutemov 2020-10-15 9:43 ` Kirill A. Shutemov 1 sibling, 1 reply; 25+ messages in thread From: Matthew Wilcox @ 2020-10-14 18:15 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hugh Dickins, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote: > In particular, what I _think_ we could do is: > > - lock the page tables > > - check that the page isn't locked > > - increment the page mapcount (atomic and ordered) > > - check that the page still isn't locked > > - insert pte > > without taking the page lock. And the reason that's safe is that *if* [...] > And they aren't necessarily a _lot_ more involved. In fact, I think we > may already hold the page table lock due to doing that > "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I > think the only _real_ problem is that I think we increment the > page_mapcount() too late in alloc_set_pte(). I'm not entirely sure why we require the page lock to be held in page_add_file_rmap():

	} else {
		if (PageTransCompound(page) && page_mapping(page)) {
			VM_WARN_ON_ONCE(!PageLocked(page));

			SetPageDoubleMap(compound_head(page));
			if (PageMlocked(page))
				clear_page_mlock(compound_head(page));
		}

We have a reference to the page, so compound_head() isn't going to change. SetPageDoubleMap() is atomic. PageMlocked() is atomic. clear_page_mlock() does TestClearPageMlocked() as its first thing, so that's atomic too. What am I missing? (Kirill added it, so I assume he remembers ;-) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-14 18:15 ` Matthew Wilcox @ 2020-10-15 10:41 ` Kirill A. Shutemov 0 siblings, 0 replies; 25+ messages in thread From: Kirill A. Shutemov @ 2020-10-15 10:41 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Hugh Dickins, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Wed, Oct 14, 2020 at 07:15:09PM +0100, Matthew Wilcox wrote: > On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote: > > In particular, what I _think_ we could do is: > > > > - lock the page tables > > > > - check that the page isn't locked > > > > - increment the page mapcount (atomic and ordered) > > > > - check that the page still isn't locked > > > > - insert pte > > > > without taking the page lock. And the reason that's safe is that *if* > [...] > > And they aren't necessarily a _lot_ more involved. In fact, I think we > > may already hold the page table lock due to doing that > > "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I > > think the only _real_ problem is that I think we increment the > > page_mapcount() too late in alloc_set_pte(). > > I'm not entirely sure why we require the page lock to be held in > page_add_file_rmap(): > > } else { > if (PageTransCompound(page) && page_mapping(page)) { > VM_WARN_ON_ONCE(!PageLocked(page)); > > SetPageDoubleMap(compound_head(page)); > if (PageMlocked(page)) > clear_page_mlock(compound_head(page)); > } > > We have a reference to the page, so compound_head() isn't going > to change. SetPageDoubleMap() is atomic. PageMlocked() is atomic. > clear_page_mlock() does TestClearPageMlocked() as its first thing, > so that's atomic too. What am I missing? (Kirill added it, so I > assume he remembers ;-) In general (with current scheme), we need page lock here to serialize against try_to_unmap() and other rmap walkers. For file THP, we also serialize against follow_trans_huge_pmd(FOLL_MLOCK). -- Kirill A. 
Shutemov ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-14 16:53 ` Linus Torvalds 2020-10-14 18:15 ` Matthew Wilcox @ 2020-10-15 9:43 ` Kirill A. Shutemov 2020-10-15 16:44 ` Linus Torvalds 1 sibling, 1 reply; 25+ messages in thread From: Kirill A. Shutemov @ 2020-10-15 9:43 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote: > In particular, what I _think_ we could do is: > > - lock the page tables > > - check that the page isn't locked > > - increment the page mapcount (atomic and ordered) > > - check that the page still isn't locked > > - insert pte > > without taking the page lock. And the reason that's safe is that *if* > we're racing with something that is about the remove the page mapping, > then *that* code will > > (a) hold the page lock > > (b) before it removes the mapping, it has to remove it from any > existing mappings, which involves checking the mapcount and going off > and getting the page table locks to remove any ptes. > > but my patch did *not* do that, because you have to re-organize things a bit. Okay, I see what you propose. But I don't think it addresses the race with try_to_unmap():

	CPU0				CPU1
	filemap_map_pages()
	  take ptl
	  PageLocked() == false
					lock_page()
					try_to_unmap()
					  rwc->done()
					    page_mapcount_is_zero() == true
					  rwc->done() == true,
					  skip full rmap walk
					  never take ptl taken by CPU0
					try_to_unmap() == true
					...
					unlock_page()
	  increment mapcount
	  PageLocked() == false
	  insert PTE

Are we willing to give up the rwc->done() optimization? Or am I missing some other serialization point? -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-15 9:43 ` Kirill A. Shutemov @ 2020-10-15 16:44 ` Linus Torvalds 0 siblings, 0 replies; 25+ messages in thread From: Linus Torvalds @ 2020-10-15 16:44 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Thu, Oct 15, 2020 at 2:43 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > Okay, I see what you propose. > > But I don't think it addresses race with try_to_unmap(): I don't think it needs to. Remember: the map_pages() thing is called only when the page tables are empty. So try_to_unmap() will never see the pte entry, and there is nothing to race with. So sure, it can "race" with try_to_unmap() like you say, but who cares? The "race" is no different from taking the page lock _after_ try_to_unmap() already ran (and didn't see anything because the page hadn't been mapped yet). IOW I don't think try_to_unmap() really matters. The race you outline can already happen with the "trylock()" - no different from the trylock just succeeding after the unlock_page(). So you can think of map_pages() as all happening after try_to_unmap() has already succeeded - and didn't see the new pte that hasn't been filled in yet. I do think there is a real race, but it is with "__remove_mapping()". That still happens under the page lock, but it doesn't actually _depend_ on the page lock as far as I can tell. Because I think the real protection there is that

	if (!page_ref_freeze(page, refcount))
		goto cannot_free;

where that code sees "oh, somebody else has a reference to the page, we can't remove the mapping". But it's entirely possible that I don't understand your worry, and I overlooked something. If so, can you explain using smaller words, please ;) Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
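The refcount-freeze protection Linus points at here boils down to one compare-and-exchange: drop the refcount from the expected value to zero, and fail if anyone else holds a reference. A minimal standalone rendering of those semantics (toy_ref_freeze() is an invented name; this is not the kernel's page_ref_freeze() implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Semantics of page_ref_freeze(page, count): atomically replace the
 * refcount with zero, but only if it is exactly `count`.  If anyone
 * else holds a reference, the freeze fails and __remove_mapping()
 * takes the cannot_free path instead of dropping the page.
 */
static bool toy_ref_freeze(atomic_int *refcount, int count)
{
	/* on failure, the refcount is left untouched */
	return atomic_compare_exchange_strong(refcount, &count, 0);
}
```

This is why a fast-path faulter holding an elevated reference is enough to make a concurrent __remove_mapping() back off, independently of the page lock.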
* Re: [PATCH 0/4] Some more lock_page work.. 2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds 2020-10-13 20:03 ` Linus Torvalds @ 2020-10-14 5:50 ` Hugh Dickins [not found] ` <4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw> 2 siblings, 0 replies; 25+ messages in thread From: Hugh Dickins @ 2020-10-14 5:50 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Tue, 13 Oct 2020, Linus Torvalds wrote: > So this is all just preliminary, but I'd really like to have people > think more about the page fault handling of the page lock, and I have > a small experimental series of patches for people to look at and maybe > get the discussion started. > > The three first patches are really just fairly trivial cleanups. They > also likely don't really matter, because the *bulk* of all faults - > particularly the ones that really shouldn't need any page locking > games - should be all about just "filemap_map_pages()". Which is that > optimistic "let's insert pages from the page cache as efficiently as > possible" case. > > That's how all the normal private pages that don't actually get > modified (so executables, but also any load that uses mmap as a > zero-copy read) should generally get populated. > > That code doesn't actually do "lock_page()" itself (because it all > runs under the RCU read lock), but it does to do a trylock, and give > up if the page was locked. Which is fine as long as you don't get any > contention, but any concurrent faults of the same page in different > address spaces will then just mess with other faulters and cause it to > fall out of the fast path. > > And looking at that code, I'm pretty sure it doesn't actually *need* > the page lock. 
It wants the page lock for two reasons: > > - the truncation worries (which may or may not be relevant - xfs > wraps the map_pages with xfs_ilock) > > - compound page worries (sub-page mapcount updates and page splitting issues) > > The compound page case I'm not sure about, but it's probably fine to > just lock the page in that case - once we end up actually just mapping > a hugepage, the number of page faults should be small enough that it > probably doesn't matter. > > The truncation thing could be handled like xfs does, but honestly, I > think it can equally well be handled by just doing some operations in > the right order, and double-checking that we don't race with truncate. > IOW, first increasing the page mapcount, and then re-checking that the > page still isn't locked and the mapping is still valid, and reachable > in the xarray. > > Because we can still just drop out of this loop and not populate the > page table if we see anything odd going on, but if *this* code doesn't > bother getting the page lock (and we make the COW code not do it > either), then in all the normal cases you will never have that "fall > out of the common case". > > IOW, I think right now the thing that makes us fall back to the actual > page lock is this code itself: by doing the 'trylock", it will make > other users of the same page not able to do the fast-case. And I think > the trylock is unnecessary. > > ANYWAY. The patch I have here isn't actually that "just do the checks > in the right order" patch. No, it's a dirty nasty "a private mapping > doesn't need to be so careful" patch. Ugly, brutish and nasty. Not the > right thing at all. But I'm doing it here simply because I wanted to > test it out and get people to look at this. > > This code is "tested" in the sense that it builds for me, and I'm > actually running it right now. But I haven't actually stress-tested it > or tried to see if it gets rid of some page lock heavy cases. > > Comments? 
I haven't even read a word you wrote yet (okay, apart from "Comments?"), nor studied the patches; but have put them under my usual load, and the only adjustment I've needed so far is --- 5.9.0/mm/khugepaged.c 2020-10-11 14:15:50.000000000 -0700 +++ linux/mm/khugepaged.c 2020-10-13 21:44:26.000000000 -0700 @@ -1814,7 +1814,7 @@ static void collapse_file(struct mm_stru xas_set(&xas, index); VM_BUG_ON_PAGE(page != xas_load(&xas), page); - VM_BUG_ON_PAGE(page_mapped(page), page); +// VM_BUG_ON_PAGE(page_mapped(page), page); /* * The page is expected to have page_count() == 3: But it's going to take a lot of diligence to get confident with them: I have no grip on all the places in which we do assume page lock held. The place to start looking will be 2.6.23's git history, in which Nick IIRC introduced VM_FAULT_LOCKED: I thought it was an essential step on the way to speculative page cache, but may be misremembering. Or maybe when I actually read what you've said, I'll find that you have already done that research. Hugh ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw>]
* Re: [PATCH 0/4] Some more lock_page work.. [not found] ` <4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw> @ 2020-10-15 2:44 ` Linus Torvalds 2020-10-15 15:16 ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Linus Torvalds @ 2020-10-15 2:44 UTC (permalink / raw) To: Qian Cai Cc: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein, Vivek Goyal On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote: > > While on this topic, I just want to bring up a bug report that we are chasing an > issue that a process is stuck in the loop of wait_on_page_bit_common() for more > than 10 minutes before I gave up. Judging by call trace, that looks like a deadlock rather than a missed wakeup. The trace isn't reliable, but I find it suspicious that the call trace just before the fault contains that "iov_iter_copy_from_user_atomic()". IOW, I think you're in fuse_fill_write_pages(), which has allocated the page, locked it, and then it takes a page fault. And the page fault waits on a page that is locked. This is a classic deadlock. The *intent* is that iov_iter_copy_from_user_atomic() returns zero, and you retry without the page lock held. HOWEVER. That's not what fuse actually does. Fuse will do multiple pages, and it will unlock only the _last_ page. It keeps the other pages locked, and puts them in an array: ap->pages[ap->num_pages] = page; And after the iov_iter_copy_from_user_atomic() fails, it does that "unlock" and repeat. But while the _last_ page was unlocked, the *previous* pages are still locked in that array. Deadlock. I really don't think this has anything at all to do with page locking, and everything to do with fuse_fill_write_pages() having a deadlock if the source of data is a mmap of one of the pages it is trying to write to (just with an offset, so that it's not the last page). 
See a similar code sequence in generic_perform_write(), but notice how that code only has *one* page that it locks, and never holds an array of pages around over that iov_iter_fault_in_readable() thing. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-15 2:44 ` Linus Torvalds @ 2020-10-15 15:16 ` Vivek Goyal 2020-10-15 19:55 ` Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2020-10-15 15:16 UTC (permalink / raw) To: Linus Torvalds Cc: Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein, Miklos Szeredi On Wed, Oct 14, 2020 at 07:44:16PM -0700, Linus Torvalds wrote: > On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote: > > > > While on this topic, I just want to bring up a bug report that we are chasing an > > issue that a process is stuck in the loop of wait_on_page_bit_common() for more > > than 10 minutes before I gave up. > > Judging by call trace, that looks like a deadlock rather than a missed wakeup. > > The trace isn't reliable, but I find it suspicious that the call trace > just before the fault contains that > "iov_iter_copy_from_user_atomic()". > > IOW, I think you're in fuse_fill_write_pages(), which has allocated > the page, locked it, and then it takes a page fault. > > And the page fault waits on a page that is locked. > > This is a classic deadlock. > > The *intent* is that iov_iter_copy_from_user_atomic() returns zero, > and you retry without the page lock held. > > HOWEVER. > > That's not what fuse actually does. Fuse will do multiple pages, and > it will unlock only the _last_ page. It keeps the other pages locked, > and puts them in an array: > > ap->pages[ap->num_pages] = page; > > And after the iov_iter_copy_from_user_atomic() fails, it does that > "unlock" and repeat. > > But while the _last_ page was unlocked, the *previous* pages are still > locked in that array. Deadlock. 
> > I really don't think this has anything at all to do with page locking, > and everything to do with fuse_fill_write_pages() having a deadlock if > the source of data is a mmap of one of the pages it is trying to write > to (just with an offset, so that it's not the last page). > > See a similar code sequence in generic_perform_write(), but notice how > that code only has *one* page that it locks, and never holds an array > of pages around over that iov_iter_fault_in_readable() thing. Indeed. This is a deadlock in fuse. Thanks for the analysis. I can now trivially reproduce it with following program. Thanks Vivek #include <stdio.h> #include <stdlib.h> #include <errno.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sys/mman.h> #include <sys/uio.h> int main(int argc, char *argv[]) { int fd, ret; void *map_addr; size_t map_length = 2 * 4096; char *buf_out = "Hello World"; struct iovec iov[2]; if (argc != 2 ) { printf("Usage:%s <file-to-write> \n", argv[0]); exit(1); } fd = open(argv[1], O_RDWR | O_TRUNC); if (fd == -1) { fprintf(stderr, "Failed to open file %s:%s, errorno=%d\n", argv[1], strerror(errno), errno); exit(1); } map_addr = mmap(NULL, map_length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (map_addr == MAP_FAILED) { fprintf(stderr, "mmap failed %s, errorno=%d\n", strerror(errno), errno); exit(1); } /* Write first page and second page */ pwrite(fd, buf_out, strlen(buf_out), 0); pwrite(fd, buf_out, strlen(buf_out), 4096); /* Copy from first page and then second page */ iov[0].iov_base = map_addr; iov[0].iov_len = strlen(buf_out) + 1; iov[1].iov_base = map_addr + 4096; iov[1].iov_len = strlen(buf_out) + 1; /* * Write second page offset (4K - 12), reading from first page and * then second page. In first iteration we should fault in first * page and lock second page. And in second iteration we should * try fault in second page which is locked. Deadlock. 
*/ ret = pwritev(fd, iov, 2, 4096 + 4096 - 12); printf("write() returned=%d\n", ret); munmap(map_addr, map_length); close(fd); } ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-15 15:16 ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal @ 2020-10-15 19:55 ` Vivek Goyal 2020-10-15 21:21 ` Linus Torvalds 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2020-10-15 19:55 UTC (permalink / raw) To: Linus Torvalds, Miklos Szeredi Cc: Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Thu, Oct 15, 2020 at 11:16:06AM -0400, Vivek Goyal wrote: > On Wed, Oct 14, 2020 at 07:44:16PM -0700, Linus Torvalds wrote: > > On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote: > > > > > > While on this topic, I just want to bring up a bug report that we are chasing an > > > issue that a process is stuck in the loop of wait_on_page_bit_common() for more > > > than 10 minutes before I gave up. > > > > Judging by call trace, that looks like a deadlock rather than a missed wakeup. > > > > The trace isn't reliable, but I find it suspicious that the call trace > > just before the fault contains that > > "iov_iter_copy_from_user_atomic()". > > > > IOW, I think you're in fuse_fill_write_pages(), which has allocated > > the page, locked it, and then it takes a page fault. > > > > And the page fault waits on a page that is locked. > > > > This is a classic deadlock. > > > > The *intent* is that iov_iter_copy_from_user_atomic() returns zero, > > and you retry without the page lock held. > > > > HOWEVER. > > > > That's not what fuse actually does. Fuse will do multiple pages, and > > it will unlock only the _last_ page. It keeps the other pages locked, > > and puts them in an array: > > > > ap->pages[ap->num_pages] = page; > > > > And after the iov_iter_copy_from_user_atomic() fails, it does that > > "unlock" and repeat. > > > > But while the _last_ page was unlocked, the *previous* pages are still > > locked in that array. Deadlock. 
> > > > I really don't think this has anything at all to do with page locking, > > and everything to do with fuse_fill_write_pages() having a deadlock if > > the source of data is a mmap of one of the pages it is trying to write > > to (just with an offset, so that it's not the last page). > > > > See a similar code sequence in generic_perform_write(), but notice how > > that code only has *one* page that it locks, and never holds an array > > of pages around over that iov_iter_fault_in_readable() thing. > > Indeed. This is a deadlock in fuse. Thanks for the analysis. I can now > trivially reproduce it with following program. I am wondering how should I fix this issue. Is it enough that I drop the page lock (but keep the reference) inside the loop. And once copying from user space is done, acquire page locks for all pages (Attached a patch below). Or dropping page lock means that there are no guarantees that this page did not get written back and removed from address space and a new page has been placed at same offset. Does that mean I should instead be looking up page cache again after copying from user space is done. --- fs/fuse/file.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) Index: redhat-linux/fs/fuse/file.c =================================================================== --- redhat-linux.orig/fs/fuse/file.c 2020-10-15 14:11:23.464720742 -0400 +++ redhat-linux/fs/fuse/file.c 2020-10-15 15:23:17.716569698 -0400 @@ -1117,7 +1117,7 @@ static ssize_t fuse_fill_write_pages(str struct fuse_conn *fc = get_fuse_conn(mapping->host); unsigned offset = pos & (PAGE_SIZE - 1); size_t count = 0; - int err; + int err, i; ap->args.in_pages = true; ap->descs[0].offset = offset; @@ -1155,6 +1155,14 @@ static ssize_t fuse_fill_write_pages(str goto again; } + /* + * Unlock page but retain reference count. 
Unlock is required + otherwise it is possible that next loop iteration tries + to fault in page (source address) we have lock on and we + will deadlock. So drop lock but keep a reference on the + page. Re-acquire page lock after breaking out of the loop + */ + unlock_page(page); err = 0; ap->pages[ap->num_pages] = page; ap->descs[ap->num_pages].length = tmp; @@ -1171,7 +1179,15 @@ static ssize_t fuse_fill_write_pages(str } while (iov_iter_count(ii) && count < fc->max_write && ap->num_pages < max_pages && offset == 0); - return count > 0 ? count : err; + if (count <= 0) + return err; + + for (i = 0; i < ap->num_pages; i++) { + /* Take page lock */ + lock_page(ap->pages[i]); + } + + return count; } static inline unsigned int fuse_wr_pages(loff_t pos, size_t len, ^ permalink raw reply [flat|nested] 25+ messages in thread
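The worry Vivek raises above (the page may have been written back and replaced while unlocked) is exactly why kernel code that re-takes a dropped page lock normally revalidates the page before trusting it. Roughly, in pseudocode, following the usual pattern seen in page-cache code such as the fault path:

```
/* pseudocode: revalidate after re-locking a page that was unlocked */
lock_page(page);
if (page->mapping != mapping || page_offset(page) != pos) {
        /* truncated or replaced while we did not hold the lock */
        unlock_page(page);
        put_page(page);
        goto find_page_again;
}
```

Simply calling lock_page() in a loop at the end, as the patch above does, skips this revalidation step, which is the gap Linus's reply below points at.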
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-15 19:55 ` Vivek Goyal @ 2020-10-15 21:21 ` Linus Torvalds 2020-10-16 10:02 ` Miklos Szeredi 2020-10-16 18:19 ` Vivek Goyal 0 siblings, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2020-10-15 21:21 UTC (permalink / raw) To: Vivek Goyal Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > I am wondering how should I fix this issue. Is it enough that I drop > the page lock (but keep the reference) inside the loop. And once copying > from user space is done, acquire page locks for all pages (Attached > a patch below). What is the page lock supposed to protect? Because whatever it protects, dropping the lock drops that protection, and you'd need to re-check whatever the page lock was there for. > Or dropping page lock means that there are no guarantees that this > page did not get written back and removed from address space and > a new page has been placed at same offset. Does that mean I should > instead be looking up page cache again after copying from user > space is done. I don't know why fuse does multiple pages to begin with. Why can't it do whatever it does just one page at a time? But yes, you probably should look the page up again whenever you've unlocked it, because it might have been truncated or whatever. Note that this is purely about unlocking the page, not about "after copying from user space". The iov_iter_copy_from_user_atomic() part is safe - if that takes a page fault, it will just do a partial copy, it won't deadlock. 
It's only when that returns zero, and you do the fallback to pre-fault in any data with iov_iter_fault_in_readable() that you need to unlock _all_ pages (and once you do that, I don't see what possible advantage the multi-page array can have). Of course, the way that code is written, it always does the iov_iter_fault_in_readable() for each page - it's not written like some kind of "special case fallback thing". I suspect the code was copied from the generic write code, but without understanding why the generic write code was ok. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
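The single-page pattern Linus points at is the one in generic_perform_write(): prefault the source bytes with no page lock held, then lock exactly one destination page for the atomic copy, and unlock it again before the next iteration. A condensed sketch of the v5.9-era loop in mm/filemap.c, with error handling and bookkeeping elided:

```
/* condensed from generic_perform_write(), v5.9-era mm/filemap.c */
do {
        /* fault in the source bytes with NO page lock held */
        if (iov_iter_fault_in_readable(i, bytes)) {
                status = -EFAULT;
                break;
        }

        /* ->write_begin() allocates and locks exactly ONE page */
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                    &page, &fsdata);
        if (status < 0)
                break;

        /* atomic copy: a fault here yields a short copy, not a deadlock */
        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);

        /* ->write_end() unlocks the page before the loop repeats */
        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                  page, fsdata);
        ...
} while (iov_iter_count(i));
```

Because at most one page lock is ever held here, and never across the prefault, this loop cannot deadlock against its own source mapping, which is exactly the property the fuse multi-page variant lost.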
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-15 21:21 ` Linus Torvalds @ 2020-10-16 10:02 ` Miklos Szeredi 2020-10-16 12:27 ` Matthew Wilcox 2020-10-20 20:42 ` Vivek Goyal 2020-10-16 18:19 ` Vivek Goyal 1 sibling, 2 replies; 25+ messages in thread From: Miklos Szeredi @ 2020-10-16 10:02 UTC (permalink / raw) To: Linus Torvalds Cc: Vivek Goyal, Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Thu, Oct 15, 2020 at 11:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > I am wondering how should I fix this issue. Is it enough that I drop > > the page lock (but keep the reference) inside the loop. And once copying > > from user space is done, acquire page locks for all pages (Attached > > a patch below). > > What is the page lock supposed to protect? > > Because whatever it protects, dropping the lock drops, and you'd need > to re-check whatever the page lock was there for. > > > Or dropping page lock means that there are no guarantees that this > > page did not get written back and removed from address space and > > a new page has been placed at same offset. Does that mean I should > > instead be looking up page cache again after copying from user > > space is done. > > I don't know why fuse does multiple pages to begin with. Why can't it > do whatever it does just one page at a time? > > But yes, you probably should look the page up again whenever you've > unlocked it, because it might have been truncated or whatever. > > Not that this is purely about unlocking the page, not about "after > copying from user space". The iov_iter_copy_from_user_atomic() part is > safe - if that takes a page fault, it will just do a partial copy, it > won't deadlock. 
> > So you can potentially do multiple pages, and keep them all locked, > but only as long as the copies are all done with that > "from_user_atomic()" case. Which normally works fine, since normal > users will write stuff that they just generated, so it will all be > there. > > It's only when that returns zero, and you do the fallback to pre-fault > in any data with iov_iter_fault_in_readable() that you need to unlock > _all_ pages (and once you do that, I don't see what possible advantage > the multi-page array can have). > > Of course, the way that code is written, it always does the > iov_iter_fault_in_readable() for each page - it's not written like > some kind of "special case fallback thing". This was added by commit ea9b9907b82a ("fuse: implement perform_write") in v2.6.26 and remains essentially unchanged, AFAICS. So this is an old bug indeed. So what is the page lock protecting? I think not truncation, because inode_lock should be sufficient protection. What it does after sending a synchronous WRITE and before unlocking the pages is set the PG_uptodate flag, but only if the complete page was really written, which is what the uptodate flag really says: this page is in sync with the underlying fs. So I think the page lock here is trying to protect against concurrent reads/faults on not uptodate pages. I.e. until the WRITE request completes it is unknown whether the page was really written or not, so any reads must block until this state becomes known. This logic falls down on already cached pages, since they start out uptodate and the write does not clear this flag. So keeping the pages locked has dubious value: short writes don't seem to work correctly anyway. Which means that we can probably just set the page uptodate right after filling it from the user buffer, and unlock the page immediately. Am I missing something? Thanks, Miklos ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-16 10:02 ` Miklos Szeredi @ 2020-10-16 12:27 ` Matthew Wilcox 2020-10-20 20:42 ` Vivek Goyal 1 sibling, 0 replies; 25+ messages in thread From: Matthew Wilcox @ 2020-10-16 12:27 UTC (permalink / raw) To: Miklos Szeredi Cc: Linus Torvalds, Vivek Goyal, Qian Cai, Hugh Dickins, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Fri, Oct 16, 2020 at 12:02:21PM +0200, Miklos Szeredi wrote: > This was added by commit ea9b9907b82a ("fuse: implement > perform_write") in v2.6.26 and remains essentially unchanged, AFAICS. > So this is an old bug indeed. > > So what is the page lock protecting? I think not truncation, because > inode_lock should be sufficient protection. > > What it does after sending a synchronous WRITE and before unlocking > the pages is set the PG_uptodate flag, but only if the complete page > was really written, which is what the uptodate flag really says: this > page is in sync with the underlying fs. Not in sync with. Uptodate means "every byte on this page is at least as new as the bytes on storage". Dirty means "at least one byte is newer than the bytes on storage". > So I think the page lock here is trying to protect against concurrent > reads/faults on not uptodate pages. I.e. until the WRITE request > completes it is unknown whether the page was really written or not, so > any reads must block until this state becomes known. This logic falls > down on already cached pages, since they start out uptodate and the > write does not clear this flag. That's not how the page cache should work -- if a write() has touched every byte in a page, then the page should be marked as Uptodate, and it can immediately satisfy read()s without even touching the backing store. > So keeping the pages locked has dubious value: short writes don't seem > to work correctly anyway. 
Which means that we can probably just set > the page uptodate right after filling it from the user buffer, and > unlock the page immediately. > > Am I missing something? I haven't looked at fuse in detail -- are you handling partial page writes? That is, if someone writes to bytes 5-15 of a file, are you first reading bytes 0-4 and 16-4095 from the backing store? If so, you can mark the page Uptodate as soon as you've copied bytes 5-15 from the user. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-16 10:02 ` Miklos Szeredi 2020-10-16 12:27 ` Matthew Wilcox @ 2020-10-20 20:42 ` Vivek Goyal 2020-10-21 7:40 ` Miklos Szeredi 1 sibling, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2020-10-20 20:42 UTC (permalink / raw) To: Miklos Szeredi Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Fri, Oct 16, 2020 at 12:02:21PM +0200, Miklos Szeredi wrote: > On Thu, Oct 15, 2020 at 11:22 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > I am wondering how should I fix this issue. Is it enough that I drop > > > the page lock (but keep the reference) inside the loop. And once copying > > > from user space is done, acquire page locks for all pages (Attached > > > a patch below). > > > > What is the page lock supposed to protect? > > > > Because whatever it protects, dropping the lock drops, and you'd need > > to re-check whatever the page lock was there for. > > > > > Or dropping page lock means that there are no guarantees that this > > > page did not get written back and removed from address space and > > > a new page has been placed at same offset. Does that mean I should > > > instead be looking up page cache again after copying from user > > > space is done. > > > > I don't know why fuse does multiple pages to begin with. Why can't it > > do whatever it does just one page at a time? > > > > But yes, you probably should look the page up again whenever you've > > unlocked it, because it might have been truncated or whatever. > > > > Not that this is purely about unlocking the page, not about "after > > copying from user space". The iov_iter_copy_from_user_atomic() part is > > safe - if that takes a page fault, it will just do a partial copy, it > > won't deadlock. 
> > > > So you can potentially do multiple pages, and keep them all locked, > > but only as long as the copies are all done with that > > "from_user_atomic()" case. Which normally works fine, since normal > > users will write stuff that they just generated, so it will all be > > there. > > > > It's only when that returns zero, and you do the fallback to pre-fault > > in any data with iov_iter_fault_in_readable() that you need to unlock > > _all_ pages (and once you do that, I don't see what possible advantage > > the multi-page array can have). > > > > Of course, the way that code is written, it always does the > > iov_iter_fault_in_readable() for each page - it's not written like > > some kind of "special case fallback thing". > > This was added by commit ea9b9907b82a ("fuse: implement > perform_write") in v2.6.26 and remains essentially unchanged, AFAICS. > So this is an old bug indeed. > > So what is the page lock protecting? I think not truncation, because > inode_lock should be sufficient protection. > > What it does after sending a synchronous WRITE and before unlocking > the pages is set the PG_uptodate flag, but only if the complete page > was really written, which is what the uptodate flag really says: this > page is in sync with the underlying fs. > > So I think the page lock here is trying to protect against concurrent > reads/faults on not uptodate pages. I.e. until the WRITE request > completes it is unknown whether the page was really written or not, so > any reads must block until this state becomes known. This logic falls > down on already cached pages, since they start out uptodate and the > write does not clear this flag. > > So keeping the pages locked has dubious value: short writes don't seem > to work correctly anyway. Which means that we can probably just set > the page uptodate right after filling it from the user buffer, and > unlock the page immediately. 
Hi Miklos, As you said, for the full page WRITE, we can probably mark the page uptodate right away and drop the page lock (keep a reference and send the WRITE request to the fuse server). For a partial page write this will not work, and there seem to be at least two options. A. Either we read the page back from disk first and mark it uptodate. B. Or we keep track of such partial writes and block any further reads/readpage/direct_IO on these pages till the partial write is complete. After that it looks like the page will be left not uptodate in the page cache and a reader will read it from disk. We are doing something similar for tracking writeback requests. It is much more complicated, though, and we can probably design something simpler for these writethrough/synchronous writes. I am assuming that A. will lead to a performance penalty for short random writes. So B might be better from a performance point of view. Is it worth giving option B a try? Thanks Vivek ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-20 20:42 ` Vivek Goyal @ 2020-10-21 7:40 ` Miklos Szeredi 2020-10-21 20:12 ` Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Miklos Szeredi @ 2020-10-21 7:40 UTC (permalink / raw) To: Vivek Goyal Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Tue, Oct 20, 2020 at 10:42 PM Vivek Goyal <vgoyal@redhat.com> wrote: > As you said, for the full page WRITE, we can probably mark it > page uptodate write away and drop page lock (Keep reference and > send WRITE request to fuse server). For the partial page write this will > not work and there seem to be atleast two options. > > A. Either we read the page back from disk first and mark it uptodate. > > B. Or we keep track of such partial writes and block any further > reads/readpage/direct_IO on these pages till partial write is > complete. After that looks like page will be left notuptodate > in page cache and reader will read it from disk. We are doing > something similar for tracking writeback requests. It is much > more complicated though and we probably can design something > simpler for these writethrough/synchronous writes. > > I am assuming that A. will lead to performance penalty for short > random writes. C. Keep partial tail page locked. If write involves a partial and head AND tail page, then read head page first. I think that would be a good compromise between performance and simplicity. WDYT? Thanks, Miklos ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) 2020-10-21 7:40 ` Miklos Szeredi @ 2020-10-21 20:12 ` Vivek Goyal 2020-10-28 20:29 ` Miklos Szeredi 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2020-10-21 20:12 UTC (permalink / raw) To: Miklos Szeredi Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein On Wed, Oct 21, 2020 at 09:40:26AM +0200, Miklos Szeredi wrote: > On Tue, Oct 20, 2020 at 10:42 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > As you said, for the full page WRITE, we can probably mark it > > page uptodate write away and drop page lock (Keep reference and > > send WRITE request to fuse server). For the partial page write this will > > not work and there seem to be atleast two options. > > > > A. Either we read the page back from disk first and mark it uptodate. > > > > B. Or we keep track of such partial writes and block any further > > reads/readpage/direct_IO on these pages till partial write is > > complete. After that looks like page will be left notuptodate > > in page cache and reader will read it from disk. We are doing > > something similar for tracking writeback requests. It is much > > more complicated though and we probably can design something > > simpler for these writethrough/synchronous writes. > > > > I am assuming that A. will lead to performance penalty for short > > random writes. > > C. Keep partial tail page locked. If write involves a partial and > head AND tail page, then read head page first. I think that would be > a good compromise between performance and simplicity. > > WDYT? Hi Miklos, Sounds good. I refined this idea little bit more to avoid read completely. We can add read later if it benefits in certain situations. D. If one does a partial page write of a page which is not uptodate, then keep page locked and do not try to send multiple pages in that write. 
If the page is uptodate, then release the page lock and continue to add more pages to the same request. IOW, if the head page is partial (and it is not uptodate), we will just send the first WRITE with the head page. The rest of the pages will go in a second WRITE, and the tail page could stay locked if it was a partial write and the page was not uptodate. Please have a look at the attached patch. I still have some concerns with error handling, though. Not sure what to do about it. 1. What happens if the WRITE fails. If we are writing a full page and we already marked it as Uptodate (and not dirty), then we have a page cache page where we wrote data but could not send it to disk (and did not mark it dirty either). So if a user reads the page back, it might get the cached copy or the old copy from disk (if the page cache copy was released). 2. What happens if it is a partial page write to an Uptodate page in cache and that WRITE fails. Now we have the same error scenario as 1. In fact this is true even for the current code and is not necessarily a new scenario. 3. The current code marks a page Uptodate upon WRITE completion if it was a full page WRITE. What if the page was uptodate to begin with and the write fails? The current code will not mark it Uptodate, but it is already uptodate, so we have the same problem as 1. Apart from the above, there are some other concerns as well. With this patch, if a page is Uptodate we drop the lock and send the WRITE; otherwise we keep the page lock and send the WRITE. This should probably be fine from the read or read-fault point of view. Given that we are holding the inode lock, the write path is not a problem either. What if page is redirtied through a writable mapping ------------------------------------------------- If the page is redirtied through a writable mmap, then two writes for the same page can go in any order. But in a synchronous write we are carrying a pointer to the page cache page, so it probably does not matter. We will just write the same data twice. 
What about races with direct_IO read ------------------------------------ If a WRITE is in progress, it is probably not marked dirty, so generic_file_read_iter() will probably not block on filemap_write_and_wait_range() and will continue to mapping->a_ops->direct_IO(). And that means it can read previous disk data before this WRITE is complete. Hopefully that's not a concern because we have not returned write() success to user space, so any parallel direct I/O should not assume any order of operation. So the primary concern with this patch seems to be error handling if the fuse WRITE fails. --- fs/fuse/file.c | 28 ++++++++++++++++++---------- fs/fuse/fuse_i.h | 1 + 2 files changed, 19 insertions(+), 10 deletions(-) Index: redhat-linux/fs/fuse/file.c =================================================================== --- redhat-linux.orig/fs/fuse/file.c 2020-10-21 11:41:43.983166360 -0400 +++ redhat-linux/fs/fuse/file.c 2020-10-21 15:16:24.151166360 -0400 @@ -1106,17 +1106,15 @@ static ssize_t fuse_send_write_pages(str count = ia->write.out.size; for (i = 0; i < ap->num_pages; i++) { struct page *page = ap->pages[i]; + bool page_locked = ap->page_locked && (i == ap->num_pages - 1); - if (!err && !offset && count >= PAGE_SIZE) - SetPageUptodate(page); - - if (count > PAGE_SIZE - offset) - count -= PAGE_SIZE - offset; - else - count = 0; - offset = 0; - - unlock_page(page); + /* TODO: What if an error happened and it was a full page + * write. Should we mark it dirty now so that it is written + * back later. Otherwise we have a page in page cache which + * is not marked dirty and it is not synced to disk either. 
+ */ + if (page_locked) + unlock_page(page); put_page(page); } @@ -1180,6 +1178,16 @@ static ssize_t fuse_fill_write_pages(str if (offset == PAGE_SIZE) offset = 0; + /* If we copied full page, mark it uptodate */ + if (tmp == PAGE_SIZE) + SetPageUptodate(page); + + if (PageUptodate(page)) { + unlock_page(page); + } else { + ap->page_locked = true; + break; + } if (!fc->big_writes) break; } while (iov_iter_count(ii) && count < fc->max_write && Index: redhat-linux/fs/fuse/fuse_i.h =================================================================== --- redhat-linux.orig/fs/fuse/fuse_i.h 2020-10-20 15:18:59.471971851 -0400 +++ redhat-linux/fs/fuse/fuse_i.h 2020-10-21 14:39:53.701166360 -0400 @@ -275,6 +275,7 @@ struct fuse_args_pages { struct page **pages; struct fuse_page_desc *descs; unsigned int num_pages; + bool page_locked; }; #define FUSE_ARGS(args) struct fuse_args args = {} ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-21 20:12           ` Vivek Goyal
@ 2020-10-28 20:29           ` Miklos Szeredi
  2021-02-09 10:01             ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-10-28 20:29 UTC (permalink / raw)
To: Vivek Goyal
Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

On Wed, Oct 21, 2020 at 10:12 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> D. If one does a partial page write of a page which is not uptodate,
>    then keep the page locked and do not try to send multiple pages in
>    that write. If the page is uptodate, then release the page lock and
>    continue to add more pages to the same request.
>
>    IOW, if the head page is partial (and it is not uptodate), we will
>    just send the first WRITE with the head page. The rest of the pages
>    will go in a second WRITE, and the tail page could be locked if it
>    was a partial write and the page was not uptodate. Please have a
>    look at the attached patch.

Looks good.

> I still have some concerns, though, with error handling. Not sure what
> to do about it.
>
> 1. What happens if WRITE fails? If we are writing a full page and we
>    already marked it as Uptodate (and not dirty), then we have a page
>    in the page cache where we wrote data but could not send it to disk
>    (and did not mark it dirty either). So if a user reads the page
>    back, it might get the cached copy or the old copy from disk (if
>    the page cache copy was released).

AFAICS this is what happens on write failure in the current code IF the
page was uptodate previously.  Moving the SetPageUptodate() before the
WRITE makes this happen in all cases.

On write failure the page's uptodate flag should be cleared, which
should partially solve the above issue.  There still remains a window
where a concurrent read or load would get the wrong data, but I don't
think anybody cares (the same happens with buffered writes).

> 2. What happens if it is a partial page write to an Uptodate page in
>    cache and that WRITE fails? Now we have the same error scenario as
>    1. In fact this is true even for the current code, so it is not
>    necessarily a new scenario.

Same as above: need to clear the uptodate flag.

> 3. The current code marks a page Uptodate upon WRITE completion if it
>    was a full page WRITE. What if the page was uptodate to begin with
>    and the write fails? The current code will not mark it Uptodate,
>    but it is already uptodate and we have the same problem as 1.

Yes.

> Apart from the above, there are some other concerns as well.
>
> So with this patch, if a page is Uptodate we drop the lock and send
> the WRITE. Otherwise we keep the page lock and send the WRITE. This
> should probably be fine from the read or fault-read point of view.
> Given we are holding the inode lock, the write path is not a problem
> either. But:
>
> What if page is redirtied through a write mapping
> -------------------------------------------------
> If a page is redirtied through a writable mmap, then two writes for
> the same page can go in any order. But in a synchronous write we are
> carrying a pointer to the page cache page, so it probably does not
> matter. We will just write the same data twice.

It's not that simple.  Data dirtied through an mmap will be written
back using a temporary page.  So a WRITE request with such a page can
be in flight while a write(2) triggers a synchronous WRITE request for
the same data.  The two WRITEs are not ordered in any way, and in fact
the page lock doesn't help, so this is not a new issue.

OTOH this does not appear to be a problem in real life, since without
msync() it is not guaranteed that the memory mapping is synchronized
with the backing file (Linux has stronger guarantees, but test suites
such as fsx assume the lesser guarantees of POSIX).

This could be fixed to conform to the stronger coherency guarantee by
calling fuse_wait_on_page_writeback() after having gotten a locked
page in the synchronous writeback path.

> What about races with direct_IO read
> ------------------------------------
> If a WRITE is in progress, the page is probably not marked dirty, so
> generic_file_read_iter() will probably not block in
> filemap_write_and_wait_range() and will continue to
> mapping->a_ops->direct_IO(). And that means it can read previous disk
> data before this WRITE is complete.

Synchronous write vs. direct IO read shouldn't be a problem as long as
the server provides a coherent view of the file to both.

Thanks,
Miklos
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-28 20:29             ` Miklos Szeredi
@ 2021-02-09 10:01               ` Miklos Szeredi
  2021-02-09 19:09                 ` Vivek Goyal
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2021-02-09 10:01 UTC (permalink / raw)
To: Vivek Goyal
Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

Hi Vivek,

Here's an updated patch with a header comment containing all the gory
details.  It's basically your patch, but it's missing your S-o-b.  If
you are fine with it, can you please provide one?

The only change I made was to clear PG_uptodate on write error;
otherwise I think the patch is fine.  Splitting the request might cause
a performance regression in some cases (lots of unaligned writes that
spill over a page boundary), but I don't have a better idea at this
moment.

Thanks,
Miklos

----
Date: Wed, 21 Oct 2020 16:12:49 -0400
From: Vivek Goyal <vgoyal@redhat.com>
Subject: fuse: fix write deadlock

There are two modes for write(2) and friends in fuse:

a) write through (update page cache, send sync WRITE request to
   userspace)

b) buffered write (update page cache, async writeout later)

The write-through method kept all the page cache pages locked that were
used for the request.  Keeping more than one page locked is deadlock
prone, and Qian Cai demonstrated this with trinity fuzzing.

The reason for keeping the pages locked is that concurrent mapped reads
shouldn't try to pull possibly stale data into the page cache.

For full page writes, the easy way to fix this is to make the cached
page be the authoritative source by marking the page PG_uptodate
immediately.  After this the page can be safely unlocked, since
mapped/cached reads will take the written data from the cache.

Concurrent mapped writes will now cause data in the original WRITE
request to be updated; this however doesn't cause any data
inconsistency, and this scenario should be exceedingly rare anyway.

If the WRITE request returns with an error in the above case, currently
the page is not marked uptodate; this means that a concurrent read will
always read consistent data.  After this patch the page is uptodate
between writing to the cache and receiving the error: there's a window
where a cached read will read the wrong data.  While theoretically this
could be a regression, it is unlikely to be one in practice, since this
is normal for buffered writes.

In case of a partial page write to an already uptodate page the locking
is also unnecessary, with the above caveats.

A partial write of a not-uptodate page still needs to be handled.  One
way would be to read the complete page before doing the write.  This is
not possible, since it might break filesystems that don't expect any
READ requests when the file was opened O_WRONLY.

The other solution is to serialize the synchronous write with reads
from the partial pages.  The easiest way to do this is to keep the
partial pages locked.  The problem is that a write() may involve two
such pages (one head and one tail).  This patch fixes it by only
locking the partial tail page.  If there's a partial head page as well,
then split that off as a separate WRITE request.

Reported-by: Qian Cai <cai@lca.pw>
Fixes: ea9b9907b82a ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org> # v2.6.26
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/file.c   | 25 +++++++++++++++----------
 fs/fuse/fuse_i.h |  1 +
 2 files changed, 16 insertions(+), 10 deletions(-)

--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1117,17 +1117,12 @@ static ssize_t fuse_send_write_pages(str
 	count = ia->write.out.size;
 	for (i = 0; i < ap->num_pages; i++) {
 		struct page *page = ap->pages[i];
+		bool page_locked = ap->page_locked && (i == ap->num_pages - 1);

-		if (!err && !offset && count >= PAGE_SIZE)
-			SetPageUptodate(page);
-
-		if (count > PAGE_SIZE - offset)
-			count -= PAGE_SIZE - offset;
-		else
-			count = 0;
-		offset = 0;
-
-		unlock_page(page);
+		if (err)
+			ClearPageUptodate(page);
+		if (page_locked)
+			unlock_page(page);
 		put_page(page);
 	}

@@ -1191,6 +1186,16 @@ static ssize_t fuse_fill_write_pages(str
 		if (offset == PAGE_SIZE)
 			offset = 0;

+		/* If we copied full page, mark it uptodate */
+		if (tmp == PAGE_SIZE)
+			SetPageUptodate(page);
+
+		if (PageUptodate(page)) {
+			unlock_page(page);
+		} else {
+			ap->page_locked = true;
+			break;
+		}
 		if (!fc->big_writes)
 			break;
 	} while (iov_iter_count(ii) && count < fc->max_write &&
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -277,6 +277,7 @@ struct fuse_args_pages {
 	struct page **pages;
 	struct fuse_page_desc *descs;
 	unsigned int num_pages;
+	bool page_locked;
 };

 #define FUSE_ARGS(args) struct fuse_args args = {}
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2021-02-09 10:01               ` Miklos Szeredi
@ 2021-02-09 19:09                 ` Vivek Goyal
  0 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2021-02-09 19:09 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

On Tue, Feb 09, 2021 at 11:01:15AM +0100, Miklos Szeredi wrote:
> Hi Vivek,
>
> Here's an updated patch with a header comment containing all the gory
> details.  It's basically your patch, but it's missing your S-o-b.  If
> you are fine with it, can you please provide one?
>
> The only change I made was to clear PG_uptodate on write error;
> otherwise I think the patch is fine.

Hi Miklos,

In general I am fine with the patch.  I am still a little concerned
about the error scenario and how best to handle it: whether to clear
PageUptodate on error or leave it alone.  Leaving it alone would be
more like the writeback failure scenario, where many filesystems don't
set the dirty flag on the page again if writeback fails.  I can't
decide whether leaving it alone is better, so for now I will just go
along with clearing PageUptodate on error.

Can you please also put a Link: tag to this mail thread in the commit
message?  There are quite a few good details in this thread, and it
will be good to be able to track them back from the commit.

Link: https://lore.kernel.org/linux-fsdevel/4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw/

With that:

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

> Splitting the request might cause a performance regression in some
> cases (lots of unaligned writes that spill over a page boundary), but
> I don't have a better idea at this moment.

[..]
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15 21:21       ` Linus Torvalds
  2020-10-16 10:02         ` Miklos Szeredi
@ 2020-10-16 18:19         ` Vivek Goyal
  2020-10-16 18:24           ` Linus Torvalds
  2020-10-16 23:03           ` Dave Chinner
  1 sibling, 2 replies; 25+ messages in thread
From: Vivek Goyal @ 2020-10-16 18:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

On Thu, Oct 15, 2020 at 02:21:58PM -0700, Linus Torvalds wrote:

[..]

> I don't know why fuse does multiple pages to begin with. Why can't it
> do whatever it does just one page at a time?

Sending multiple pages in a single WRITE command does seem to help a
lot with performance.  I modified the code to write only one page at a
time and ran a fio job with sequential writes (and random writes),
block size 64K, and compared the performance on virtiofs.

NAME              WORKLOAD          Bandwidth   IOPS
one-page-write    seqwrite-psync    58.3mb      933
multi-page-write  seqwrite-psync    265.7mb     4251

one-page-write    randwrite-psync   53.5mb      856
multi-page-write  randwrite-psync   315.5mb     5047

So with multi-page writes, performance seems much better for this
particular workload.

Thanks
Vivek
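[The job file itself isn't included in the thread; a fio job along
these lines, with hypothetical paths and sizes, would exercise the two
workloads named in the table above.]

```ini
; sketch of a possible job file, not the one actually used
[global]
ioengine=psync          ; plain write(2) syscalls
bs=64k
size=1g
directory=/mnt/virtiofs ; hypothetical virtiofs mount point

[seqwrite-psync]
rw=write

[randwrite-psync]
rw=randwrite
stonewall               ; run after the sequential job completes
```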
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:19         ` Vivek Goyal
@ 2020-10-16 18:24           ` Linus Torvalds
  2020-10-16 18:24             ` Linus Torvalds
  2020-10-16 23:03           ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2020-10-16 18:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

On Fri, Oct 16, 2020 at 11:19 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> So with multi page writes performance seems much better for this
> particular workload.

Looking at that code, I don't see why the page needs to be unlocked
after it's been filled, so I do think the easy solution is to just
always unlock the pages.  Then the rest of the multi-page code is
probably fine.

But hey, who knows.  I might be missing something.

Linus
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:24           ` Linus Torvalds
@ 2020-10-16 18:24             ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2020-10-16 18:24 UTC (permalink / raw)
To: Vivek Goyal
Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
    Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
    Amir Goldstein

On Fri, Oct 16, 2020 at 11:24 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking at that code, I don't see why the page needs to be unlocked [..]

That "unlocked" should be "locked", of course.

Linus
* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:19         ` Vivek Goyal
  2020-10-16 18:24           ` Linus Torvalds
@ 2020-10-16 23:03           ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-10-16 23:03 UTC (permalink / raw)
To: Vivek Goyal
Cc: Linus Torvalds, Miklos Szeredi, Qian Cai, Hugh Dickins,
    Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton,
    linux-fsdevel, Amir Goldstein

On Fri, Oct 16, 2020 at 02:19:08PM -0400, Vivek Goyal wrote:
> On Thu, Oct 15, 2020 at 02:21:58PM -0700, Linus Torvalds wrote:
>
> [..]
>
> > I don't know why fuse does multiple pages to begin with. Why can't
> > it do whatever it does just one page at a time?
>
> Sending multiple pages in a single WRITE command does seem to help a
> lot with performance. I modified the code to write only one page at a
> time and ran a fio job with sequential writes (and random writes),
> block size 64K, and compared the performance on virtiofs.
>
> NAME              WORKLOAD          Bandwidth   IOPS
> one-page-write    seqwrite-psync    58.3mb      933
> multi-page-write  seqwrite-psync    265.7mb     4251
>
> one-page-write    randwrite-psync   53.5mb      856
> multi-page-write  randwrite-psync   315.5mb     5047
>
> So with multi-page writes performance seems much better for this
> particular workload.

Huh.  This is essentially the problem the iomap buffered write path was
designed to solve.  Filesystems like gfs2 got similar major
improvements in large buffered write throughput when switching to use
iomap for buffered IO....

Essentially, it works by having iomap_apply() first ask the filesystem
to map the IO range, then it iterates the page cache across the IO
range performing the desired operation (iomap_write_actor() in the case
of a buffered write), then it tells the filesystem how much of the
original range it completed copying into the cache.

Hence the filesystem only does one mapping/completion operation per
contiguous IO range instead of once per dirtied page, and the inner
loop just locks a page at a time as it works over the range.  Pages are
marked uptodate+dirty as the user data is copied into them, not when
the entire IO is completely written.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2021-02-09 19:09 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds
2020-10-13 20:03 ` Linus Torvalds
2020-10-14 13:05   ` Kirill A. Shutemov
2020-10-14 16:53     ` Linus Torvalds
2020-10-14 18:15       ` Matthew Wilcox
2020-10-15 10:41         ` Kirill A. Shutemov
2020-10-15  9:43       ` Kirill A. Shutemov
2020-10-15 16:44         ` Linus Torvalds
2020-10-14  5:50 ` Hugh Dickins
[not found] ` <4794a3fa3742a5e84fb0f934944204b55730829b.camel@lca.pw>
2020-10-15  2:44   ` Linus Torvalds
2020-10-15 15:16     ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal
2020-10-15 19:55       ` Vivek Goyal
2020-10-15 21:21         ` Linus Torvalds
2020-10-16 10:02           ` Miklos Szeredi
2020-10-16 12:27             ` Matthew Wilcox
2020-10-20 20:42               ` Vivek Goyal
2020-10-21  7:40                 ` Miklos Szeredi
2020-10-21 20:12                   ` Vivek Goyal
2020-10-28 20:29                     ` Miklos Szeredi
2021-02-09 10:01                       ` Miklos Szeredi
2021-02-09 19:09                         ` Vivek Goyal
2020-10-16 18:19         ` Vivek Goyal
2020-10-16 18:24           ` Linus Torvalds
2020-10-16 18:24             ` Linus Torvalds
2020-10-16 23:03           ` Dave Chinner