Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 0/4] Some more lock_page work..
@ 2020-10-13 19:59 Linus Torvalds
  2020-10-13 20:03 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-10-13 19:59 UTC (permalink / raw)
  To: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov
  Cc: Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein

So this is all just preliminary, but I'd really like to have people
think more about the page fault handling of the page lock, and I have
a small experimental series of patches for people to look at and maybe
get the discussion started.

The first three patches are really just fairly trivial cleanups. They
also likely don't really matter, because the *bulk* of all faults -
particularly the ones that really shouldn't need any page locking
games - should be all about just "filemap_map_pages()". Which is that
optimistic "let's insert pages from the page cache as efficiently as
possible" case.

That's how all the normal private pages that don't actually get
modified (so executables, but also any load that uses mmap as a
zero-copy read) should generally get populated.

That code doesn't actually do "lock_page()" itself (because it all
runs under the RCU read lock), but it does do a trylock, and gives
up if the page was locked. Which is fine as long as you don't get any
contention, but any concurrent faults of the same page in different
address spaces will then just mess with other faulters and cause it to
fall out of the fast path.

And looking at that code, I'm pretty sure it doesn't actually *need*
the page lock. It wants the page lock for two reasons:

 - the truncation worries (which may or may not be relevant - xfs
wraps the map_pages with xfs_ilock)

 - compound page worries (sub-page mapcount updates and page splitting issues)

The compound page case I'm not sure about, but it's probably fine to
just lock the page in that case - once we end up actually just mapping
a hugepage, the number of page faults should be small enough that it
probably doesn't matter.

The truncation thing could be handled like xfs does, but honestly, I
think it can equally well be handled by just doing some operations in
the right order, and double-checking that we don't race with truncate.
IOW, first increasing the page mapcount, and then re-checking that the
page still isn't locked and the mapping is still valid, and reachable
in the xarray.

Because we can still just drop out of this loop and not populate the
page table if we see anything odd going on, but if *this* code doesn't
bother getting the page lock (and we make the COW code not do it
either), then in all the normal cases you will never have that "fall
out of the common case".

IOW, I think right now the thing that makes us fall back to the actual
page lock is this code itself: by doing the trylock, it will make
other users of the same page not able to do the fast-case. And I think
the trylock is unnecessary.

ANYWAY. The patch I have here isn't actually that "just do the checks
in the right order" patch. No, it's a dirty nasty "a private mapping
doesn't need to be so careful" patch. Ugly, brutish and nasty. Not the
right thing at all. But I'm doing it here simply because I wanted to
test it out and get people to look at this.

This code is "tested" in the sense that it builds for me, and I'm
actually running it right now. But I haven't actually stress-tested it
or tried to see if it gets rid of some page lock heavy cases.

Comments?

                         Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds
@ 2020-10-13 20:03 ` Linus Torvalds
  2020-10-14 13:05   ` Kirill A. Shutemov
  2020-10-14  5:50 ` Hugh Dickins
  2020-10-15  1:48 ` Qian Cai
  2 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-10-13 20:03 UTC (permalink / raw)
  To: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov
  Cc: Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein


[-- Attachment #1: Type: text/plain, Size: 362 bytes --]

On Tue, Oct 13, 2020 at 12:59 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Comments?

Gaah. My alpine gmail setup has gotten broken by more fancy gmail
security features, so sending the actual patches that way broke down.

So here they are as attachments instead. I'll fix my alpine
configuration after the merge window.

                  Linus

[-- Attachment #2: 0001-mm-move-final-page-locking-out-of-__do_fault-helper-.patch --]
[-- Type: text/x-patch, Size: 2744 bytes --]

From c6f074f1758e233965495f589863b75ab0e1609d Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 10:22:00 -0700
Subject: [PATCH 1/4] mm: move final page locking out of __do_fault() helper
 into callers

The old semantics of our __do_fault() helper was that it always locked
the page unless there was an error (or unless the faulting had already
handled a COW event).

That turns out to be a mistake.  Not all callers actually want the page
locked at all, and they might as well check the same VM_FAULT_LOCKED bit
that __do_fault() itself checked to see whether the page is already
locked or not.

This change only moves that final page locking out into the callers, but
intentionally does not actually change any of the locking semantics: the
callers will now just do that final page locking themselves instead.

That means that future patches may then decide to not lock the page
after all, but this is just preparation for any such future change.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..b4a7b81dcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3616,11 +3616,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -4000,6 +3995,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -4031,6 +4031,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -4054,6 +4059,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
-- 
2.28.0.218.gc12ef3d349


[-- Attachment #3: 0002-mm-don-t-lock-the-page-only-to-immediately-unlock-it.patch --]
[-- Type: text/x-patch, Size: 2276 bytes --]

From dcc752881236fa914d0286da71e46b31956aa0dc Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 10:43:05 -0700
Subject: [PATCH 2/4] mm: don't lock the page, only to immediately unlock it
 again for do_page_mkwrite()

Our page locking during fault handling is a bit messy, and the shared fault
code in particular was locking the page only to immediately unlock it
again because do_page_mkwrite() wanted it unlocked.

We keep the "did we lock it" state around in the VM_FAULT_LOCKED bit, so
let's just use that knowledge, and not first lock it if it wasn't
locked, only to then unlock it again.

It would be even better to transfer the "did we already lock this page"
information into do_page_mkwrite(), because that function will actually
want to lock it eventually anyway, but let's just clean up one thing at
a time.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b4a7b81dcc7a..5c93b4bec063 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,25 +4059,33 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
 	if (vma->vm_ops->page_mkwrite) {
-		unlock_page(vmf->page);
+		/* do_page_mkwrite() wants the page unlocked */
+		if (ret & VM_FAULT_LOCKED) {
+			unlock_page(vmf->page);
+			ret &= ~VM_FAULT_LOCKED;
+		}
+
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
+
+		/* Did do_page_mkwrite() lock the page again? */
+		ret |= tmp & VM_FAULT_LOCKED;
 	}
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 					VM_FAULT_RETRY))) {
-- 
2.28.0.218.gc12ef3d349


[-- Attachment #4: 0003-mm-do_cow_fault-does-not-need-the-source-page-to-be-.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]

From 969e62f9784dcd3083ba3fe32b87bdea8319aba9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 11:00:33 -0700
Subject: [PATCH 3/4] mm: do_cow_fault() does not need the source page to be
 locked

This removes the "lock if it wasn't locked" logic from do_cow_fault(),
since we're not even going to install that page into the destination
address space (finish_fault() will use ->cow_page rather than ->page),
and copying the source page does not need the source to be locked.

So instead of doing "lock if it wasn't locked" followed by an
unconditional unlock of the page, just do "unlock if it was locked".

Of course, since all the normal file mapping ->fault() handlers
currently lock the page they return (see filemap_fault() for details),
all of this is pretty much theoretical.

But this is the right thing to do - making sure we hold the page lock
when we really don't is just confusing and wrong.  And this prepares the
way for any future changes to filemap_fault() where we go "Oh, we
actually _don't_ need to lock the page for this case at all".

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5c93b4bec063..d4d32d0c33c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4031,16 +4031,13 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
+	if (ret & VM_FAULT_LOCKED)
+		unlock_page(vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
-- 
2.28.0.218.gc12ef3d349


[-- Attachment #5: 0004-mm-make-filemap_map_pages-avoid-the-page-lock-if-pos.patch --]
[-- Type: text/x-patch, Size: 3529 bytes --]

From 107014f091622dcb411c0ae38c99a95704a62f3f Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 12:03:40 -0700
Subject: [PATCH 4/4] mm: make filemap_map_pages() avoid the page lock if
 possible

Private mappings don't need to be 100% serialized with file truncation
etc, and act more like an optimized "read()" call.  So we can avoid
taking the page lock for them.

NOTE! This is a trial patch.  I'm not entirely happy about this, because
I think we can avoid the page lock for shared mappings too, by just
changing the order in which we do some of the checks.

In particular, once we have the page table lock (which we need anyway),
we could increment the page mapping count.  That - together with
re-checking that the page still isn't locked - should be a sufficient
guarantee that nobody has finished truncating that page yet, and any
future truncation will end up being serialized on the page table lock.

The compound page case probably needs some thinking about too.

Needs-review-by: Matthew Wilcox <willy@infradead.org>
Needs-review-by: Kirill A. Shutemov <kirill@shutemov.name>
Needs-review-by: Hugh Dickins <hughd@google.com>
Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/filemap.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 748b7b1b4f6d..6accb7905a36 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2788,6 +2788,7 @@ EXPORT_SYMBOL(filemap_fault);
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
+	bool shared = vmf->vma->vm_flags & VM_SHARED;
 	struct file *file = vmf->vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
@@ -2798,6 +2799,8 @@ void filemap_map_pages(struct vm_fault *vmf,
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_pgoff) {
+		bool locked = false;
+
 		if (xas_retry(&xas, page))
 			continue;
 		if (xa_is_value(page))
@@ -2815,15 +2818,40 @@ void filemap_map_pages(struct vm_fault *vmf,
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto skip;
+		/*
+		 * also recheck the page lock after getting the reference,
+		 * so that any page lockers will have seen us incrementing
+		 * it or not see us at all.
+		 */
+		if (PageLocked(page))
+			goto skip;
+
 		page = find_subpage(page, xas.xa_index);
 
 		if (!PageUptodate(page) ||
 				PageReadahead(page) ||
 				PageHWPoison(page))
 			goto skip;
-		if (!trylock_page(page))
-			goto skip;
 
+		/*
+		 * We only need to be really careful about races
+		 * with truncate etc for shared mappings.
+		 *
+		 * But we also need to lock the page for compound
+		 * pages (see alloc_set_pte -> page_add_file_rmap).
+		 */
+		if (shared || PageTransCompound(page)) {
+			if (!trylock_page(page))
+				goto skip;
+			locked = true;
+		}
+
+		/*
+		 * Even if we don't get the page lock, we'll re-check
+		 * the page mapping and the mapping size.
+		 *
+		 * It won't hurt, even if it's racy.
+		 */
 		if (page->mapping != mapping || !PageUptodate(page))
 			goto unlock;
 
@@ -2840,10 +2868,12 @@ void filemap_map_pages(struct vm_fault *vmf,
 		last_pgoff = xas.xa_index;
 		if (alloc_set_pte(vmf, page))
 			goto unlock;
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 		goto next;
 unlock:
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 skip:
 		put_page(page);
 next:
-- 
2.28.0.218.gc12ef3d349



* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds
  2020-10-13 20:03 ` Linus Torvalds
@ 2020-10-14  5:50 ` Hugh Dickins
  2020-10-15  1:48 ` Qian Cai
  2 siblings, 0 replies; 23+ messages in thread
From: Hugh Dickins @ 2020-10-14  5:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM,
	Andrew Morton, linux-fsdevel, Amir Goldstein

On Tue, 13 Oct 2020, Linus Torvalds wrote:

> [...]
> 
> Comments?

I haven't even read a word you wrote yet (okay, apart from "Comments?"),
nor studied the patches; but have put them under my usual load, and the
only adjustment I've needed so far is

--- 5.9.0/mm/khugepaged.c	2020-10-11 14:15:50.000000000 -0700
+++ linux/mm/khugepaged.c	2020-10-13 21:44:26.000000000 -0700
@@ -1814,7 +1814,7 @@ static void collapse_file(struct mm_stru
 		xas_set(&xas, index);
 
 		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
-		VM_BUG_ON_PAGE(page_mapped(page), page);
+//		VM_BUG_ON_PAGE(page_mapped(page), page);
 
 		/*
 		 * The page is expected to have page_count() == 3:

But it's going to take a lot of diligence to get confident with them:
I have no grip on all the places in which we do assume page lock held.

The place to start looking will be 2.6.23's git history, in which
Nick IIRC introduced VM_FAULT_LOCKED: I thought it was an essential
step on the way to speculative page cache, but may be misremembering.

Or maybe when I actually read what you've said, I'll find that you
have already done that research.

Hugh


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-13 20:03 ` Linus Torvalds
@ 2020-10-14 13:05   ` Kirill A. Shutemov
  2020-10-14 16:53     ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2020-10-14 13:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Tue, Oct 13, 2020 at 01:03:30PM -0700, Linus Torvalds wrote:
> On Tue, Oct 13, 2020 at 12:59 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Comments?

So we remove strict serialization against truncation in
filemap_map_pages() if the mapping is private.

IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for
such mappings. The "page->mapping != mapping" check makes the race window
smaller, but doesn't remove it.

I'm not sure all codepaths are fine with this. For instance, looks like
migration will back off such pages: __unmap_and_move() doesn't know how to
deal with mapped pages with ->mapping == NULL.
Yes, it is not a crash, but still...

Do I miss something?

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-14 13:05   ` Kirill A. Shutemov
@ 2020-10-14 16:53     ` Linus Torvalds
  2020-10-14 18:15       ` Matthew Wilcox
  2020-10-15  9:43       ` Kirill A. Shutemov
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-10-14 16:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> So we remove strict serialization against truncation in
> filemap_map_pages() if the mapping is private.

Well, that particular patch does.

But as mentioned, that's the wrong way of doing it. It's just that the
right way - which keeps the serialization - requires some changes to
how we actually install the pte.

Right now that goes through alloc_set_pte(), and that one is basically
written with the page lock in mind. So the current logic is

 - lock page

 - check page->mapping etc that is now stable

 - lock page tables if required (alloc_set_pte -> pte_alloc_one_map ->
pte_offset_map_lock)

 - insert pte

but that whole alloc_set_pte() thing is actually much too generic for
what "map_pages()" really wants, and I think we could re-organize it.

In particular, what I _think_ we could do is:

 - lock the page tables

 - check that the page isn't locked

 - increment the page mapcount (atomic and ordered)

 - check that the page still isn't locked

 - insert pte

without taking the page lock. And the reason that's safe is that *if*
we're racing with something that is about to remove the page mapping,
then *that* code will

 (a) hold the page lock

 (b) before it removes the mapping, it has to remove it from any
existing mappings, which involves checking the mapcount and going off
and getting the page table locks to remove any ptes.

but my patch did *not* do that, because you have to re-organize things a bit.

Note that the above still keeps the "optimistic" model - if the page
is locked, we'll skip that pte, and we'll end up doing the heavy thing
(which will then take the pte lock if required). Which is why we
basically get away with that "only test page lock, don't take it"
approach with some ordering.

But my theory is that the actual truncate case is *so* rare that we
basically don't need to even worry about it, and that once the
(common) page fault case doesn't need the page lock, then that whole
"fall back to slow case" simply doesn't happen in practice.

> IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for
> such mappings. The "page->mapping != mapping" check makes the race window
> smaller, but doesn't remove it.

Yes, that patch I posted as an RFC does exactly that. But I think we
can keep the serialization if we just make slightly more involved
changes.

And they aren't necessarily a _lot_ more involved. In fact, I think we
may already hold the page table lock due to doing that
"pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
think the only _real_ problem is that I think we increment the
page_mapcount() too late in alloc_set_pte().

And yeah, I realize this is a bit handwavy. But that's why I sent
these things out as an RFC. To see if people can shoot holes in this,
and maybe do that proper patch instead of my "let's get discussion
going" one.

Also, I really do think that our current "private mappings are
coherent with file changes" is a big QoI issue: it's the right thing
to do. So I wouldn't actually really want to take advantage of
"private mappings don't really _have_ to be coherent according to
POSIX". That's a lazy cop-out.

So I dislike my patch, and don't think it's really acceptable, but as
a "let's try this" I think it's a good first step.

> I'm not sure all codepaths are fine with this. For instance, looks like
> migration will back off such pages: __unmap_and_move() doesn't know how to
> deal with mapped pages with ->mapping == NULL.

So as outlined above, I think we can avoid the ->mapping == NULL case,
and we can remove all the races.

But it would still do new things, like increment the page mapcount
without holding the page lock. I think that's ok - we already
_decrement_ it without holding the lock, so it's not that the count is
stable without locking - but at a minimum we'd have that "need to make
sure the memory ordering between testing the page lock initially,
incrementing the count, and re-testing is ok". And maybe some code
that expects things to be stable.

So I'm not at all claiming that things are obviously fine. I'm happy
to see that Hugh started some practical stress-tests on that RFC
patch, and I think we can improve on the situation.

               Linus


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-14 16:53     ` Linus Torvalds
@ 2020-10-14 18:15       ` Matthew Wilcox
  2020-10-15 10:41         ` Kirill A. Shutemov
  2020-10-15  9:43       ` Kirill A. Shutemov
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2020-10-14 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Hugh Dickins, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote:
> In particular, what I _think_ we could do is:
> 
>  - lock the page tables
> 
>  - check that the page isn't locked
> 
>  - increment the page mapcount (atomic and ordered)
> 
>  - check that the page still isn't locked
> 
>  - insert pte
> 
> without taking the page lock. And the reason that's safe is that *if*
[...]
> And they aren't necessarily a _lot_ more involved. In fact, I think we
> may already hold the page table lock due to doing that
> "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
> think the only _real_ problem is that I think we increment the
> page_mapcount() too late in alloc_set_pte().

I'm not entirely sure why we require the page lock to be held in
page_add_file_rmap():

        } else {
                if (PageTransCompound(page) && page_mapping(page)) {
                        VM_WARN_ON_ONCE(!PageLocked(page));

                        SetPageDoubleMap(compound_head(page));
                        if (PageMlocked(page))
                                clear_page_mlock(compound_head(page));
                }

We have a reference to the page, so compound_head() isn't going
to change.  SetPageDoubleMap() is atomic.  PageMlocked() is atomic.
clear_page_mlock() does TestClearPageMlocked() as its first thing,
so that's atomic too.  What am I missing?  (Kirill added it, so I
assume he remembers ;-)


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds
  2020-10-13 20:03 ` Linus Torvalds
  2020-10-14  5:50 ` Hugh Dickins
@ 2020-10-15  1:48 ` Qian Cai
  2020-10-15  2:44   ` Linus Torvalds
  2 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2020-10-15  1:48 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov
  Cc: Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein, Vivek Goyal

On Tue, 2020-10-13 at 12:59 -0700, Linus Torvalds wrote:
> [...]
> 
> Comments?
While on this topic, I just want to bring up a bug report we are chasing,
where a process is stuck in the loop of wait_on_page_bit_common() for more
than 10 minutes before I gave up. Vivek did some instrumentation of the
code and can confirm that there are no pending requests at the virtiofs
level. IOW, all requests that have been issued have completed, so this
sounds more like a case of a missed wakeup or some race in the kernel.
This can be reproduced on both mainline and linux-next by running syscall
fuzzing on virtiofs as an unprivileged user. I also tried to revert two
recent commits from Linus that touched wait_on_page_bit_common(), but
that did not help.

The whole soft-lockups traces and sysrq-t can be found at:

https://caiqian.fedorapeople.org/files/dmesg.txt

[10813.290694] task:trinity-c33     state:D stack:25888 pid:254219 ppid: 87180 flags:0x00004004
[10813.292671] Call Trace:
[10813.293379]  __schedule+0x71d/0x1b50
[10813.294182]  ? __sched_text_start+0x8/0x8
[10813.295146]  ? mark_held_locks+0xb0/0x110
[10813.296117]  schedule+0xbf/0x270
[10813.296782]  ? __lock_page_killable+0x276/0x830
[10813.297867]  io_schedule+0x17/0x60
[10813.298772]  __lock_page_killable+0x33b/0x830
wait_on_page_bit_common at mm/filemap.c:1262
(inlined by) __lock_page_killable at mm/filemap.c:1540
[10813.299695]  ? wait_on_page_bit+0x710/0x710
[10813.300609]  ? __lock_page_or_retry+0x3c0/0x3c0
[10813.301894]  ? up_read+0x1a3/0x730
[10813.302791]  ? page_cache_free_page.isra.45+0x390/0x390
[10813.304077]  filemap_fault+0x2bd/0x2040
lock_page_maybe_drop_mmap at mm/filemap.c:2562
(inlined by) filemap_fault at mm/filemap.c:2727
[10813.305019]  ? read_cache_page_gfp+0x10/0x10
[10813.306041]  ? lock_downgrade+0x700/0x700
[10813.306958]  ? replace_page_cache_page+0x1130/0x1130
[10813.308124]  __do_fault+0xf5/0x530
[10813.308968]  handle_mm_fault+0x1c0e/0x25b0
[10813.309955]  ? copy_page_range+0xfe0/0xfe0
[10813.310895]  do_user_addr_fault+0x383/0x820
[10813.312084]  exc_page_fault+0x56/0xb0
[10813.312979]  asm_exc_page_fault+0x1e/0x30
[10813.313978] RIP: 0010:iov_iter_fault_in_readable+0x271/0x350
fault_in_pages_readable at include/linux/pagemap.h:745
(inlined by) iov_iter_fault_in_readable at lib/iov_iter.c:438
[10813.315293] Code: 48 39 d7 0f 82 1a ff ff ff 0f 01 cb 0f ae e8 44 89 c0 8a 0a 0f 01 ca 88 4c 24 70 85 c0 74 da e9 f8 fe ff ff 0f 01 cb 0f ae e8 <8a> 11 0f 01 ca 88 54 24 30 85 c0 0f 85 04 ff ff ff 48 29 ee e9 45
[10813.319196] RSP: 0018:ffffc90017ccf830 EFLAGS: 00050246
[10813.320446] RAX: 0000000000000000 RBX: 1ffff92002f99f08 RCX: 00007fe284f1004c
[10813.322202] RDX: 0000000000000001 RSI: 0000000000001000 RDI: ffff8887a7664000
[10813.323729] RBP: 0000000000001000 R08: 0000000000000000 R09: 0000000000000000
[10813.325282] R10: ffffc90017ccfd48 R11: ffffed102789d5ff R12: ffff8887a7664020
[10813.326898] R13: ffffc90017ccfd40 R14: dffffc0000000000 R15: 0000000000e0df6a
[10813.328456]  ? iov_iter_revert+0x8e0/0x8e0
[10813.329404]  ? copyin+0x96/0xc0
[10813.330230]  ? iov_iter_copy_from_user_atomic+0x1f0/0xa40
[10813.331742]  fuse_perform_write+0x3eb/0xf20 [fuse]
fuse_fill_write_pages at fs/fuse/file.c:1150
(inlined by) fuse_perform_write at fs/fuse/file.c:1226
[10813.332880]  ? fuse_file_fallocate+0x5f0/0x5f0 [fuse]
[10813.334090]  fuse_file_write_iter+0x6b7/0x900 [fuse]
[10813.335191]  do_iter_readv_writev+0x42b/0x6d0
[10813.336161]  ? new_sync_write+0x610/0x610
[10813.337194]  do_iter_write+0x11f/0x5b0
[10813.338177]  ? __sb_start_write+0x229/0x2d0
[10813.339169]  vfs_writev+0x16d/0x2d0
[10813.339973]  ? vfs_iter_write+0xb0/0xb0
[10813.340950]  ? __fdget_pos+0x9c/0xb0
[10813.342039]  ? rcu_read_lock_sched_held+0x9c/0xd0
[10813.343120]  ? rcu_read_lock_bh_held+0xb0/0xb0
[10813.344104]  ? find_held_lock+0x33/0x1c0
[10813.345050]  do_writev+0xfb/0x1e0
[10813.345920]  ? vfs_writev+0x2d0/0x2d0
[10813.346802]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[10813.348026]  ? syscall_enter_from_user_mode+0x1c/0x50
[10813.349197]  do_syscall_64+0x33/0x40
[10813.350026]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-15  1:48 ` Qian Cai
@ 2020-10-15  2:44   ` Linus Torvalds
  2020-10-15 15:16     ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-10-15  2:44 UTC (permalink / raw)
  To: Qian Cai
  Cc: Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov, Linux-MM,
	Andrew Morton, linux-fsdevel, Amir Goldstein, Vivek Goyal

On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote:
>
> While on this topic, I just want to bring up a bug report that we are chasing an
> issue that a process is stuck in the loop of wait_on_page_bit_common() for more
> than 10 minutes before I gave up.

Judging by call trace, that looks like a deadlock rather than a missed wakeup.

The trace isn't reliable, but I find it suspicious that the call trace
just before the fault contains that
"iov_iter_copy_from_user_atomic()".

IOW, I think you're in fuse_fill_write_pages(), which has allocated
the page, locked it, and then it takes a page fault.

And the page fault waits on a page that is locked.

This is a classic deadlock.

The *intent* is that iov_iter_copy_from_user_atomic() returns zero,
and you retry without the page lock held.

HOWEVER.

That's not what fuse actually does. Fuse will do multiple pages, and
it will unlock only the _last_ page. It keeps the other pages locked,
and puts them in an array:

                ap->pages[ap->num_pages] = page;

And after the iov_iter_copy_from_user_atomic() fails, it does that
"unlock" and repeat.

But while the _last_ page was unlocked, the *previous* pages are still
locked in that array. Deadlock.

I really don't think this has anything at all to do with page locking,
and everything to do with fuse_fill_write_pages() having a deadlock if
the source of data is a mmap of one of the pages it is trying to write
to (just with an offset, so that it's not the last page).

See a similar code sequence in generic_perform_write(), but notice how
that code only has *one* page that it locks, and never holds an array
of pages around over that iov_iter_fault_in_readable() thing.

              Linus


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-14 16:53     ` Linus Torvalds
  2020-10-14 18:15       ` Matthew Wilcox
@ 2020-10-15  9:43       ` Kirill A. Shutemov
  2020-10-15 16:44         ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2020-10-15  9:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote:
> In particular, what I _think_ we could do is:
> 
>  - lock the page tables
> 
>  - check that the page isn't locked
> 
>  - increment the page mapcount (atomic and ordered)
> 
>  - check that the page still isn't locked
> 
>  - insert pte
> 
> without taking the page lock. And the reason that's safe is that *if*
> we're racing with something that is about the remove the page mapping,
> then *that* code will
> 
>  (a) hold the page lock
> 
>  (b) before it removes the mapping, it has to remove it from any
> existing mappings, which involves checking the mapcount and going off
> and getting the page table locks to remove any ptes.
> 
> but my patch did *not* do that, because you have to re-organize things a bit.

Okay, I see what you propose.

But I don't think it addresses the race with try_to_unmap():

	CPU0				CPU1
filemap_map_pages()
  take ptl
  PageLocked() == false
 				lock_page()
				try_to_unmap()
				  rwc->done()
				    page_mapcount_is_zero() == true
				  rwc->done() == true, skip full rmap walk
				  never take ptl taken by CPU0
				try_to_unmap() == true
				...
				unlock_page()
  increment mapcount
  PageLocked() == false
  insert PTE

Are we willing to give up rwc->done() optimization?
Or do I miss some other serialization point?

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-14 18:15       ` Matthew Wilcox
@ 2020-10-15 10:41         ` Kirill A. Shutemov
  0 siblings, 0 replies; 23+ messages in thread
From: Kirill A. Shutemov @ 2020-10-15 10:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Hugh Dickins, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Wed, Oct 14, 2020 at 07:15:09PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 14, 2020 at 09:53:35AM -0700, Linus Torvalds wrote:
> > In particular, what I _think_ we could do is:
> > 
> >  - lock the page tables
> > 
> >  - check that the page isn't locked
> > 
> >  - increment the page mapcount (atomic and ordered)
> > 
> >  - check that the page still isn't locked
> > 
> >  - insert pte
> > 
> > without taking the page lock. And the reason that's safe is that *if*
> [...]
> > And they aren't necessarily a _lot_ more involved. In fact, I think we
> > may already hold the page table lock due to doing that
> > "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
> > think the only _real_ problem is that I think we increment the
> > page_mapcount() too late in alloc_set_pte().
> 
> I'm not entirely sure why we require the page lock to be held in
> page_add_file_rmap():
> 
>         } else {
>                 if (PageTransCompound(page) && page_mapping(page)) {
>                         VM_WARN_ON_ONCE(!PageLocked(page));
> 
>                         SetPageDoubleMap(compound_head(page));
>                         if (PageMlocked(page))
>                                 clear_page_mlock(compound_head(page));
>                 }
> 
> We have a reference to the page, so compound_head() isn't going
> to change.  SetPageDoubleMap() is atomic.  PageMlocked() is atomic.
> clear_page_mlock() does TestClearPageMlocked() as its first thing,
> so that's atomic too.  What am I missing?  (Kirill added it, so I
> assume he remembers ;-)

In general (with current scheme), we need page lock here to serialize
against try_to_unmap() and other rmap walkers.

For file THP, we also serialize against follow_trans_huge_pmd(FOLL_MLOCK).

-- 
 Kirill A. Shutemov


* Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15  2:44   ` Linus Torvalds
@ 2020-10-15 15:16     ` Vivek Goyal
  2020-10-15 19:55       ` Vivek Goyal
  0 siblings, 1 reply; 23+ messages in thread
From: Vivek Goyal @ 2020-10-15 15:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov,
	Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein,
	Miklos Szeredi

On Wed, Oct 14, 2020 at 07:44:16PM -0700, Linus Torvalds wrote:
> On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote:
> >
> > While on this topic, I just want to bring up a bug report that we are chasing an
> > issue that a process is stuck in the loop of wait_on_page_bit_common() for more
> > than 10 minutes before I gave up.
> 
> Judging by call trace, that looks like a deadlock rather than a missed wakeup.
> 
> The trace isn't reliable, but I find it suspicious that the call trace
> just before the fault contains that
> "iov_iter_copy_from_user_atomic()".
> 
> IOW, I think you're in fuse_fill_write_pages(), which has allocated
> the page, locked it, and then it takes a page fault.
> 
> And the page fault waits on a page that is locked.
> 
> This is a classic deadlock.
> 
> The *intent* is that iov_iter_copy_from_user_atomic() returns zero,
> and you retry without the page lock held.
> 
> HOWEVER.
> 
> That's not what fuse actually does. Fuse will do multiple pages, and
> it will unlock only the _last_ page. It keeps the other pages locked,
> and puts them in an array:
> 
>                 ap->pages[ap->num_pages] = page;
> 
> And after the iov_iter_copy_from_user_atomic() fails, it does that
> "unlock" and repeat.
> 
> But while the _last_ page was unlocked, the *previous* pages are still
> locked in that array. Deadlock.
> 
> I really don't think this has anything at all to do with page locking,
> and everything to do with fuse_fill_write_pages() having a deadlock if
> the source of data is a mmap of one of the pages it is trying to write
> to (just with an offset, so that it's not the last page).
> 
> See a similar code sequence in generic_perform_write(), but notice how
> that code only has *one* page that it locks, and never holds an array
> of pages around over that iov_iter_fault_in_readable() thing.

Indeed. This is a deadlock in fuse. Thanks for the analysis. I can now
trivially reproduce it with the following program.

Thanks
Vivek

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/uio.h>

int main(int argc, char *argv[])
{
	int fd, ret;
	void *map_addr;
	size_t map_length = 2 * 4096;
	char *buf_out = "Hello World";
	struct iovec iov[2];

	if (argc != 2 ) {
		printf("Usage:%s <file-to-write> \n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDWR | O_TRUNC);
	if (fd == -1) {
		fprintf(stderr, "Failed to open file %s:%s, errorno=%d\n", argv[1], strerror(errno), errno);
		exit(1);
	}

	map_addr = mmap(NULL, map_length, PROT_READ | PROT_WRITE, MAP_SHARED,
			fd, 0);
	if (map_addr == MAP_FAILED) {
		fprintf(stderr, "mmap failed %s, errorno=%d\n", strerror(errno),
			errno);
		exit(1);
	}

	/* Write first page and second page */
	pwrite(fd, buf_out, strlen(buf_out), 0);
	pwrite(fd, buf_out, strlen(buf_out), 4096);

	/* Copy from first page and then second page */
	iov[0].iov_base = map_addr;
	iov[0].iov_len = strlen(buf_out) + 1;

	iov[1].iov_base = map_addr + 4096;
	iov[1].iov_len = strlen(buf_out) + 1;

	/*
	 * Write second page offset (4K - 12), reading from first page and
	 * then second page. In first iteration we should fault in first
	 * page and lock second page. And in second iteration we should
	 * try fault in second page which is locked. Deadlock.
	 */
	ret = pwritev(fd, iov, 2, 4096 + 4096 - 12);
	printf("write() returned=%d\n", ret);
	munmap(map_addr, map_length);
	close(fd);
	return 0;
}



* Re: [PATCH 0/4] Some more lock_page work..
  2020-10-15  9:43       ` Kirill A. Shutemov
@ 2020-10-15 16:44         ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-10-15 16:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Thu, Oct 15, 2020 at 2:43 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Okay, I see what you propose.
>
> But I don't think it addresses race with try_to_unmap():

I don't think it needs to.

Remember: the map_pages() thing is called only for when the page
tables are empty.

So try_to_unmap() will never see the pte entry, and there is nothing
to race with.

So sure, it can "race" with try_to_unmap like you say, but who cares?
The "race" is no different from taking the page lock _after_
try_to_unmap() already ran (and didn't see anything because the page
hadn't been mapped yet).

IOW I don't think try_to_unmap() really matters. The race you outline
can already happen with the "trylock()" - no different from the
trylock just succeeding after the unlock_page().

So you can think of map_pages() as all happening after try_to_unmap()
has already succeeded - and didn't see the new pte that hasn't been
filled in yet.
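As a toy model (illustrative names, not the actual kernel code) of the proposed check/bump/re-check ordering, where the truncation side takes the page lock first and only then inspects the mapcount:

```python
class Page:
    def __init__(self):
        self.locked = False
        self.mapcount = 0

def try_map_page(page):
    if page.locked:          # first check: someone may be truncating
        return False
    page.mapcount += 1       # atomic, ordered increment in the real kernel
    if page.locked:          # re-check: truncate raced in after step 1
        page.mapcount -= 1   # back out, fall back to the slow path
        return False
    return True              # safe to insert the pte

p = Page()
print(try_map_page(p))       # -> True, mapcount is now 1

q = Page()
q.locked = True              # truncation holds the page lock
print(try_map_page(q))       # -> False, mapcount stays 0
```

Either the faulter sees the lock and backs off, or the lock holder sees the raised mapcount and goes off to remove the pte; neither side can miss the other.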

I do think there is a real race, but it is with "__remove_mapping()".

That still happens under the page lock, but it doesn't actually
_depend_ on the page lock as far as I can tell. Because I think the
real protection there is that

        if (!page_ref_freeze(page, refcount))
                goto cannot_free;

is that the code sees "oh, somebody else has a reference to the page, we
can't remove the mapping".

But it's entirely possible that I don't understand your worry, and I
overlooked something. If so, can you explain using smaller words,
please ;)

             Linus


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15 15:16     ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal
@ 2020-10-15 19:55       ` Vivek Goyal
  2020-10-15 21:21         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Vivek Goyal @ 2020-10-15 19:55 UTC (permalink / raw)
  To: Linus Torvalds, Miklos Szeredi
  Cc: Qian Cai, Hugh Dickins, Matthew Wilcox, Kirill A . Shutemov,
	Linux-MM, Andrew Morton, linux-fsdevel, Amir Goldstein

On Thu, Oct 15, 2020 at 11:16:06AM -0400, Vivek Goyal wrote:
> On Wed, Oct 14, 2020 at 07:44:16PM -0700, Linus Torvalds wrote:
> > On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@lca.pw> wrote:
> > >
> > > While on this topic, I just want to bring up a bug report that we are chasing an
> > > issue that a process is stuck in the loop of wait_on_page_bit_common() for more
> > > than 10 minutes before I gave up.
> > 
> > Judging by call trace, that looks like a deadlock rather than a missed wakeup.
> > 
> > The trace isn't reliable, but I find it suspicious that the call trace
> > just before the fault contains that
> > "iov_iter_copy_from_user_atomic()".
> > 
> > IOW, I think you're in fuse_fill_write_pages(), which has allocated
> > the page, locked it, and then it takes a page fault.
> > 
> > And the page fault waits on a page that is locked.
> > 
> > This is a classic deadlock.
> > 
> > The *intent* is that iov_iter_copy_from_user_atomic() returns zero,
> > and you retry without the page lock held.
> > 
> > HOWEVER.
> > 
> > That's not what fuse actually does. Fuse will do multiple pages, and
> > it will unlock only the _last_ page. It keeps the other pages locked,
> > and puts them in an array:
> > 
> >                 ap->pages[ap->num_pages] = page;
> > 
> > And after the iov_iter_copy_from_user_atomic() fails, it does that
> > "unlock" and repeat.
> > 
> > But while the _last_ page was unlocked, the *previous* pages are still
> > locked in that array. Deadlock.
> > 
> > I really don't think this has anything at all to do with page locking,
> > and everything to do with fuse_fill_write_pages() having a deadlock if
> > the source of data is a mmap of one of the pages it is trying to write
> > to (just with an offset, so that it's not the last page).
> > 
> > See a similar code sequence in generic_perform_write(), but notice how
> > that code only has *one* page that it locks, and never holds an array
> > of pages around over that iov_iter_fault_in_readable() thing.
> 
> Indeed. This is a deadlock in fuse. Thanks for the analysis. I can now
> trivially reproduce it with following program.

I am wondering how I should fix this issue. Is it enough that I drop
the page lock (but keep the reference) inside the loop, and once copying
from user space is done, acquire page locks for all pages? (A patch is
attached below.)

Or does dropping the page lock mean that there are no guarantees that
this page did not get written back and removed from the address space,
and a new page has been placed at the same offset? Does that mean I
should instead be looking up the page cache again after copying from
user space is done?

---
 fs/fuse/file.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/fuse/file.c
===================================================================
--- redhat-linux.orig/fs/fuse/file.c	2020-10-15 14:11:23.464720742 -0400
+++ redhat-linux/fs/fuse/file.c	2020-10-15 15:23:17.716569698 -0400
@@ -1117,7 +1117,7 @@ static ssize_t fuse_fill_write_pages(str
 	struct fuse_conn *fc = get_fuse_conn(mapping->host);
 	unsigned offset = pos & (PAGE_SIZE - 1);
 	size_t count = 0;
-	int err;
+	int err, i;
 
 	ap->args.in_pages = true;
 	ap->descs[0].offset = offset;
@@ -1155,6 +1155,14 @@ static ssize_t fuse_fill_write_pages(str
 			goto again;
 		}
 
+		/*
+		 * Unlock the page but retain a reference. Unlock is required,
+		 * otherwise the next loop iteration may try to fault in a
+		 * page (the source address) that we hold the lock on, and we
+		 * will deadlock. So drop the lock but keep a reference on the
+		 * page; re-acquire the page lock after breaking out of the loop.
+		 */
+		unlock_page(page);
 		err = 0;
 		ap->pages[ap->num_pages] = page;
 		ap->descs[ap->num_pages].length = tmp;
@@ -1171,7 +1179,15 @@ static ssize_t fuse_fill_write_pages(str
 	} while (iov_iter_count(ii) && count < fc->max_write &&
 		 ap->num_pages < max_pages && offset == 0);
 
-	return count > 0 ? count : err;
+	if (count <= 0)
+		return err;
+
+	for (i = 0; i < ap->num_pages; i++) {
+		/* Take page lock */
+		lock_page(ap->pages[i]);
+	}
+
+	return count;
 }
 
 static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,



* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15 19:55       ` Vivek Goyal
@ 2020-10-15 21:21         ` Linus Torvalds
  2020-10-16 10:02           ` Miklos Szeredi
  2020-10-16 18:19           ` Vivek Goyal
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-10-15 21:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> I am wondering how should I fix this issue. Is it enough that I drop
> the page lock (but keep the reference) inside the loop. And once copying
> from user space is done, acquire page locks for all pages (Attached
> a patch below).

What is the page lock supposed to protect?

Because whatever it protects, dropping the lock drops, and you'd need
to re-check whatever the page lock was there for.

> Or dropping page lock means that there are no guarantees that this
> page did not get written back and removed from address space and
> a new page has been placed at same offset. Does that mean I should
> instead be looking up page cache again after copying from user
> space is done.

I don't know why fuse does multiple pages to begin with. Why can't it
do whatever it does just one page at a time?

But yes, you probably should look the page up again whenever you've
unlocked it, because it might have been truncated or whatever.

Note that this is purely about unlocking the page, not about "after
copying from user space". The iov_iter_copy_from_user_atomic() part is
safe - if that takes a page fault, it will just do a partial copy, it
won't deadlock.

So you can potentially do multiple pages, and keep them all locked,
but only as long as the copies are all done with that
"from_user_atomic()" case. Which normally works fine, since normal
users will write stuff that they just generated, so it will all be
there.

It's only when that returns zero, and you do the fallback to pre-fault
in any data with iov_iter_fault_in_readable() that you need to unlock
_all_ pages (and once you do that, I don't see what possible advantage
the multi-page array can have).

Of course, the way that code is written, it always does the
iov_iter_fault_in_readable() for each page - it's not written like
some kind of "special case fallback thing".

I suspect the code was copied from the generic write code, but without
understanding why the generic write code was ok.

               Linus


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15 21:21         ` Linus Torvalds
@ 2020-10-16 10:02           ` Miklos Szeredi
  2020-10-16 12:27             ` Matthew Wilcox
  2020-10-20 20:42             ` Vivek Goyal
  2020-10-16 18:19           ` Vivek Goyal
  1 sibling, 2 replies; 23+ messages in thread
From: Miklos Szeredi @ 2020-10-16 10:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vivek Goyal, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Thu, Oct 15, 2020 at 11:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > I am wondering how should I fix this issue. Is it enough that I drop
> > the page lock (but keep the reference) inside the loop. And once copying
> > from user space is done, acquire page locks for all pages (Attached
> > a patch below).
>
> What is the page lock supposed to protect?
>
> Because whatever it protects, dropping the lock drops, and you'd need
> to re-check whatever the page lock was there for.
>
> > Or dropping page lock means that there are no guarantees that this
> > page did not get written back and removed from address space and
> > a new page has been placed at same offset. Does that mean I should
> > instead be looking up page cache again after copying from user
> > space is done.
>
> I don't know why fuse does multiple pages to begin with. Why can't it
> do whatever it does just one page at a time?
>
> But yes, you probably should look the page up again whenever you've
> unlocked it, because it might have been truncated or whatever.
>
> Note that this is purely about unlocking the page, not about "after
> copying from user space". The iov_iter_copy_from_user_atomic() part is
> safe - if that takes a page fault, it will just do a partial copy, it
> won't deadlock.
>
> So you can potentially do multiple pages, and keep them all locked,
> but only as long as the copies are all done with that
> "from_user_atomic()" case. Which normally works fine, since normal
> users will write stuff that they just generated, so it will all be
> there.
>
> It's only when that returns zero, and you do the fallback to pre-fault
> in any data with iov_iter_fault_in_readable() that you need to unlock
> _all_ pages (and once you do that, I don't see what possible advantage
> the multi-page array can have).
>
> Of course, the way that code is written, it always does the
> iov_iter_fault_in_readable() for each page - it's not written like
> some kind of "special case fallback thing".

This was added by commit ea9b9907b82a ("fuse: implement
perform_write") in v2.6.26 and remains essentially unchanged, AFAICS.
So this is an old bug indeed.

So what is the page lock protecting?   I think not truncation, because
inode_lock should be sufficient protection.

What it does after sending a synchronous WRITE and before unlocking
the pages is set the PG_uptodate flag, but only if the complete page
was really written, which is what the uptodate flag really says:  this
page is in sync with the underlying fs.

So I think the page lock here is trying to protect against concurrent
reads/faults on not uptodate pages.  I.e. until the WRITE request
completes it is unknown whether the page was really written or not, so
any reads must block until this state becomes known.  This logic falls
down on already cached pages, since they start out uptodate and the
write does not clear this flag.

So keeping the pages locked has dubious value: short writes don't seem
to work correctly anyway.  Which means that we can probably just set
the page uptodate right after filling it from the user buffer, and
unlock the page immediately.

Am I missing something?

Thanks,
Miklos


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 10:02           ` Miklos Szeredi
@ 2020-10-16 12:27             ` Matthew Wilcox
  2020-10-20 20:42             ` Vivek Goyal
  1 sibling, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2020-10-16 12:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Vivek Goyal, Qian Cai, Hugh Dickins,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Fri, Oct 16, 2020 at 12:02:21PM +0200, Miklos Szeredi wrote:
> This was added by commit ea9b9907b82a ("fuse: implement
> perform_write") in v2.6.26 and remains essentially unchanged, AFAICS.
> So this is an old bug indeed.
> 
> So what is the page lock protecting?   I think not truncation, because
> inode_lock should be sufficient protection.
> 
> What it does after sending a synchronous WRITE and before unlocking
> the pages is set the PG_uptodate flag, but only if the complete page
> was really written, which is what the uptodate flag really says:  this
> page is in sync with the underlying fs.

Not in sync with.  Uptodate means "every byte on this page is at least
as new as the bytes on storage".  Dirty means "at least one byte is
newer than the bytes on storage".

> So I think the page lock here is trying to protect against concurrent
> reads/faults on not uptodate pages.  I.e. until the WRITE request
> completes it is unknown whether the page was really written or not, so
> any reads must block until this state becomes known.  This logic falls
> down on already cached pages, since they start out uptodate and the
> write does not clear this flag.

That's not how the page cache should work -- if a write() has touched
every byte in a page, then the page should be marked as Uptodate, and it
can immediately satisfy read()s without even touching the backing store.

> So keeping the pages locked has dubious value: short writes don't seem
> to work correctly anyway.  Which means that we can probably just set
> the page uptodate right after filling it from the user buffer, and
> unlock the page immediately.
> 
> Am I missing something?

I haven't looked at fuse in detail -- are you handling partial page
writes?  That is, if someone writes to bytes 5-15 of a file, are you
first reading bytes 0-4 and 16-4095 from the backing store?  If so,
you can mark the page Uptodate as soon as you've copied bytes 5-15
from the user.
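That rule can be sketched as a tiny predicate (illustrative, not the actual page-cache code):

```python
PAGE_SIZE = 4096

def uptodate_after_write(was_uptodate, offset_in_page, length):
    # Every byte on the page is at least as new as storage if either the
    # page was already uptodate, or this write touched all of its bytes.
    # (A partial write to a non-uptodate page would first need the missing
    # bytes read in from the backing store.)
    return was_uptodate or (offset_in_page == 0 and length == PAGE_SIZE)

print(uptodate_after_write(False, 0, PAGE_SIZE))  # -> True: full-page write
print(uptodate_after_write(False, 5, 11))         # -> False: bytes 0-4 and 16+ unknown
print(uptodate_after_write(True, 5, 11))          # -> True: was already uptodate
```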


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-15 21:21         ` Linus Torvalds
  2020-10-16 10:02           ` Miklos Szeredi
@ 2020-10-16 18:19           ` Vivek Goyal
  2020-10-16 18:24             ` Linus Torvalds
  2020-10-16 23:03             ` Dave Chinner
  1 sibling, 2 replies; 23+ messages in thread
From: Vivek Goyal @ 2020-10-16 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Thu, Oct 15, 2020 at 02:21:58PM -0700, Linus Torvalds wrote:

[..]
> 
> I don't know why fuse does multiple pages to begin with. Why can't it
> do whatever it does just one page at a time?

Sending multiple pages in a single WRITE command does seem to help a lot
with performance. I modified the code to write only one page at a time
and ran a fio job with sequential writes (and random writes),
block size 64K, and compared the performance on virtiofs.

NAME                    WORKLOAD                Bandwidth       IOPS
one-page-write          seqwrite-psync          58.3mb          933
multi-page-write        seqwrite-psync          265.7mb         4251

one-page-write          randwrite-psync         53.5mb          856
multi-page-write        randwrite-psync         315.5mb         5047

So with multi-page writes, performance seems much better for this
particular workload.
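For reference, a job file along these lines would exercise the seqwrite-psync case above (a hypothetical reconstruction; the exact fio options and mount point were not posted):

```ini
[seqwrite-psync]
ioengine=psync
rw=write            ; use rw=randwrite for the randwrite-psync case
bs=64k
size=1g
directory=/mnt/virtiofs
```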

Thanks
Vivek



* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:19           ` Vivek Goyal
@ 2020-10-16 18:24             ` Linus Torvalds
  2020-10-16 18:24               ` Linus Torvalds
  2020-10-16 23:03             ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-10-16 18:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Fri, Oct 16, 2020 at 11:19 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> So with multi page writes performance seems much better for this
> particular workload.

Looking at that code, I don't see why the page needs to be unlocked
after it's been filled, so I do think the easy solution is to just
always unlock the pages. Then the rest of the multi-page code is
probably fine.

But hey, who knows. I might be missing something.

            Linus


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:24             ` Linus Torvalds
@ 2020-10-16 18:24               ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-10-16 18:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miklos Szeredi, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Fri, Oct 16, 2020 at 11:24 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking at that code, I don't see why the page needs to be unlocked [..]

That "unlocked" should be "locked", of course.

            Linus


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 18:19           ` Vivek Goyal
  2020-10-16 18:24             ` Linus Torvalds
@ 2020-10-16 23:03             ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2020-10-16 23:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, Miklos Szeredi, Qian Cai, Hugh Dickins,
	Matthew Wilcox, Kirill A . Shutemov, Linux-MM, Andrew Morton,
	linux-fsdevel, Amir Goldstein

On Fri, Oct 16, 2020 at 02:19:08PM -0400, Vivek Goyal wrote:
> On Thu, Oct 15, 2020 at 02:21:58PM -0700, Linus Torvalds wrote:
> 
> [..]
> > 
> > I don't know why fuse does multiple pages to begin with. Why can't it
> > do whatever it does just one page at a time?
> 
> Sending multiple pages in single WRITE command does seem to help a lot
> with performance. I modified code to write only one page at a time
> and ran a fio job with sequential writes(and random writes),
> block size 64K and compared the performance on virtiofs.
> 
> NAME                    WORKLOAD                Bandwidth       IOPS
> one-page-write          seqwrite-psync          58.3mb          933
> multi-page-write        seqwrite-psync          265.7mb         4251
> 
> one-page-write          randwrite-psync         53.5mb          856
> multi-page-write        randwrite-psync         315.5mb         5047
> 
> So with multi page writes performance seems much better for this
> particular workload.

Huh. This is essentially the problem the iomap buffered write path
was designed to solve.  Filesystems like gfs2 got similar major
improvements in large buffered write throughput when switching to
use iomap for buffered IO....

Essentially, it works by having iomap_apply() first ask the
filesystem to map the IO range, then iterates the page cache across
the io range performing the desired operation (iomap_write_actor()
in the case of a buffered write), then it tells the filesystem how
much of the original range it completed copying into the cache.

Hence the filesystem only does one mapping/completion operation per
contiguous IO range instead of once per dirtied page, and the inner
loop just locks a page at a time as it works over the range.  Pages
are marked uptodate+dirty as the user data is copied into them, not
when the entire IO is completely written.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-16 10:02           ` Miklos Szeredi
  2020-10-16 12:27             ` Matthew Wilcox
@ 2020-10-20 20:42             ` Vivek Goyal
  2020-10-21  7:40               ` Miklos Szeredi
  1 sibling, 1 reply; 23+ messages in thread
From: Vivek Goyal @ 2020-10-20 20:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Fri, Oct 16, 2020 at 12:02:21PM +0200, Miklos Szeredi wrote:
> On Thu, Oct 15, 2020 at 11:22 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Oct 15, 2020 at 12:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > I am wondering how should I fix this issue. Is it enough that I drop
> > > the page lock (but keep the reference) inside the loop. And once copying
> > > from user space is done, acquire page locks for all pages (Attached
> > > a patch below).
> >
> > What is the page lock supposed to protect?
> >
> > Because whatever it protects, dropping the lock drops, and you'd need
> > to re-check whatever the page lock was there for.
> >
> > > Or dropping page lock means that there are no guarantees that this
> > > page did not get written back and removed from address space and
> > > a new page has been placed at same offset. Does that mean I should
> > > instead be looking up page cache again after copying from user
> > > space is done.
> >
> > I don't know why fuse does multiple pages to begin with. Why can't it
> > do whatever it does just one page at a time?
> >
> > But yes, you probably should look the page up again whenever you've
> > unlocked it, because it might have been truncated or whatever.
> >
> > Not that this is purely about unlocking the page, not about "after
> > copying from user space". The iov_iter_copy_from_user_atomic() part is
> > safe - if that takes a page fault, it will just do a partial copy, it
> > won't deadlock.
> >
> > So you can potentially do multiple pages, and keep them all locked,
> > but only as long as the copies are all done with that
> > "from_user_atomic()" case. Which normally works fine, since normal
> > users will write stuff that they just generated, so it will all be
> > there.
> >
> > It's only when that returns zero, and you do the fallback to pre-fault
> > in any data with iov_iter_fault_in_readable() that you need to unlock
> > _all_ pages (and once you do that, I don't see what possible advantage
> > the multi-page array can have).
> >
> > Of course, the way that code is written, it always does the
> > iov_iter_fault_in_readable() for each page - it's not written like
> > some kind of "special case fallback thing".
> 
> This was added by commit ea9b9907b82a ("fuse: implement
> perform_write") in v2.6.26 and remains essentially unchanged, AFAICS.
> So this is an old bug indeed.
> 
> So what is the page lock protecting?   I think not truncation, because
> inode_lock should be sufficient protection.
> 
> What it does after sending a synchronous WRITE and before unlocking
> the pages is set the PG_uptodate flag, but only if the complete page
> was really written, which is what the uptodate flag really says:  this
> page is in sync with the underlying fs.
> 
> So I think the page lock here is trying to protect against concurrent
> reads/faults on not uptodate pages.  I.e. until the WRITE request
> completes it is unknown whether the page was really written or not, so
> any reads must block until this state becomes known.  This logic falls
> down on already cached pages, since they start out uptodate and the
> write does not clear this flag.
> 
> So keeping the pages locked has dubious value: short writes don't seem
> to work correctly anyway. Which means that we can probably just set
> the page uptodate right after filling it from the user buffer, and
> unlock the page immediately.

Hi Miklos,

As you said, for a full page WRITE, we can probably mark the
page uptodate right away and drop the page lock (keep the reference and
send the WRITE request to the fuse server). For a partial page write this
will not work, and there seem to be at least two options.

A. Either we read the page back from disk first and mark it uptodate.

B. Or we keep track of such partial writes and block any further
   reads/readpage/direct_IO on these pages till the partial write is
   complete. After that, it looks like the page will be left not uptodate
   in the page cache and the reader will read it from disk. We are doing
   something similar for tracking writeback requests. It is much
   more complicated, though, and we can probably design something
   simpler for these writethrough/synchronous writes.

I am assuming that A will lead to a performance penalty for short
random writes. So B might be better from a performance point of
view.

Is it worth giving option B a try?

Thanks
Vivek



* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-20 20:42             ` Vivek Goyal
@ 2020-10-21  7:40               ` Miklos Szeredi
  2020-10-21 20:12                 ` Vivek Goyal
  0 siblings, 1 reply; 23+ messages in thread
From: Miklos Szeredi @ 2020-10-21  7:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Tue, Oct 20, 2020 at 10:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:

> As you said, for the full page WRITE, we can probably mark it
> page uptodate write away and drop page lock (Keep reference and
> send WRITE request to fuse server). For the partial page write this will
> not work and there seem to be atleast two options.
>
> A. Either we read the page back from disk first and mark it uptodate.
>
> B. Or we keep track of such partial writes and block any further
>    reads/readpage/direct_IO on these pages till partial write is
>    complete. After that looks like page will be left notuptodate
>    in page cache and reader will read it from disk. We are doing
>    something similar for tracking writeback requests. It is much
>    more complicated though and we probably can design something
>    simpler for these writethrough/synchronous writes.
>
> I am assuming that A. will lead to performance penalty for short
> random writes.

C.  Keep the partial tail page locked.  If a write involves both a
partial head AND a partial tail page, then read the head page first.  I
think that would be a good compromise between performance and simplicity.

WDYT?

Thanks,
Miklos


* Re: Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..)
  2020-10-21  7:40               ` Miklos Szeredi
@ 2020-10-21 20:12                 ` Vivek Goyal
  0 siblings, 0 replies; 23+ messages in thread
From: Vivek Goyal @ 2020-10-21 20:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Qian Cai, Hugh Dickins, Matthew Wilcox,
	Kirill A . Shutemov, Linux-MM, Andrew Morton, linux-fsdevel,
	Amir Goldstein

On Wed, Oct 21, 2020 at 09:40:26AM +0200, Miklos Szeredi wrote:
> On Tue, Oct 20, 2020 at 10:42 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > As you said, for the full page WRITE, we can probably mark it
> > page uptodate write away and drop page lock (Keep reference and
> > send WRITE request to fuse server). For the partial page write this will
> > not work and there seem to be atleast two options.
> >
> > A. Either we read the page back from disk first and mark it uptodate.
> >
> > B. Or we keep track of such partial writes and block any further
> >    reads/readpage/direct_IO on these pages till partial write is
> >    complete. After that looks like page will be left notuptodate
> >    in page cache and reader will read it from disk. We are doing
> >    something similar for tracking writeback requests. It is much
> >    more complicated though and we probably can design something
> >    simpler for these writethrough/synchronous writes.
> >
> > I am assuming that A. will lead to performance penalty for short
> > random writes.
> 
> C.  Keep partial tail page locked.  If write involves a partial and
> head AND tail page, then read head page first.  I think that would be
> a good compromise between performance and simplicity.
> 
> WDYT?

Hi Miklos,

Sounds good. I refined this idea a little bit more to avoid the read
completely. We can add the read later if it benefits certain
situations.

D. If one does a partial page write of a page which is not uptodate, then
   keep the page locked and do not try to send multiple pages in that write.
   If the page is uptodate, then release the page lock and continue to add
   more pages to the same request.

IOW, if the head page is partial (and it is not uptodate), we will just
send the first WRITE with the head page alone. The rest of the pages will
go in a second WRITE, and the tail page can be left locked if it was a
partial write and the page was not uptodate. Please have a look at the
attached patch.

I still have some concerns with error handling, though. Not sure what
to do about it.

1. What happens if the WRITE fails? If we are writing a full page and we
   already marked it as Uptodate (and not dirty), then we have a page in
   the page cache where we wrote data but could not send it to disk
   (and did not mark it dirty either). So if a user reads the page
   back, it might get the cached copy or get the old copy from disk (if
   the page cache copy was released).

2. What happens if it is a partial page write to an Uptodate page
   in the cache and that WRITE fails? Now we have the same error scenario
   as 1. In fact, this is true even for the current code and is not
   necessarily a new scenario.

3. The current code marks a page Uptodate upon WRITE completion if
   it was a full page WRITE. What if the page was uptodate to begin
   with and the write fails? The current code will not mark it
   Uptodate, but it is already uptodate and we have the same problem as 1.

Apart from above, there are some other concerns as well.

So with this patch, if a page is Uptodate we drop the lock and send the
WRITE. Otherwise we keep the page lock and send the WRITE. This should
probably be fine from the read or read-fault point of view. Given we are
holding the inode lock, the write path is not a problem either. But

What if the page is redirtied through a writable mapping
--------------------------------------------------------
If the page is redirtied through a writable mmap, then the two writes
for the same page can go in any order. But in the synchronous write we
are carrying a pointer to the page cache page, so it probably does not
matter. We will just write the same data twice.

What about races with direct_IO reads
-------------------------------------
If a WRITE is in progress, the page is probably not marked dirty, so
generic_file_read_iter() will probably not block in
filemap_write_and_wait_range() and will continue to
mapping->a_ops->direct_IO(). And that means it can read the previous
disk data before this WRITE is complete.

Hopefully that's not a concern because we have not returned write()
success to user space, so any parallel direct I/O should not assume
any ordering of operations.

So primary concern with this patch seems to be error handling if fuse
WRITE fails.

---
 fs/fuse/file.c   |   28 ++++++++++++++++++----------
 fs/fuse/fuse_i.h |    1 +
 2 files changed, 19 insertions(+), 10 deletions(-)

Index: redhat-linux/fs/fuse/file.c
===================================================================
--- redhat-linux.orig/fs/fuse/file.c	2020-10-21 11:41:43.983166360 -0400
+++ redhat-linux/fs/fuse/file.c	2020-10-21 15:16:24.151166360 -0400
@@ -1106,17 +1106,15 @@ static ssize_t fuse_send_write_pages(str
 	count = ia->write.out.size;
 	for (i = 0; i < ap->num_pages; i++) {
 		struct page *page = ap->pages[i];
+		bool page_locked = ap->page_locked && (i == ap->num_pages - 1);
 
-		if (!err && !offset && count >= PAGE_SIZE)
-			SetPageUptodate(page);
-
-		if (count > PAGE_SIZE - offset)
-			count -= PAGE_SIZE - offset;
-		else
-			count = 0;
-		offset = 0;
-
-		unlock_page(page);
+		/* TODO: What if an error happened and it was a full page
+		 * write. Should we mark it dirty now so that it is written
+		 * back later. Otherwise we have a page in page cache which
+		 * is not marked dirty and it is not synced to disk either.
+		 */
+		if (page_locked)
+			unlock_page(page);
 		put_page(page);
 	}
 
@@ -1180,6 +1178,16 @@ static ssize_t fuse_fill_write_pages(str
 		if (offset == PAGE_SIZE)
 			offset = 0;
 
+		/* If we copied full page, mark it uptodate */
+		if (tmp == PAGE_SIZE)
+			SetPageUptodate(page);
+
+		if (PageUptodate(page)) {
+			unlock_page(page);
+		} else {
+			ap->page_locked = true;
+			break;
+		}
 		if (!fc->big_writes)
 			break;
 	} while (iov_iter_count(ii) && count < fc->max_write &&
Index: redhat-linux/fs/fuse/fuse_i.h
===================================================================
--- redhat-linux.orig/fs/fuse/fuse_i.h	2020-10-20 15:18:59.471971851 -0400
+++ redhat-linux/fs/fuse/fuse_i.h	2020-10-21 14:39:53.701166360 -0400
@@ -275,6 +275,7 @@ struct fuse_args_pages {
 	struct page **pages;
 	struct fuse_page_desc *descs;
 	unsigned int num_pages;
+	bool page_locked;
 };
 
 #define FUSE_ARGS(args) struct fuse_args args = {}




end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-13 19:59 [PATCH 0/4] Some more lock_page work Linus Torvalds
2020-10-13 20:03 ` Linus Torvalds
2020-10-14 13:05   ` Kirill A. Shutemov
2020-10-14 16:53     ` Linus Torvalds
2020-10-14 18:15       ` Matthew Wilcox
2020-10-15 10:41         ` Kirill A. Shutemov
2020-10-15  9:43       ` Kirill A. Shutemov
2020-10-15 16:44         ` Linus Torvalds
2020-10-14  5:50 ` Hugh Dickins
2020-10-15  1:48 ` Qian Cai
2020-10-15  2:44   ` Linus Torvalds
2020-10-15 15:16     ` Possible deadlock in fuse write path (Was: Re: [PATCH 0/4] Some more lock_page work..) Vivek Goyal
2020-10-15 19:55       ` Vivek Goyal
2020-10-15 21:21         ` Linus Torvalds
2020-10-16 10:02           ` Miklos Szeredi
2020-10-16 12:27             ` Matthew Wilcox
2020-10-20 20:42             ` Vivek Goyal
2020-10-21  7:40               ` Miklos Szeredi
2020-10-21 20:12                 ` Vivek Goyal
2020-10-16 18:19           ` Vivek Goyal
2020-10-16 18:24             ` Linus Torvalds
2020-10-16 18:24               ` Linus Torvalds
2020-10-16 23:03             ` Dave Chinner
