Re: [PATCH 0/4] Some more lock_page work..

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Linux-MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Amir Goldstein <amir73il@gmail.com>
Subject: Re: [PATCH 0/4] Some more lock_page work..
Date: Tue, 13 Oct 2020 13:03:30 -0700
Message-ID: <CAHk-=wicH=FaLOeum9_f7Vyyz9Fe4MWmELT7WKR_UbfY37yX-Q@mail.gmail.com>
In-Reply-To: <CAHk-=wgkD+sVx3cHAAzhVO5orgksY=7i8q6mbzwBjN0+4XTAUw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 362 bytes --]

On Tue, Oct 13, 2020 at 12:59 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Comments?

Gaah. My alpine gmail setup has gotten broken by more fancy gmail
security features, so sending the actual patches that way broke down.

So here they are as attachments instead. I'll fix my alpine
configuration after the merge window.

                  Linus

[-- Attachment #2: 0001-mm-move-final-page-locking-out-of-__do_fault-helper-.patch --]
[-- Type: text/x-patch, Size: 2744 bytes --]

From c6f074f1758e233965495f589863b75ab0e1609d Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 10:22:00 -0700
Subject: [PATCH 1/4] mm: move final page locking out of __do_fault() helper
 into callers

The old semantics of our __do_fault() helper was that it always locked
the page unless there was an error (or unless the faulting had already
handled a COW event).

That turns out to be a mistake.  Not all callers actually want the page
locked at all, and they might as well check the same VM_FAULT_LOCKED bit
that __do_fault() itself checked to see whether the page is already
locked or not.

This change only moves that final page locking out into the callers, but
intentionally does not actually change any of the locking semantics: the
callers will now just do that final page locking themselves instead.

That means that future patches may then decide to not lock the page
after all, but this is just preparation for any such future change.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..b4a7b81dcc7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3616,11 +3616,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return VM_FAULT_HWPOISON;
 	}
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	return ret;
 }
 
@@ -4000,6 +3995,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	unlock_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -4031,6 +4031,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
@@ -4054,6 +4059,11 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
-- 
2.28.0.218.gc12ef3d349
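
For readers who want to play with the pattern outside the kernel, here is a
rough userspace sketch of what patch 1/4 moves into the callers: "lock the
page unless the fault handler already returned it locked".  Everything
prefixed with toy_/TOY_ is made up for illustration (it is not the kernel
API), the "lock" is a plain flag rather than a sleeping lock, and
toy_read_fault() is only loosely modeled on do_read_fault().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel types and flags; NOT the real definitions. */
#define TOY_FAULT_LOCKED 0x1u

struct toy_page {
	bool locked;
	int lock_calls;		/* how many times toy_lock_page() ran */
};

static void toy_lock_page(struct toy_page *page)
{
	assert(!page->locked);	/* a real lock would sleep here instead */
	page->locked = true;
	page->lock_calls++;
}

static void toy_unlock_page(struct toy_page *page)
{
	assert(page->locked);
	page->locked = false;
}

/* Hypothetical ->fault() handler: may or may not return the page locked. */
static unsigned int toy_fault(struct toy_page *page, bool handler_locks)
{
	if (!handler_locks)
		return 0;
	toy_lock_page(page);
	return TOY_FAULT_LOCKED;
}

/* The pattern the patch moves out of __do_fault() and into each caller. */
static void toy_lock_if_needed(struct toy_page *page, unsigned int ret)
{
	if (!(ret & TOY_FAULT_LOCKED))
		toy_lock_page(page);
	else
		assert(page->locked);	/* plays the role of VM_BUG_ON_PAGE() */
}

/* Loosely modeled on do_read_fault(): lock if needed, finish, unlock. */
static int toy_read_fault(struct toy_page *page, bool handler_locks)
{
	unsigned int ret = toy_fault(page, handler_locks);

	toy_lock_if_needed(page, ret);
	/* finish_fault() would map the page here */
	toy_unlock_page(page);
	return page->lock_calls;
}
```

Calling toy_read_fault() on a fresh, unlocked toy_page takes the lock exactly
once whether or not the handler locked it, which is the "no change in locking
semantics" point of this patch.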


[-- Attachment #3: 0002-mm-don-t-lock-the-page-only-to-immediately-unlock-it.patch --]
[-- Type: text/x-patch, Size: 2276 bytes --]

From dcc752881236fa914d0286da71e46b31956aa0dc Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 10:43:05 -0700
Subject: [PATCH 2/4] mm: don't lock the page, only to immediately unlock it
 again for do_page_mkwrite()

Our page locking during fault handling is a bit messy, and the shared
fault code in particular was locking the page only to immediately unlock
it again because do_page_mkwrite() wanted it unlocked.

We keep the "did we lock it" state around in the VM_FAULT_LOCKED bit, so
let's just use that knowledge, and not first lock it if it wasn't
locked, only to then unlock it again.

It would be even better to transfer the "did we already lock this page"
information into do_page_mkwrite(), because that function will actually
want to lock it eventually anyway, but let's just clean up one thing at
a time.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b4a7b81dcc7a..5c93b4bec063 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,25 +4059,33 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	/*
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
 	if (vma->vm_ops->page_mkwrite) {
-		unlock_page(vmf->page);
+		/* do_page_mkwrite() wants the page unlocked */
+		if (ret & VM_FAULT_LOCKED) {
+			unlock_page(vmf->page);
+			ret &= ~VM_FAULT_LOCKED;
+		}
+
 		tmp = do_page_mkwrite(vmf);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			put_page(vmf->page);
 			return tmp;
 		}
+
+		/* Did do_page_mkwrite() lock the page again? */
+		ret |= tmp & VM_FAULT_LOCKED;
 	}
 
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf->page);
+	else
+		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
+
 	ret |= finish_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 					VM_FAULT_RETRY))) {
-- 
2.28.0.218.gc12ef3d349
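
To make the saving in patch 2/4 concrete, here is a small userspace sketch
that counts lock and unlock operations before and after the change.  All
toy_/TOY_ names are hypothetical, the lock is just a flag, and
toy_page_mkwrite() is assumed to always hand the page back locked, which is a
simplification of what do_page_mkwrite() does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the real kernel definitions. */
#define TOY_FAULT_LOCKED 0x1u

struct toy_page {
	bool locked;
	int lock_calls;
	int unlock_calls;
};

static void toy_lock_page(struct toy_page *p)
{
	assert(!p->locked);
	p->locked = true;
	p->lock_calls++;
}

static void toy_unlock_page(struct toy_page *p)
{
	assert(p->locked);
	p->locked = false;
	p->unlock_calls++;
}

/* Assumption: wants the page unlocked on entry, returns it locked. */
static unsigned int toy_page_mkwrite(struct toy_page *p)
{
	assert(!p->locked);
	toy_lock_page(p);
	return TOY_FAULT_LOCKED;
}

/* Before the patch: always lock first, then unlock for mkwrite. */
static void toy_shared_fault_old(struct toy_page *p, unsigned int ret)
{
	if (!(ret & TOY_FAULT_LOCKED))
		toy_lock_page(p);
	toy_unlock_page(p);
	(void)toy_page_mkwrite(p);
	/* finish_fault() would run here with the page locked */
	toy_unlock_page(p);
}

/* After the patch: only unlock if the page really was locked. */
static void toy_shared_fault_new(struct toy_page *p, unsigned int ret)
{
	if (ret & TOY_FAULT_LOCKED) {
		toy_unlock_page(p);
		ret &= ~TOY_FAULT_LOCKED;
	}
	ret |= toy_page_mkwrite(p) & TOY_FAULT_LOCKED;

	if (!(ret & TOY_FAULT_LOCKED))
		toy_lock_page(p);
	else
		assert(p->locked);
	/* finish_fault() would run here with the page locked */
	toy_unlock_page(p);
}
```

In this toy, the difference only shows up when the fault handler returns the
page unlocked (ret == 0): the old flow does two lock/unlock pairs, the new
flow does one.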


[-- Attachment #4: 0003-mm-do_cow_fault-does-not-need-the-source-page-to-be-.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]

From 969e62f9784dcd3083ba3fe32b87bdea8319aba9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 11:00:33 -0700
Subject: [PATCH 3/4] mm: do_cow_fault() does not need the source page to be
 locked

This removes the "lock if it wasn't locked" logic from do_cow_fault(),
since we're not even going to install that page into the destination
address space (finish_fault() will use ->cow_page rather than ->page),
and copying the source page does not need the source to be locked.

So instead of doing "lock if it wasn't locked" followed by an
unconditional unlock of the page, just do "unlock if it was locked".

Of course, since all the normal file mapping ->fault() handlers
currently lock the page they return (see filemap_fault() for details),
all of this is pretty much theoretical.

But this is the right thing to do - making sure we hold the page lock
when we really don't need it is just confusing and wrong.  And this
prepares the way for any future changes to filemap_fault() where we go
"Oh, we actually _don't_ need to lock the page for this case at all".

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5c93b4bec063..d4d32d0c33c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4031,16 +4031,13 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
 	if (ret & VM_FAULT_DONE_COW)
 		return ret;
 
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
-		lock_page(vmf->page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
-
 	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
 	__SetPageUptodate(vmf->cow_page);
 
+	if (ret & VM_FAULT_LOCKED)
+		unlock_page(vmf->page);
+
 	ret |= finish_fault(vmf);
-	unlock_page(vmf->page);
 	put_page(vmf->page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
-- 
2.28.0.218.gc12ef3d349
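
A minimal userspace sketch of the "unlock if it was locked" flow from patch
3/4, assuming a toy page that is just a flag plus a small buffer.  The
toy_/TOY_ names and the 16-byte page size are invented for the example, and
memcpy() stands in for copy_user_highpage().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-ins; NOT the real kernel definitions. */
#define TOY_FAULT_LOCKED 0x1u
#define TOY_PAGE_SIZE 16

struct toy_page {
	bool locked;
	char data[TOY_PAGE_SIZE];
};

/*
 * Loosely modeled on the patched do_cow_fault(): copy the source page
 * into the private copy without taking any lock on the source, then
 * drop the source lock only if the fault handler had taken it.
 */
static void toy_cow_fault(struct toy_page *src, struct toy_page *cow,
			  unsigned int ret)
{
	/* stand-in for copy_user_highpage(); src is not locked for this */
	memcpy(cow->data, src->data, TOY_PAGE_SIZE);

	/* "unlock if it was locked" */
	if (ret & TOY_FAULT_LOCKED) {
		assert(src->locked);
		src->locked = false;
	}

	/* finish_fault() would install cow here, never src */
}
```

Either way the source page ends up unlocked and the copy holds its contents;
the only thing that changes is that the unlocked case never touches the lock.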


[-- Attachment #5: 0004-mm-make-filemap_map_pages-avoid-the-page-lock-if-pos.patch --]
[-- Type: text/x-patch, Size: 3529 bytes --]

From 107014f091622dcb411c0ae38c99a95704a62f3f Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 13 Oct 2020 12:03:40 -0700
Subject: [PATCH 4/4] mm: make filemap_map_pages() avoid the page lock if
 possible

Private mappings don't need to be 100% serialized with file truncation
etc, and act more like an optimized "read()" call.  So we can avoid
taking the page lock for them.

NOTE! This is a trial patch.  I'm not entirely happy about this, because
I think we can avoid the page lock for shared mappings too, by just
changing the order in which we do some of the checks.

In particular, once we have the page table lock (which we need anyway),
we could increment the page mapping count.  That - together with
re-checking that the page still isn't locked - should be a sufficient
guarantee that nobody has finished truncating that page yet, and any
future truncation will end up being serialized on the page table lock.

The compound page case probably needs some thinking about too.

Needs-review-by: Matthew Wilcox <willy@infradead.org>
Needs-review-by: Kirill A. Shutemov <kirill@shutemov.name>
Needs-review-by: Hugh Dickins <hughd@google.com>
Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/filemap.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 748b7b1b4f6d..6accb7905a36 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2788,6 +2788,7 @@ EXPORT_SYMBOL(filemap_fault);
 void filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
+	bool shared = vmf->vma->vm_flags & VM_SHARED;
 	struct file *file = vmf->vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
@@ -2798,6 +2799,8 @@ void filemap_map_pages(struct vm_fault *vmf,
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_pgoff) {
+		bool locked = false;
+
 		if (xas_retry(&xas, page))
 			continue;
 		if (xa_is_value(page))
@@ -2815,15 +2818,40 @@ void filemap_map_pages(struct vm_fault *vmf,
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto skip;
+		/*
+		 * also recheck the page lock after getting the reference,
+		 * so that any page lockers will have seen us incrementing
+		 * it or not see us at all.
+		 */
+		if (PageLocked(page))
+			goto skip;
+
 		page = find_subpage(page, xas.xa_index);
 
 		if (!PageUptodate(page) ||
 				PageReadahead(page) ||
 				PageHWPoison(page))
 			goto skip;
-		if (!trylock_page(page))
-			goto skip;
 
+		/*
+		 * We only need to be really careful about races
+		 * with truncate etc for shared mappings.
+		 *
+		 * But we also need to lock the page for compound
+		 * pages (see alloc_set_pte -> page_add_file_rmap).
+		 */
+		if (shared || PageTransCompound(page)) {
+			if (!trylock_page(page))
+				goto skip;
+			locked = true;
+		}
+
+		/*
+		 * Even if we don't get the page lock, we'll re-check
+		 * the page mapping and the mapping size.
+		 *
+		 * It won't hurt, even if it's racy.
+		 */
 		if (page->mapping != mapping || !PageUptodate(page))
 			goto unlock;
 
@@ -2840,10 +2868,12 @@ void filemap_map_pages(struct vm_fault *vmf,
 		last_pgoff = xas.xa_index;
 		if (alloc_set_pte(vmf, page))
 			goto unlock;
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 		goto next;
 unlock:
-		unlock_page(page);
+		if (locked)
+			unlock_page(page);
 skip:
 		put_page(page);
 next:
-- 
2.28.0.218.gc12ef3d349
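
The per-page decision that patch 4/4 adds to filemap_map_pages() can be
sketched in userspace roughly as below.  This is a single-threaded toy with
invented toy_/TOY_ names, so it cannot show the actual races with truncation
(and the trylock can never fail after the earlier locked check); it only
illustrates which pages get mapped with the lock, without it, or skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the real kernel definitions. */
enum toy_result { TOY_SKIP, TOY_MAP_WITH_LOCK, TOY_MAP_WITHOUT_LOCK };

struct toy_page {
	bool locked;
	bool uptodate;
	bool compound;
};

static bool toy_trylock_page(struct toy_page *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	return true;
}

/* Loosely modeled on one iteration of the patched loop. */
static enum toy_result toy_map_one(struct toy_page *page, bool shared)
{
	bool locked = false;

	/* somebody else holds the page lock: leave this page alone */
	if (page->locked)
		return TOY_SKIP;
	if (!page->uptodate)
		return TOY_SKIP;

	/* only shared mappings and compound pages take the lock */
	if (shared || page->compound) {
		if (!toy_trylock_page(page))
			return TOY_SKIP;
		locked = true;
	}

	/* alloc_set_pte() would install the mapping here */

	if (locked) {
		page->locked = false;	/* unlock only what we locked */
		return TOY_MAP_WITH_LOCK;
	}
	return TOY_MAP_WITHOUT_LOCK;
}
```

A private mapping of an ordinary up-to-date page is the only case that goes
through without the lock; shared mappings and compound pages still trylock,
and anything already locked or not up to date is skipped.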