* [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap
@ 2022-01-26  9:55 David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand, Nadav Amit

This series is the result of the discussion on the previous approach [1].
More information on the general COW issues can be found there.


This series attempts to optimize and streamline the COW logic for ordinary
anon pages and THP anon pages, fixing two remaining instances of
CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information can
leak from a parent process to a child process via anonymous pages shared
during fork().

This issue, including other related COW issues, has been summarized in [2]:
"
  1. Observing Memory Modifications of Private Pages From A Child Process
  
  Long story short: process-private memory might not be as private as you
  think once you fork(): successive modifications of private memory
  regions in the parent process can still be observed by the child
  process, for example, by smart use of vmsplice()+munmap().
  
  The core problem is that pinning pages readable in a child process, such
  as done via the vmsplice system call, can result in a child process
  observing memory modifications done in the parent process the child is
  not supposed to observe. [1] contains an excellent summary and [2]
  contains further details. This issue was assigned CVE-2020-29374 [9].
  
  For this to trigger, it's required to use a fork() without subsequent
  exec(), for example, as used under Android zygote. Without further
  details about an application that forks less-privileged child processes,
  one cannot really say what's actually affected and what's not -- see the
  details section at the end of this mail for a short sshd/openssh analysis.
  
  While commit 17839856fd58 ("gup: document and work around "COW can break
  either way" issue") fixed this issue and resulted in other problems
  (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
  simplification") re-introduced part of the problem unfortunately.
  
  The original reproducer can be modified quite easily to use THP [3] and
  make the issue appear again on upstream kernels. I modified it to use
  hugetlb [4] and it triggers as well. The problem is certainly less
  severe with hugetlb than with THP; it merely highlights that we still
  have plenty of open holes we should be closing/fixing.
  
  Regarding vmsplice(), the only known workaround is to disallow the
  vmsplice() system call ... or disable THP and hugetlb. But who knows
  what else could be used (RDMA? O_DIRECT?) to achieve the same goal -- in
  the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous pages during swapin, anonymous THP and hugetlb.
This series tackles anonymous pages during swapin and anonymous THP:
* do_swap_page() for handling COW on PTEs *during* swapin directly
* do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
  faults

With this series, we'll apply the same COW logic we have in do_wp_page()
to all swappable anon pages: don't reuse (map writable) the page in
case there are additional references (page_count() != 1). All users of
reuse_swap_page() are removed, and consequently reuse_swap_page() is
removed.
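
Stripped of the swapcache and LRU pagevec handling that the individual
patches add on top, the common reuse rule this series converges on looks
roughly like the following simplified sketch (not the literal code of any
single patch):

	if (!PageKsm(page) && page_count(page) == 1) {
		/* We hold the single remaining reference: reuse. */
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
	/* Possibly shared or pinned: fall back to copying. */
	return wp_page_copy(vmf);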

In general, we're struggling with the following COW-related issues:
(1) "missed COW": we miss to copy on write and reuse the page (map it
    writable) although we must copy because there are pending references
    from another process to this page. The result is a security issue.
(2) "wrong COW": we copy on write although we wouldn't have to and
    shouldn't: if there are valid GUP references, they will become out of
    sync with the pages mapped into the page table. We fail to detect that
    such a page can be reused safely, especially if never more than a
    single process mapped the page. The result is an intra process
    memory corruption.
(3) "unnecessary COW": we copy on write although we wouldn't have to:
    performance degradation and a temporary increase in swap+memory
    consumption can be the result.

While this series fixes (1) for swappable anon pages, it first tries to
reduce reported cases of (3) as well and as easily as possible to limit the
impact when streamlining. The individual patches try to describe in which
cases we will run into (3).

This series certainly makes (2) worse for THP, because a THP will now get
PTE-mapped on write faults if there are additional references, even if
there was only ever a single process involved: once PTE-mapped, we'll copy
each and every subpage and won't reuse any subpage as long as the
underlying compound page wasn't split.

I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to
mark anon pages that are exclusive to a single process, allow GUP pins only
on such exclusive pages, and allow turning exclusive pages shared
(clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with
PageAnonExclusive set never have to be copied during write faults, but
eventually during fork() if they cannot be turned shared. The improved
reuse logic in this series will essentially also be the logic to reset
PageAnonExclusive. This work will certainly take a while, but I'm planning
on sharing details before having code fully ready.


#1-#5 can be applied independently of the rest. #6-#9 are mostly only
cleanups related to reuse_swap_page().


Notes:
* For now, I'll leave hugetlb code untouched: "unnecessary COW" might
  easily break existing setups because hugetlb pages are a scarce resource
  and we could just end up having to crash the application when we run out
  of hugetlb pages. We have to be very careful and the security aspect with
  hugetlb is most certainly less relevant than for unprivileged anon pages.
* Instead of lru_add_drain() we might actually just drain the lru_add list
  or even just remove the single page of interest from the lru_add list.
  This would require a new helper function (see the sketch after these
  notes), and could be added if the conditional lru_add_drain() turns out
  to be a problem.
* I am still fairly confused about the swapcache handling in the
  khugepaged code. This certainly needs an additional pair of eyes.
* I extended the test case already included in [1] to also test for the
  newly found do_swap_page() case. I'll send that out separately once/if
  this part has been reviewed.
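
For reference, the hypothetical helper mentioned in the first note could
look roughly like the sketch below. The name is made up and the body
assumes the current per-CPU lru_pvecs layout in mm/swap.c; it is purely
illustrative and not part of this series.

/* Drain only this CPU's lru_add pagevec instead of all local pagevecs. */
static void lru_add_drain_lru_add(void)
{
	struct pagevec *pvec;

	local_lock(&lru_pvecs.lock);
	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
	if (pagevec_count(pvec))
		__pagevec_lru_add(pvec);
	local_unlock(&lru_pvecs.lock);
}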

[1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

David Hildenbrand (9):
  mm: optimize do_wp_page() for exclusive pages in the swapcache
  mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  mm: slightly clarify KSM logic in do_swap_page()
  mm: streamline COW logic in do_swap_page()
  mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  mm/khugepaged: remove reuse_swap_page() usage
  mm/swapfile: remove reuse_swap_page()
  mm/huge_memory: remove stale page_trans_huge_mapcount()
  mm/huge_memory: remove stale locking logic from __split_huge_pmd()

 include/linux/mm.h   |   5 --
 include/linux/swap.h |   4 --
 mm/huge_memory.c     |  99 ++++++----------------------------
 mm/khugepaged.c      |  16 ++++--
 mm/memory.c          | 123 +++++++++++++++++++++++++++++++------------
 mm/swapfile.c        | 104 ------------------------------------
 6 files changed, 119 insertions(+), 232 deletions(-)


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
-- 
2.34.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 14:25   ` Matthew Wilcox
  2022-01-28 12:53   ` Vlastimil Babka
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand, Nadav Amit

Liang Zhang reported [1] that the current COW logic in do_wp_page() is
sub-optimal when it comes to swap+read fault+write fault of anonymous
pages that have a single user, visible via a performance degradation in
the redis benchmark. Something similar was previously reported [2] by
Nadav with a simple reproducer.

Let's optimize for pages that have been added to the swapcache but only
have an exclusive owner. Try removing the swapcache reference if there is
hope that we're the exclusive user.

We will fail removing the swapcache reference in two scenarios:
(1) There are additional swap entries referencing the page: copying
    instead of reusing is the right thing to do.
(2) The page is under writeback: theoretically we might be able to reuse
    in some cases, however, we cannot remove the additional reference
    and will have to copy.

Further, we might have additional references from the LRU pagevecs,
which will force us to copy instead of being able to reuse. We'll try
handling such references for some scenarios next. Concurrent writeback
cannot be handled easily and we'll always have to copy.

While at it, remove the superfluous page_mapcount() check: it's
implicitly covered by the page_count() check for ordinary anon pages.

[1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
[2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

Reported-by: Liang Zhang <zhangliang5@huawei.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..bcd3b7c50891 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,19 +3291,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	if (PageAnon(vmf->page)) {
 		struct page *page = vmf->page;
 
-		/* PageKsm() doesn't necessarily raise the page refcount */
-		if (PageKsm(page) || page_count(page) != 1)
+		/*
+		 * We have to verify under page lock: these early checks are
+		 * just an optimization to avoid locking the page and freeing
+		 * the swapcache if there is little hope that we can reuse.
+		 *
+		 * PageKsm() doesn't necessarily raise the page refcount.
+		 */
+		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
 			goto copy;
 		if (!trylock_page(page))
 			goto copy;
-		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+		if (PageKsm(page) || page_count(page) != 1) {
 			unlock_page(page);
 			goto copy;
 		}
 		/*
-		 * Ok, we've got the only map reference, and the only
-		 * page count reference, and the page is locked,
-		 * it's dark out, and we're wearing sunglasses. Hit it.
+		 * Ok, we've got the only page reference from our mapping
+		 * and the page is locked, it's dark out, and we're wearing
+		 * sunglasses. Hit it.
 		 */
 		unlock_page(page);
 		wp_page_reuse(vmf);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 14:31   ` Matthew Wilcox
  2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

For example, if a page just got swapped in via a read fault, the LRU
pagevecs might still hold a reference to the page. If we trigger a
write fault on such a page, the additional reference from the LRU
pagevecs will prohibit reusing the page.

Let's conditionally drain the local LRU pagevecs when we stumble over a
!PageLRU() page. We cannot easily drain remote LRU pagevecs and it might
not be desirable performance-wise. Consequently, this will only avoid
copying in some cases.

We cannot easily handle the following cases and we will always have to
copy:

(1) The page is referenced in the LRU pagevecs of other CPUs. We really
    would have to drain the LRU pagevecs of all CPUs -- most probably
    copying is much cheaper.

(2) The page is already PageLRU() but is getting moved between LRU
    lists, for example, for activation (e.g., mark_page_accessed()),
    deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
    drain mostly unconditionally, which might be bad performance-wise.
    Most probably this won't happen too often in practice.

Note that there are other reasons why an anon page might temporarily not
be PageLRU(): for example, compaction and migration have to isolate LRU
pages from the LRU lists first (isolate_lru_page()), moving them to
temporary local lists and clearing PageLRU() and holding an additional
reference on the page. In that case, we'll always copy.

This change seems to be fairly effective with the reproducer [1] shared
by Nadav, as long as writeback is done synchronously, for example, using
zram. However, with asynchronous writeback, we'll usually fail to free the
swapcache because the page is still under writeback: something we cannot
easily optimize for, and maybe it's not really relevant in practice.

[1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index bcd3b7c50891..61d67ceef734 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 *
 		 * PageKsm() doesn't necessarily raise the page refcount.
 		 */
-		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
+		if (PageKsm(page))
+			goto copy;
+		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
+			goto copy;
+		if (!PageLRU(page))
+			/*
+			 * Note: We cannot easily detect+handle references from
+			 * remote LRU pagevecs or references to PageLRU() pages.
+			 */
+			lru_add_drain();
+		if (page_count(page) > 1 + PageSwapCache(page))
 			goto copy;
 		if (!trylock_page(page))
 			goto copy;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Let's make it clearer that KSM might only have to copy a page
if the page is in the swapcache, not if we allocated a fresh
page and bypassed the swapcache. While at it, add a comment explaining why
this is usually necessary and merge the two swapcache conditions.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 61d67ceef734..ab3153252cfe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3617,21 +3617,29 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	/*
-	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
-	 * release the swapcache from under us.  The page pin, and pte_same
-	 * test below, are not enough to exclude that.  Even if it is still
-	 * swapcache, we need to check that the page's swap has not changed.
-	 */
-	if (unlikely((!PageSwapCache(page) ||
-			page_private(page) != entry.val)) && swapcache)
-		goto out_page;
-
-	page = ksm_might_need_to_copy(page, vma, vmf->address);
-	if (unlikely(!page)) {
-		ret = VM_FAULT_OOM;
-		page = swapcache;
-		goto out_page;
+	if (swapcache) {
+		/*
+		 * Make sure try_to_free_swap or reuse_swap_page or swapoff did
+		 * not release the swapcache from under us.  The page pin, and
+		 * pte_same test below, are not enough to exclude that.  Even if
+		 * it is still swapcache, we need to check that the page's swap
+		 * has not changed.
+		 */
+		if (unlikely(!PageSwapCache(page) ||
+			     page_private(page) != entry.val))
+			goto out_page;
+
+		/*
+		 * KSM sometimes has to copy on read faults, for example, if
+		 * page->index of !PageKSM() pages would be nonlinear inside the
+		 * anon VMA -- PageKSM() is lost on actual swapout.
+		 */
+		page = ksm_might_need_to_copy(page, vma, vmf->address);
+		if (unlikely(!page)) {
+			ret = VM_FAULT_OOM;
+			page = swapcache;
+			goto out_page;
+		}
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 4/9] mm: streamline COW logic in do_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (2 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Currently we have a different COW logic when:
* triggering a read-fault to swapin first and then trigger a write-fault
  -> do_swap_page() + do_wp_page()
* triggering a write-fault to swapin
  -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()

The COW logic in do_swap_page() is different from our reuse logic in
do_wp_page(). The COW logic in do_wp_page() -- page_count() == 1 --
currently makes sure that we certainly don't have a remaining reference, e.g.,
via GUP, on the target page we want to reuse: if there is any unexpected
reference, we have to copy to avoid information leaks.

As do_swap_page() behaves differently, in environments with swap enabled we
can currently have an unintended information leak from the parent to the
child, similar to the one known from CVE-2020-29374:

	1. Parent writes to anonymous page
	-> Page is mapped writable and modified
	2. Page is swapped out
	-> Page is unmapped and replaced by swap entry
	3. fork()
	-> Swap entries are copied to child
	4. Child pins page R/O
	-> Page is mapped R/O into child
	5. Child unmaps page
	-> Child still holds GUP reference
	6. Parent writes to page
	-> Page is reused in do_swap_page()
	-> Child can observe changes

Exchanging 2. and 3. should have the same effect.
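
For illustration, steps 1.-3. can be forced from userspace roughly as in
the following fragment, reusing the vmsplice() sketch from the cover
letter. This assumes MADV_PAGEOUT is available (kernel >= 5.4 and a libc
exposing it) and that swap is configured; otherwise the swapout step is a
no-op.

	memset(mem, 'A', 4096);			/* 1. parent writes     */
	madvise(mem, 4096, MADV_PAGEOUT);	/* 2. force swapout     */
	if (fork() == 0) {			/* 3. swap entry copied */
		/*
		 * 4.-6. proceed as in the cover letter sketch: pin R/O
		 * via vmsplice(), munmap(), then read from the pipe
		 * after the parent's write.
		 */
	}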

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page. We can change the order now that
we use try_to_free_swap(), which doesn't care about the mapcount,
instead of reuse_swap_page().

To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required; to keep it simple for now, don't consider the
page_count() when deciding whether to drain.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ab3153252cfe..ba23d13b8410 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3499,6 +3499,25 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * owner. Try freeing the swapcache to get rid of the swapcache
+	 * reference in case it's likely that we will succeed.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+		page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3640,6 +3659,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			page = swapcache;
 			goto out_page;
 		}
+
+		/*
+		 * If we want to map a page that's in the swapcache writable, we
+		 * have to detect via the refcount if we're really the exclusive
+		 * owner. Try removing the extra reference from the local LRU
+		 * pagevecs if required.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+		    !PageKsm(page) && !PageLRU(page))
+			lru_add_drain();
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3658,19 +3687,25 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
 	 */
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/*
+	 * Same logic as in do_wp_page(); however, optimize for fresh pages
+	 * that are certainly not shared because we just allocated them without
+	 * exposing them to the swapcache.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	    (page != swapcache || page_count(page) == 1)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3696,10 +3731,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (3 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 20:36   ` Yang Shi
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

We currently have a different COW logic for anon THP than we have for
ordinary anon pages in do_wp_page(): the effect is that the issue reported
in CVE-2020-29374 is currently still possible for anon THP: an unintended
information leak from the parent to the child.

Let's apply the same logic (page_count() == 1), with similar
optimizations to remove additional references first as we really want to
avoid PTE-mapping the THP and copying individual pages as best we can.

If we end up with a page that has page_count() != 1, we'll have to PTE-map
the THP and fallback to do_wp_page(), which will always copy the page.

Note that KSM does not apply to THP.

I. Interaction with the swapcache and writeback

While a THP is in the swapcache, the swapcache holds one reference on each
subpage of the THP. So with PageSwapCache() set, we expect as many
additional references as we have subpages. If we manage to remove the
THP from the swapcache, all these references will be gone.

Usually, a THP is not split when entered into the swapcache and stays a
compound page. However, try_to_unmap() will PTE-map the THP and use PTE
swap entries. There are no PMD swap entries for that purpose; consequently,
we only ever swap individual subpages in via PTEs.

Removing a page from the swapcache can fail either when there are remaining
swap entries (in which case COW is the right thing to do) or if the page is
currently under writeback.

Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
possible only in corner cases, for example, if try_to_unmap() failed
after adding the page to the swapcache. However, it's comparatively easy to
handle.

As we have to fully unmap a THP before starting writeback, and swapin is
always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
swapcache that is under writeback. This should at least leave writeback
out of the picture.

II. Interaction with GUP references

Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
will result in PTE-mapping the THP on a write fault. Similar to ordinary
anon pages, do_wp_page() will have to copy sub-pages and result in a
disconnect between the GUP references and the pages actually mapped into
the page tables. To improve the situation in the future, we'll need
additional handling to mark anonymous pages as definitely exclusive to a
single process, only allow GUP pins on exclusive anon pages, and
disallow sharing of exclusive anon pages with GUP pins e.g., during
fork().

III. Interaction with references from LRU pagevecs

Similar to ordinary anon pages, we can have LRU pagevecs referencing our
THP. Reliably removing such references requires draining LRU pagevecs on
all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
sleep. For now, similar to do_wp_page(), let's conditionally drain the
local LRU pagevecs only if we detect !PageLRU().

IV. Interaction with speculative/temporary references

Similar to ordinary anon pages, other speculative/temporary references on
the THP, for example, from the pagecache or page migration code, will
disallow exclusive reuse of the page. We'll have to PTE-map the THP.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..b6ba88a98266 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	pmd_t orig_pmd = vmf->orig_pmd;
+	int swapcache_refs = 0;
 
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	/* Lock page for reuse_swap_page() */
 	if (!trylock_page(page)) {
 		get_page(page);
 		spin_unlock(vmf->ptl);
@@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part.
+	 * See do_wp_page(): we can only map the page writable if there are
+	 * no additional references.
 	 */
-	if (reuse_swap_page(page)) {
+	if (PageSwapCache(page))
+		swapcache_refs = thp_nr_pages(page);
+	if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
+		goto unlock_fallback;
+	if (!PageLRU(page))
+		lru_add_drain();
+	if (page_count(page) > 1 + swapcache_refs)
+		goto unlock_fallback;
+	if (swapcache_refs)
+		try_to_free_swap(page);
+	if (page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		return VM_FAULT_WRITE;
 	}
 
+unlock_fallback:
 	unlock_page(page);
 	spin_unlock(vmf->ptl);
 fallback:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (4 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-27 21:23   ` Yang Shi
  2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

reuse_swap_page() currently indicates if we can write to an anon page
without COW. A COW is required if the page is shared by multiple
processes (either already mapped or via swap entries) or if there is
concurrent writeback that cannot tolerate concurrent page modifications.

reuse_swap_page() doesn't check for pending references from other
processes that already unmapped the page, however,
is_refcount_suitable() essentially does the same thing in the context of
khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
we want to remove that function.

In the context of khugepaged, we are not actually going to write to the
page and we don't really care about other processes mapping the page:
for example, without swap, we don't care about shared pages at all.

The current logic seems to be:
* Writable: -> Not shared, but might be in the swapcache. Nobody can
  fault it in from the swapcache as there are no other swap entries.
* Readable and not in the swapcache: Might be shared (but nobody can
  fault it in from the swapcache).
* Readable and in the swapcache: Might be shared and someone might be
  able to fault it in from the swapcache. Make sure we're the exclusive
  owner via reuse_swap_page().

Having to guess due to lack of comments and documentation, the current
logic really only wants to make sure that a page that might be shared
cannot be faulted in from the swapcache while khugepaged is active.
It's hard to guess why that is the case and whether it's really still required,
but let's try keeping that logic unmodified.

Instead of relying on reuse_swap_page(), let's unconditionally call
try_to_free_swap(), special-casing PageKsm(). try_to_free_swap() will fail
if there are still swap entries targeting the page or if the page is under
writeback.

After a successful try_to_free_swap(), the page cannot be re-added to the
swapcache because we keep the page locked and removed from the LRU
until we actually perform the copy. So once we have succeeded in removing a
page from the swapcache, it cannot be re-added until we're done copying.
Add a comment stating that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/khugepaged.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..bc0ff598e98f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		if (!pte_write(pteval) && PageSwapCache(page) &&
-				!reuse_swap_page(page)) {
+		    (PageKsm(page) || !try_to_free_swap(page))) {
 			/*
-			 * Page is in the swap cache and cannot be re-used.
-			 * It cannot be collapsed into a THP.
+			 * Possibly shared page cannot be removed from the
+			 * swapcache. It cannot be collapsed into a THP.
 			 */
 			unlock_page(page);
 			result = SCAN_SWAP_CACHE_PAGE;
@@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
+
+		/*
+		 * We're holding the page lock and removed the page from the
+		 * LRU. Once done copying, we'll unlock and readd to the
+		 * LRU via release_pte_page(). If the page is still in the
+		 * swapcache, we're the exclusive owner. Due to the page lock
+		 * the page cannot be added to the swapcache until we're done
+		 * and consequently it cannot be faulted in from the swapcache
+		 * into another process.
+		 */
 		mod_node_page_state(page_pgdat(page),
 				NR_ISOLATED_ANON + page_is_file_lru(page),
 				compound_nr(page));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (5 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

All users are gone, let's remove it. We'll let SWP_STABLE_WRITES stick
around for now, as it might come in handy in the near future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h |   4 --
 mm/swapfile.c        | 104 -------------------------------------------
 2 files changed, 108 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..b546e4bd5c5a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -514,7 +514,6 @@ extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
-extern bool reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -680,9 +679,6 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(page_trans_huge_mapcount(page) == 1)
-
 static inline int try_to_free_swap(struct page *page)
 {
 	return 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf0df7aa7158..a5183315dc58 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1167,16 +1167,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-	if (p)
-		spin_lock(&p->lock);
-	return p;
-}
-
 static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
 					struct swap_info_struct *q)
 {
@@ -1601,100 +1591,6 @@ static bool page_swapped(struct page *page)
 	return false;
 }
 
-static int page_trans_huge_map_swapcount(struct page *page,
-					 int *total_swapcount)
-{
-	int i, map_swapcount, _total_swapcount;
-	unsigned long offset = 0;
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci = NULL;
-	unsigned char *map = NULL;
-	int swapcount = 0;
-
-	/* hugetlbfs shouldn't call it */
-	VM_BUG_ON_PAGE(PageHuge(page), page);
-
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
-		if (PageSwapCache(page))
-			swapcount = page_swapcount(page);
-		if (total_swapcount)
-			*total_swapcount = swapcount;
-		return swapcount + page_trans_huge_mapcount(page);
-	}
-
-	page = compound_head(page);
-
-	_total_swapcount = map_swapcount = 0;
-	if (PageSwapCache(page)) {
-		swp_entry_t entry;
-
-		entry.val = page_private(page);
-		si = _swap_info_get(entry);
-		if (si) {
-			map = si->swap_map;
-			offset = swp_offset(entry);
-		}
-	}
-	if (map)
-		ci = lock_cluster(si, offset);
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		int mapcount = atomic_read(&page[i]._mapcount) + 1;
-		if (map) {
-			swapcount = swap_count(map[offset + i]);
-			_total_swapcount += swapcount;
-		}
-		map_swapcount = max(map_swapcount, mapcount + swapcount);
-	}
-	unlock_cluster(ci);
-
-	if (PageDoubleMap(page))
-		map_swapcount -= 1;
-
-	if (total_swapcount)
-		*total_swapcount = _total_swapcount;
-
-	return map_swapcount + compound_mapcount(page);
-}
-
-/*
- * We can write to an anon page without COW if there are no other references
- * to it.  And as a side-effect, free up its swap: because the old content
- * on disk will never be read, and seeking back there to write new content
- * later would only waste time away from clustering.
- */
-bool reuse_swap_page(struct page *page)
-{
-	int count, total_swapcount;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	if (unlikely(PageKsm(page)))
-		return false;
-	count = page_trans_huge_map_swapcount(page, &total_swapcount);
-	if (count == 1 && PageSwapCache(page) &&
-	    (likely(!PageTransCompound(page)) ||
-	     /* The remaining swap count will be freed soon */
-	     total_swapcount == page_swapcount(page))) {
-		if (!PageWriteback(page)) {
-			page = compound_head(page);
-			delete_from_swap_cache(page);
-			SetPageDirty(page);
-		} else {
-			swp_entry_t entry;
-			struct swap_info_struct *p;
-
-			entry.val = page_private(page);
-			p = swap_info_get(entry);
-			if (p->flags & SWP_STABLE_WRITES) {
-				spin_unlock(&p->lock);
-				return false;
-			}
-			spin_unlock(&p->lock);
-		}
-	}
-
-	return count <= 1;
-}
-
 /*
  * If swap is getting full, or if there are no more mappings of this page,
  * then try_to_free_swap is called to free its swap space.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (6 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

All users are gone, let's remove it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  5 -----
 mm/huge_memory.c   | 48 ----------------------------------------------
 2 files changed, 53 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..df6b34d66a9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -820,16 +820,11 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
-int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
-static inline int page_trans_huge_mapcount(struct page *page)
-{
-	return page_mapcount(page);
-}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b6ba88a98266..863c933b3b1e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,54 +2529,6 @@ int total_mapcount(struct page *page)
 	return ret;
 }
 
-/*
- * This calculates accurately how many mappings a transparent hugepage
- * has (unlike page_mapcount() which isn't fully accurate). This full
- * accuracy is primarily needed to know if copy-on-write faults can
- * reuse the page and change the mapping to read-write instead of
- * copying them. At the same time this returns the total_mapcount too.
- *
- * The function returns the highest mapcount any one of the subpages
- * has. If the return value is one, even if different processes are
- * mapping different subpages of the transparent hugepage, they can
- * all reuse it, because each process is reusing a different subpage.
- *
- * The total_mapcount is instead counting all virtual mappings of the
- * subpages. If the total_mapcount is equal to "one", it tells the
- * caller all mappings belong to the same "mm" and in turn the
- * anon_vma of the transparent hugepage can become the vma->anon_vma
- * local one as no other process may be mapping any of the subpages.
- *
- * It would be more accurate to replace page_mapcount() with
- * page_trans_huge_mapcount(), however we only use
- * page_trans_huge_mapcount() in the copy-on-write faults where we
- * need full accuracy to avoid breaking page pinning, because
- * page_trans_huge_mapcount() is slower than page_mapcount().
- */
-int page_trans_huge_mapcount(struct page *page)
-{
-	int i, ret;
-
-	/* hugetlbfs shouldn't call it */
-	VM_BUG_ON_PAGE(PageHuge(page), page);
-
-	if (likely(!PageTransCompound(page)))
-		return atomic_read(&page->_mapcount) + 1;
-
-	page = compound_head(page);
-
-	ret = 0;
-	for (i = 0; i < thp_nr_pages(page); i++) {
-		int mapcount = atomic_read(&page[i]._mapcount) + 1;
-		ret = max(ret, mapcount);
-	}
-
-	if (PageDoubleMap(page))
-		ret -= 1;
-
-	return ret + compound_mapcount(page);
-}
-
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page, int *pextra_pins)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (7 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Let's remove the stale logic that was required for reuse_swap_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 32 +-------------------------------
 1 file changed, 1 insertion(+), 31 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 863c933b3b1e..5cc438f92548 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2158,8 +2158,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
-	bool do_unlock_page = false;
-	pmd_t _pmd;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
@@ -2178,35 +2176,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			goto out;
 	}
 
-repeat:
 	if (pmd_trans_huge(*pmd)) {
-		if (!page) {
+		if (!page)
 			page = pmd_page(*pmd);
-			/*
-			 * An anonymous page must be locked, to ensure that a
-			 * concurrent reuse_swap_page() sees stable mapcount;
-			 * but reuse_swap_page() is not used on shmem or file,
-			 * and page lock must not be taken when zap_pmd_range()
-			 * calls __split_huge_pmd() while i_mmap_lock is held.
-			 */
-			if (PageAnon(page)) {
-				if (unlikely(!trylock_page(page))) {
-					get_page(page);
-					_pmd = *pmd;
-					spin_unlock(ptl);
-					lock_page(page);
-					spin_lock(ptl);
-					if (unlikely(!pmd_same(*pmd, _pmd))) {
-						unlock_page(page);
-						put_page(page);
-						page = NULL;
-						goto repeat;
-					}
-					put_page(page);
-				}
-				do_unlock_page = true;
-			}
-		}
 		if (PageMlocked(page))
 			clear_page_mlock(page);
 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
@@ -2214,8 +2186,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
 out:
 	spin_unlock(ptl);
-	if (do_unlock_page)
-		unlock_page(page);
 	/*
 	 * No need to double call mmu_notifier->invalidate_range() callback.
 	 * They are 3 cases to consider inside __split_huge_pmd_locked():
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
@ 2022-01-26 14:25   ` Matthew Wilcox
  2022-01-28 12:53   ` Vlastimil Babka
  1 sibling, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-26 14:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, Nadav Amit

On Wed, Jan 26, 2022 at 10:55:49AM +0100, David Hildenbrand wrote:
> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
> sub-optimal when it comes to swap+read fault+write fault of anonymous
> pages that have a single user, visible via a performance degradation in
> the redis benchmark. Something similar was previously reported [2] by
> Nadav with a simple reproducer.
> 
> Let's optimize for pages that have been added to the swapcache but only
> have an exclusive owner. Try removing the swapcache reference if there is
> hope that we're the exclusive user.
> 
> We will fail removing the swapcache reference in two scenarios:
> (1) There are additional swap entries referencing the page: copying
>     instead of reusing is the right thing to do.
> (2) The page is under writeback: theoretically we might be able to reuse
>     in some cases, however, we cannot remove the additional reference
>     and will have to copy.
> 
> Further, we might have additional references from the LRU pagevecs,
> which will force us to copy instead of being able to reuse. We'll try
> handling such references for some scenarios next. Concurrent writeback
> cannot be handled easily and we'll always have to copy.
> 
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
> 
> [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
> [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
> 
> Reported-by: Liang Zhang <zhangliang5@huawei.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
@ 2022-01-26 14:31   ` Matthew Wilcox
  2022-01-26 14:36     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-26 14:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm

On Wed, Jan 26, 2022 at 10:55:50AM +0100, David Hildenbrand wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index bcd3b7c50891..61d67ceef734 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  		 *
>  		 * PageKsm() doesn't necessarily raise the page refcount.
>  		 */
> -		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
> +		if (PageKsm(page))
> +			goto copy;
> +		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
> +			goto copy;
> +		if (!PageLRU(page))
> +			/*
> +			 * Note: We cannot easily detect+handle references from
> +			 * remote LRU pagevecs or references to PageLRU() pages.
> +			 */
> +			lru_add_drain();
> +		if (page_count(page) > 1 + PageSwapCache(page))
>  			goto copy;

I worry we're starting to get too accurate here.  How about:

		if (PageKsm(page) || page_count(page) > 3)
			goto copy;
		if (!PageLRU(page))
			lru_add_drain();
		if (!trylock_page(page))
			goto copy;
...

>  		if (!trylock_page(page))
>  			goto copy;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26 14:31   ` Matthew Wilcox
@ 2022-01-26 14:36     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26 14:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm

On 26.01.22 15:31, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 10:55:50AM +0100, David Hildenbrand wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bcd3b7c50891..61d67ceef734 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>  		 *
>>  		 * PageKsm() doesn't necessarily raise the page refcount.
>>  		 */
>> -		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>> +		if (PageKsm(page))
>> +			goto copy;
>> +		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
>> +			goto copy;
>> +		if (!PageLRU(page))
>> +			/*
>> +			 * Note: We cannot easily detect+handle references from
>> +			 * remote LRU pagevecs or references to PageLRU() pages.
>> +			 */
>> +			lru_add_drain();
>> +		if (page_count(page) > 1 + PageSwapCache(page))
>>  			goto copy;
> 
> I worry we're starting to get too accurate here.  How about:
> 

In that simplified case we'll essentially trylock_page() even for most
pages that are ordinarily shared by, e.g., 2 processes.

> 		if (PageKsm(page) || page_count(page) > 3)
> 			goto copy;
> 		if (!PageLRU(page))
> 			lru_add_drain();
> 		if (!trylock_page(page))
> 			goto copy;
> ...
> 

I think we might at least want the

if (page_count(page) > 1 + PageSwapCache(page))
	goto copy;

check here.

>>  		if (!trylock_page(page))
>>  			goto copy;
>> -- 
>> 2.34.1
>>
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
@ 2022-01-26 20:36   ` Yang Shi
  2022-01-27  8:14     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-01-26 20:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> We currently have a different COW logic for anon THP than we have for
> ordinary anon pages in do_wp_page(): the effect is that the issue reported
> in CVE-2020-29374 is currently still possible for anon THP: an unintended
> information leak from the parent to the child.
>
> Let's apply the same logic (page_count() == 1), with similar
> optimizations to remove additional references first, as we really want to
> avoid PTE-mapping the THP and copying individual pages as best we can.
>
> If we end up with a page that has page_count() != 1, we'll have to PTE-map
> the THP and fall back to do_wp_page(), which will always copy the page.
>
> Note that KSM does not apply to THP.
>
> I. Interaction with the swapcache and writeback
>
> While a THP is in the swapcache, the swapcache holds one reference on each
> subpage of the THP. So with PageSwapCache() set, we expect as many
> additional references as we have subpages. If we manage to remove the
> THP from the swapcache, all these references will be gone.
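>
> (For example, with 4k base pages: an exclusively owned, PMD-mapped 2 MiB
> THP that is in the swapcache has page_count() == 1 + 512 == 513.)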
>
> Usually, a THP is not split when entered into the swapcache and stays a
> compound page. However, try_to_unmap() will PTE-map the THP and use PTE
> swap entries. There are no PMD swap entries for that purpose, consequently,
> we always only swapin subpages into PTEs.
>
> Removing a page from the swapcache can fail either when there are remaining
> swap entries (in which case COW is the right thing to do) or if the page is
> currently under writeback.
>
> Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
> possible only in corner cases, for example, if try_to_unmap() failed
> after adding the page to the swapcache. However, it's comparatively easy to
> handle.
>
> As we have to fully unmap a THP before starting writeback, and swapin is
> always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
> swapcache that is under writeback. This should at least leave writeback
> out of the picture.
>
> II. Interaction with GUP references
>
> Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
> will result in PTE-mapping the THP on a write fault. Similar to ordinary
> anon pages, do_wp_page() will have to copy sub-pages and result in a
> disconnect between the GUP references and the pages actually mapped into
> the page tables. To improve the situation in the future, we'll need
> additional handling to mark anonymous pages as definitely exclusive to a
> single process, only allow GUP pins on exclusive anon pages, and
> disallow sharing of exclusive anon pages with GUP pins e.g., during
> fork().
>
> III. Interaction with references from LRU pagevecs
>
> Similar to ordinary anon pages, we can have LRU pagevecs referencing our
> THP. Reliably removing such references requires draining LRU pagevecs on
> all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
> sleep. For now, similar to do_wp_page(), let's conditionally drain the
> local LRU pagevecs only if we detect !PageLRU().
>
> IV. Interaction with speculative/temporary references
>
> Similar to ordinary anon pages, other speculative/temporary references on
> the THP, for example, from the pagecache or page migration code, will
> disallow exclusive reuse of the page. We'll have to PTE-map the THP.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..b6ba88a98266 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         struct page *page;
>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>         pmd_t orig_pmd = vmf->orig_pmd;
> +       int swapcache_refs = 0;
>
>         vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>         VM_BUG_ON_VMA(!vma->anon_vma, vma);
> @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         page = pmd_page(orig_pmd);
>         VM_BUG_ON_PAGE(!PageHead(page), page);
>
> -       /* Lock page for reuse_swap_page() */
>         if (!trylock_page(page)) {
>                 get_page(page);
>                 spin_unlock(vmf->ptl);
> @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         }
>
>         /*
> -        * We can only reuse the page if nobody else maps the huge page or it's
> -        * part.
> +        * See do_wp_page(): we can only map the page writable if there are
> +        * no additional references.
>          */
> -       if (reuse_swap_page(page)) {
> +       if (PageSwapCache(page))
> +               swapcache_refs = thp_nr_pages(page);
> +       if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
> +               goto unlock_fallback;
> +       if (!PageLRU(page))
> +               lru_add_drain();

IMHO, draining the LRU doesn't help much for THP, since a THP is
drained to the LRU immediately once it is added to a pagevec.

> +       if (page_count(page) > 1 + swapcache_refs)
> +               goto unlock_fallback;
> +       if (swapcache_refs)
> +               try_to_free_swap(page);
> +       if (page_count(page) == 1) {
>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
>                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>                 return VM_FAULT_WRITE;
>         }
>
> +unlock_fallback:
>         unlock_page(page);
>         spin_unlock(vmf->ptl);
>  fallback:
> --
> 2.34.1
>


* Re: [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26 20:36   ` Yang Shi
@ 2022-01-27  8:14     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-27  8:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On 26.01.22 21:36, Yang Shi wrote:
> On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> We currently have a different COW logic for anon THP than we have for
>> ordinary anon pages in do_wp_page(): the effect is that the issue reported
>> in CVE-2020-29374 is currently still possible for anon THP: an unintended
>> information leak from the parent to the child.
>>
>> Let's apply the same logic (page_count() == 1), with similar
>> optimizations to remove additional references first, as we really want to
>> avoid PTE-mapping the THP and copying individual pages as best we can.
>>
>> If we end up with a page that has page_count() != 1, we'll have to PTE-map
>> the THP and fall back to do_wp_page(), which will always copy the page.
>>
>> Note that KSM does not apply to THP.
>>
>> I. Interaction with the swapcache and writeback
>>
>> While a THP is in the swapcache, the swapcache holds one reference on each
>> subpage of the THP. So with PageSwapCache() set, we expect as many
>> additional references as we have subpages. If we manage to remove the
>> THP from the swapcache, all these references will be gone.
>>
>> Usually, a THP is not split when entered into the swapcache and stays a
>> compound page. However, try_to_unmap() will PTE-map the THP and use PTE
>> swap entries. There are no PMD swap entries for that purpose, consequently,
>> we always only swapin subpages into PTEs.
>>
>> Removing a page from the swapcache can fail either when there are remaining
>> swap entries (in which case COW is the right thing to do) or if the page is
>> currently under writeback.
>>
>> Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
>> possible only in corner cases, for example, if try_to_unmap() failed
>> after adding the page to the swapcache. However, it's comparatively easy to
>> handle.
>>
>> As we have to fully unmap a THP before starting writeback, and swapin is
>> always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
>> swapcache that is under writeback. This should at least leave writeback
>> out of the picture.
>>
>> II. Interaction with GUP references
>>
>> Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
>> will result in PTE-mapping the THP on a write fault. Similar to ordinary
>> anon pages, do_wp_page() will have to copy sub-pages and result in a
>> disconnect between the GUP references and the pages actually mapped into
>> the page tables. To improve the situation in the future, we'll need
>> additional handling to mark anonymous pages as definitely exclusive to a
>> single process, only allow GUP pins on exclusive anon pages, and
>> disallow sharing of exclusive anon pages with GUP pins e.g., during
>> fork().
>>
>> III. Interaction with references from LRU pagevecs
>>
>> Similar to ordinary anon pages, we can have LRU pagevecs referencing our
>> THP. Reliably removing such references requires draining LRU pagevecs on
>> all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
>> sleep. For now, similar to do_wp_page(), let's conditionally drain the
>> local LRU pagevecs only if we detect !PageLRU().
>>
>> IV. Interaction with speculative/temporary references
>>
>> Similar to ordinary anon pages, other speculative/temporary references on
>> the THP, for example, from the pagecache or page migration code, will
>> disallow exclusive reuse of the page. We'll have to PTE-map the THP.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/huge_memory.c | 19 +++++++++++++++----
>>  1 file changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 406a3c28c026..b6ba88a98266 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         struct page *page;
>>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>         pmd_t orig_pmd = vmf->orig_pmd;
>> +       int swapcache_refs = 0;
>>
>>         vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>>         VM_BUG_ON_VMA(!vma->anon_vma, vma);
>> @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         page = pmd_page(orig_pmd);
>>         VM_BUG_ON_PAGE(!PageHead(page), page);
>>
>> -       /* Lock page for reuse_swap_page() */
>>         if (!trylock_page(page)) {
>>                 get_page(page);
>>                 spin_unlock(vmf->ptl);
>> @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         }
>>
>>         /*
>> -        * We can only reuse the page if nobody else maps the huge page or it's
>> -        * part.
>> +        * See do_wp_page(): we can only map the page writable if there are
>> +        * no additional references.
>>          */
>> -       if (reuse_swap_page(page)) {
>> +       if (PageSwapCache(page))
>> +               swapcache_refs = thp_nr_pages(page);
>> +       if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
>> +               goto unlock_fallback;
>> +       if (!PageLRU(page))
>> +               lru_add_drain();
> 
> IMHO, draining the LRU doesn't help much for THP, since a THP is
> drained to the LRU immediately once it is added to a pagevec.

Oh, thanks, I think you're right. The interesting bit is

static bool pagevec_add_and_need_flush(struct pagevec *pvec,
				       struct page *page)
{
	bool ret = false;

	if (!pagevec_add(pvec, page) || PageCompound(page) ||
			lru_cache_disabled())
		ret = true;

	return ret;
}

Which indeed drains after adding it to the pagevec.
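
So the !PageLRU() special-casing can go away for THP, and the reuse
check might then boil down to (untested sketch):

	/* LRU pagevecs are drained right after a THP is added to them. */
	if (page_count(page) > 1 + swapcache_refs)
		goto unlock_fallback;
	if (swapcache_refs)
		try_to_free_swap(page);
	if (page_count(page) == 1) {
		/* No other references: we can reuse and map writable. */
		...
	}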

Will adjust the patch and update the description/comment accordingly,
thanks!

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
@ 2022-01-27 21:23   ` Yang Shi
  2022-01-28  8:41     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-01-27 21:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> reuse_swap_page() currently indicates if we can write to an anon page
> without COW. A COW is required if the page is shared by multiple
> processes (either already mapped or via swap entries) or if there is
> concurrent writeback that cannot tolerate concurrent page modifications.
>
> reuse_swap_page() doesn't check for pending references from other
> processes that already unmapped the page, however,
> is_refcount_suitable() essentially does the same thing in the context of
> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> we want to remove that function.
>
> In the context of khugepaged, we are not actually going to write to the
> page and we don't really care about other processes mapping the page:
> for example, without swap, we don't care about shared pages at all.
>
> The current logic seems to be:
> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>   fault it in from the swapcache as there are no other swap entries.
> * Readable and not in the swapcache: Might be shared (but nobody can
>   fault it in from the swapcache).
> * Readable and in the swapcache: Might be shared and someone might be
>   able to fault it in from the swapcache. Make sure we're the exclusive
>   owner via reuse_swap_page().
>
> Having to guess due to lack of comments and documentation, the current
> logic really only wants to make sure that a page that might be shared
> cannot be faulted in from the swapcache while khugepaged is active.
> It's hard to guess why that is the case and if it's really still required,
> but let's try keeping that logic unmodified.

I don't think it could be faulted in while khugepaged is active since
khugepaged does hold mmap_lock in write mode IIUC. So page fault is
serialized against khugepaged.

My wild guess is that collapsing shared pages was not supported before
v5.8, so we needed reuse_swap_page() to tell us if the page in the swap
cache is shared or not. But that is not true anymore. And khugepaged
just allocates a THP, copies the data from the base pages to the huge
page, then replaces the PTEs with a PMD; it doesn't change the content
of the page, so I fail to see a problem with collapsing a shared page
in the swap cache. But I'm really not entirely sure, I may miss something...



>
> Instead of relying on reuse_swap_page(), let's unconditionally
> try_to_free_swap(), special casing PageKsm(). try_to_free_swap() will fail
> if there are still swap entries targeting the page or if the page is under
> writeback.
>
> After a successful try_to_free_swap() that page cannot be readded to the
> swapcache because we're keeping the page locked and removed from the LRU
> until we actually perform the copy. So once we succeeded removing a page
> from the swapcache, it cannot be re-added until we're done copying. Add a
> comment stating that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/khugepaged.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 35f14d0a00a6..bc0ff598e98f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 if (!pte_write(pteval) && PageSwapCache(page) &&
> -                               !reuse_swap_page(page)) {
> +                   (PageKsm(page) || !try_to_free_swap(page))) {
>                         /*
> -                        * Page is in the swap cache and cannot be re-used.
> -                        * It cannot be collapsed into a THP.
> +                        * Possibly shared page cannot be removed from the
> +                        * swapcache. It cannot be collapsed into a THP.
>                          */
>                         unlock_page(page);
>                         result = SCAN_SWAP_CACHE_PAGE;
> @@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         result = SCAN_DEL_PAGE_LRU;
>                         goto out;
>                 }
> +
> +               /*
> +                * We're holding the page lock and removed the page from the
> +                * LRU. Once done copying, we'll unlock and readd to the
> +                * LRU via release_pte_page(). If the page is still in the
> +                * swapcache, we're the exclusive owner. Due to the page lock
> +                * the page cannot be added to the swapcache until we're done
> +                * and consequently it cannot be faulted in from the swapcache
> +                * into another process.
> +                */
>                 mod_node_page_state(page_pgdat(page),
>                                 NR_ISOLATED_ANON + page_is_file_lru(page),
>                                 compound_nr(page));
> --
> 2.34.1
>


* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-27 21:23   ` Yang Shi
@ 2022-01-28  8:41     ` David Hildenbrand
  2022-01-28 17:10       ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-28  8:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On 27.01.22 22:23, Yang Shi wrote:
> On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> reuse_swap_page() currently indicates if we can write to an anon page
>> without COW. A COW is required if the page is shared by multiple
>> processes (either already mapped or via swap entries) or if there is
>> concurrent writeback that cannot tolerate concurrent page modifications.
>>
>> reuse_swap_page() doesn't check for pending references from other
>> processes that already unmapped the page, however,
>> is_refcount_suitable() essentially does the same thing in the context of
>> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
>> we want to remove that function.
>>
>> In the context of khugepaged, we are not actually going to write to the
>> page and we don't really care about other processes mapping the page:
>> for example, without swap, we don't care about shared pages at all.
>>
>> The current logic seems to be:
>> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>>   fault it in from the swapcache as there are no other swap entries.
>> * Readable and not in the swapcache: Might be shared (but nobody can
>>   fault it in from the swapcache).
>> * Readable and in the swapcache: Might be shared and someone might be
>>   able to fault it in from the swapcache. Make sure we're the exclusive
>>   owner via reuse_swap_page().
>>
>> Having to guess due to lack of comments and documentation, the current
>> logic really only wants to make sure that a page that might be shared
>> cannot be faulted in from the swapcache while khugepaged is active.
>> It's hard to guess why that is the case and if it's really still required,
>> but let's try keeping that logic unmodified.
> 
> I don't think it could be faulted in while khugepaged is active since
> khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> serialized against khugepaged.

It could get faulted in by another process sharing the page, because we 
only synchronize against the current process.

> 
> My wild guess is that collapsing shared pages was not supported before
> v5.8, so we need reuse_swap_page() to tell us if the page in swap
> cache is shared or not. But it is not true anymore. And khugepaged
> just allocates a THP then copy the data from base pages to huge page
> then replace PTEs to PMD, it doesn't change the content of the page,
> so I failed to see a problem by collapsing a shared page in swap
> cache. But I'm really not entirely sure, I may miss something...


Looking more closely where this logic originates from, it was introduced in:

commit 10359213d05acf804558bda7cc9b8422a828d1cd
Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Date:   Wed Feb 11 15:28:28 2015 -0800

    mm: incorporate read-only pages into transparent huge pages
    
    This patch aims to improve THP collapse rates, by allowing THP collapse in
    the presence of read-only ptes, like those left in place by do_swap_page
    after a read fault.
    
    Currently THP can collapse 4kB pages into a THP when there are up to
    khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
    the same limit for read-only ptes.


The change essentially results in a read-only mapped PTE page getting copied and
mapped writable via a new PMD-mapped THP.

It mentions do_swap_page(), so I assume it just tried to do what
do_swap_page() would do when immediately mapping a page swapped in from
the swapcache writable.

But we differ from do_swap_page() that we're not actually going to map the page
writable, we're going to copy the page (__collapse_huge_page_copy()) and map
the copy writable.

I assume we can remove that logic.
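
IOW (untested sketch), __collapse_huge_page_isolate() would simply lose
the swapcache special case:

-		if (!pte_write(pteval) && PageSwapCache(page) &&
-		    (PageKsm(page) || !try_to_free_swap(page))) {
-			unlock_page(page);
-			result = SCAN_SWAP_CACHE_PAGE;
-			goto out;
-		}

and rely on the page lock plus the existing is_refcount_suitable()
check, which already accounts for swapcache references:

		if (!is_refcount_suitable(page)) {
			unlock_page(page);
			result = SCAN_PAGE_COUNT;
			goto out;
		}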

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
  2022-01-26 14:25   ` Matthew Wilcox
@ 2022-01-28 12:53   ` Vlastimil Babka
  2022-01-28 13:44     ` David Hildenbrand
  1 sibling, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2022-01-28 12:53 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	linux-mm, Nadav Amit

On 1/26/22 10:55, David Hildenbrand wrote:
> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
> sub-optimal when it comes to swap+read fault+write fault of anonymous
> pages that have a single user, visible via a performance degradation in
> the redis benchmark. Something similar was previously reported [2] by
> Nadav with a simple reproducer.

Can we make the description more self-contained? I.e. describe that
sub-optimal COW means we copy when it's not necessary, and this can happen
if swap-out is followed by a swap-in for read and then a write fault
(IIUC), because the swap cache reference increases page_count()...
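
Something like this sequence (my understanding, for illustration):

	swap-out:        page unmapped, PTE becomes a swap entry
	swap-in (read):  page read into the swapcache and mapped R/O
	                 -> page_count() == 2 (1 mapping + 1 swapcache ref)
	write fault:     old "page_count(page) != 1" check -> copy,
	                 although there is only a single user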

> Let's optimize for pages that have been added to the swapcache but only
> have an exclusive owner. Try removing the swapcache reference if there is
> hope that we're the exclusive user.

Can we expect any downside for reclaim efficiency due to the more aggressive
removal from swapcache? Probably not, as we are doing the removal when the
page is about to get dirty, so we wouldn't be able to reuse any previously
swapped out content anyway. Maybe it's even beneficial?

> We will fail removing the swapcache reference in two scenarios:
> (1) There are additional swap entries referencing the page: copying
>     instead of reusing is the right thing to do.
> (2) The page is under writeback: theoretically we might be able to reuse
>     in some cases, however, we cannot remove the additional reference
>     and will have to copy.
> 
> Further, we might have additional references from the LRU pagevecs,
> which will force us to copy instead of being able to reuse. We'll try
> handling such references for some scenarios next. Concurrent writeback
> cannot be handled easily and we'll always have to copy.
> 
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
> 
> [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
> [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
> 
> Reported-by: Liang Zhang <zhangliang5@huawei.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..bcd3b7c50891 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3291,19 +3291,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  	if (PageAnon(vmf->page)) {
>  		struct page *page = vmf->page;
>  
> -		/* PageKsm() doesn't necessarily raise the page refcount */
> -		if (PageKsm(page) || page_count(page) != 1)
> +		/*
> +		 * We have to verify under page lock: these early checks are
> +		 * just an optimization to avoid locking the page and freeing
> +		 * the swapcache if there is little hope that we can reuse.
> +		 *
> +		 * PageKsm() doesn't necessarily raise the page refcount.
> +		 */
> +		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>  			goto copy;
>  		if (!trylock_page(page))
>  			goto copy;
> -		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> +		if (PageSwapCache(page))
> +			try_to_free_swap(page);
> +		if (PageKsm(page) || page_count(page) != 1) {
>  			unlock_page(page);
>  			goto copy;
>  		}
>  		/*
> -		 * Ok, we've got the only map reference, and the only
> -		 * page count reference, and the page is locked,
> -		 * it's dark out, and we're wearing sunglasses. Hit it.
> +		 * Ok, we've got the only page reference from our mapping
> +		 * and the page is locked, it's dark out, and we're wearing
> +		 * sunglasses. Hit it.
>  		 */
>  		unlock_page(page);
>  		wp_page_reuse(vmf);



* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-28 12:53   ` Vlastimil Babka
@ 2022-01-28 13:44     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-28 13:44 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	linux-mm, Nadav Amit

On 28.01.22 13:53, Vlastimil Babka wrote:
> On 1/26/22 10:55, David Hildenbrand wrote:
>> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
>> sub-optimal when it comes to swap+read fault+write fault of anonymous
>> pages that have a single user, visible via a performance degradation in
>> the redis benchmark. Something similar was previously reported [2] by
>> Nadav with a simple reproducer.
> 
> Can we make the description more self-contained? I.e. describe that
> sub-optimal COW means we copy when it's not necessary, and this can happen
> if swap-out is followed by a swap-in for read and then a write fault
> (IIUC), because the swap cache reference increases page_count()...

Sure, I can add some more details.

> 
>> Let's optimize for pages that have been added to the swapcache but only
>> have an exclusive owner. Try removing the swapcache reference if there is
>> hope that we're the exclusive user.
> 
> Can we expect any downside for reclaim efficiency due to the more aggressive
> removal from swapcache? Probably not, as we are doing the removal when the
> page is about to get dirty, so we wouldn't be able to reuse any previously
> swapped out content anyway. Maybe it's even beneficial?

We only try removing the swapcache reference if it's likely that we'll
succeed in reusing the page once the swapcache reference has been
removed (IOW, no other swap entries, no writeback).

So the early "page_count(page) > 1 + PageSwapCache(page)" check is of
big value I think. It should be rare that we remove the reference and
still end up with "page_count(page) != 1".

And if we're reusing the page (re-dirtying it), removing it from the
swapcache is the right thing to do: reuse_swap_page() would have done
the same.

[...]

>>
>> Reported-by: Liang Zhang <zhangliang5@huawei.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!


-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-28  8:41     ` David Hildenbrand
@ 2022-01-28 17:10       ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-01-28 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Fri, Jan 28, 2022 at 12:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.01.22 22:23, Yang Shi wrote:
> > On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> reuse_swap_page() currently indicates if we can write to an anon page
> >> without COW. A COW is required if the page is shared by multiple
> >> processes (either already mapped or via swap entries) or if there is
> >> concurrent writeback that cannot tolerate concurrent page modifications.
> >>
> >> reuse_swap_page() doesn't check for pending references from other
> >> processes that already unmapped the page, however,
> >> is_refcount_suitable() essentially does the same thing in the context of
> >> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> >> we want to remove that function.
> >>
> >> In the context of khugepaged, we are not actually going to write to the
> >> page and we don't really care about other processes mapping the page:
> >> for example, without swap, we don't care about shared pages at all.
> >>
> >> The current logic seems to be:
> >> * Writable: -> Not shared, but might be in the swapcache. Nobody can
> >>   fault it in from the swapcache as there are no other swap entries.
> >> * Readable and not in the swapcache: Might be shared (but nobody can
> >>   fault it in from the swapcache).
> >> * Readable and in the swapcache: Might be shared and someone might be
> >>   able to fault it in from the swapcache. Make sure we're the exclusive
> >>   owner via reuse_swap_page().
> >>
> >> Having to guess due to lack of comments and documentation, the current
> >> logic really only wants to make sure that a page that might be shared
> >> cannot be faulted in from the swapcache while khugepaged is active.
> >> It's hard to guess why that is the case and if it's really still required,
> >> but let's try keeping that logic unmodified.
> >
> > I don't think it could be faulted in while khugepaged is active since
> > khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> > serialized against khugepaged.
>
> It could get faulted in by another process sharing the page, because we
> only synchronize against the current process.

Yes, sure. I suppose you meant do_swap_page() called by another
process. But it is serialized by the page lock, so khugepaged won't see
anything in a limbo state IIUC.
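
i.e., roughly (a simplified locking sketch; do_swap_page() actually
uses lock_page_or_retry()):

	/* do_swap_page(), swapcache path: */
	lock_page(page);	/* may block on khugepaged's page lock */
	...

	/* __collapse_huge_page_isolate(): */
	if (!trylock_page(page)) {
		result = SCAN_PAGE_LOCK;
		goto out;
	}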

>
> >
> > My wild guess is that collapsing shared pages was not supported before
> > v5.8, so we needed reuse_swap_page() to tell us if the page in the swap
> > cache is shared or not. But that is not true anymore. And khugepaged
> > just allocates a THP, copies the data from the base pages to the huge
> > page, then replaces the PTEs with a PMD; it doesn't change the content
> > of the page, so I fail to see a problem with collapsing a shared page
> > in the swap cache. But I'm really not entirely sure, I may miss something...
>
>
> Looking more closely where this logic originates from, it was introduced in:
>
> commit 10359213d05acf804558bda7cc9b8422a828d1cd
> Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Date:   Wed Feb 11 15:28:28 2015 -0800
>
>     mm: incorporate read-only pages into transparent huge pages
>
>     This patch aims to improve THP collapse rates, by allowing THP collapse in
>     the presence of read-only ptes, like those left in place by do_swap_page
>     after a read fault.
>
>     Currently THP can collapse 4kB pages into a THP when there are up to
>     khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
>     the same limit for read-only ptes.
>
>
> The change essentially results in a read-only mapped PTE page getting copied and
> mapped writable via a new PMD-mapped THP.
>
> It mentions do_swap_page(), so I assume it just tried to do what do_swap_page()
> would do when trying to map a page swapped in from the page cache writable
> immediately.
>
> But we differ from do_swap_page() that we're not actually going to map the page
> writable, we're going to copy the page (__collapse_huge_page_copy()) and map
> the copy writable.

Yeah, this is the point. Khugepaged or the process being collapsed
won't write to the original page. Just unshare it.

>
> I assume we can remove that logic.
>
> --
> Thanks,
>
> David / dhildenb
>


end of thread, other threads:[~2022-01-28 17:10 UTC | newest]

Thread overview: 20+ messages
2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
2022-01-26 14:25   ` Matthew Wilcox
2022-01-28 12:53   ` Vlastimil Babka
2022-01-28 13:44     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
2022-01-26 14:31   ` Matthew Wilcox
2022-01-26 14:36     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
2022-01-26 20:36   ` Yang Shi
2022-01-27  8:14     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
2022-01-27 21:23   ` Yang Shi
2022-01-28  8:41     ` David Hildenbrand
2022-01-28 17:10       ` Yang Shi
2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
