From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Shakeel Butt <shakeelb@google.com>,
John Hubbard <jhubbard@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Mike Rapoport <rppt@linux.ibm.com>,
Yang Shi <shy828301@gmail.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Michal Hocko <mhocko@kernel.org>, Nadav Amit <namit@vmware.com>,
Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>,
Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
Jan Kara <jack@suse.cz>, Liang Zhang <zhangliang5@huawei.com>,
linux-mm@kvack.org, David Hildenbrand <david@redhat.com>
Subject: [PATCH v3 4/9] mm: streamline COW logic in do_swap_page()
Date: Mon, 31 Jan 2022 17:29:34 +0100 [thread overview]
Message-ID: <20220131162940.210846-5-david@redhat.com> (raw)
In-Reply-To: <20220131162940.210846-1-david@redhat.com>
Currently we have a different COW logic when:
* triggering a read-fault to swapin first and then trigger a write-fault
-> do_swap_page() + do_wp_page()
* triggering a write-fault to swapin
-> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()
The COW logic in do_swap_page() is different than our reuse logic in
do_wp_page(). The COW logic in do_wp_page() -- page_count() == 1 -- makes
currently sure that we certainly don't have a remaining reference, e.g.,
via GUP, on the target page we want to reuse: if there is any unexpected
reference, we have to copy to avoid information leaks.
As do_swap_page() behaves differently, in environments with swap enabled we
can currently have an unintended information leak from the parent to the
child, similar as known from CVE-2020-29374:
1. Parent writes to anonymous page
-> Page is mapped writable and modified
2. Page is swapped out
-> Page is unmapped and replaced by swap entry
3. fork()
-> Swap entries are copied to child
4. Child pins page R/O
-> Page is mapped R/O into child
5. Child unmaps page
-> Child still holds GUP reference
6. Parent writes to page
-> Page is reused in do_swap_page()
-> Child can observe changes
Exchanging 2. and 3. should have the same effect.
Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page. We can change the order now that
we use try_to_free_swap(), which doesn't care about the mapcount,
instead of reuse_swap_page().
To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required, however, don't consider the page_count() when
deciding whether to drain to keep it simple for now.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 43 insertions(+), 12 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3c91294cca98..c6177d897964 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3497,6 +3497,25 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
+static inline bool should_try_to_free_swap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int fault_flags)
+{
+ if (!PageSwapCache(page))
+ return false;
+ if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+ PageMlocked(page))
+ return true;
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * user. Try freeing the swapcache to get rid of the swapcache
+ * reference only in case it's likely that we'll be the exlusive user.
+ */
+ return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+ page_count(page) == 2;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3638,6 +3657,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page = swapcache;
goto out_page;
}
+
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * owner. Try removing the extra reference from the local LRU
+ * pagevecs if required.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+ !PageKsm(page) && !PageLRU(page))
+ lru_add_drain();
}
cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3656,19 +3685,25 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/*
- * The page isn't present yet, go ahead with the fault.
- *
- * Be careful about the sequence of operations here.
- * To get its accounting right, reuse_swap_page() must be called
- * while the page is counted on swap but not yet in mapcount i.e.
- * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
- * must be called after the swap_free(), or it will never succeed.
+ * Remove the swap entry and conditionally try to free up the swapcache.
+ * We're already holding a reference on the page but haven't mapped it
+ * yet.
*/
+ swap_free(entry);
+ if (should_try_to_free_swap(page, vma, vmf->flags))
+ try_to_free_swap(page);
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
- if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+ /*
+ * Same logic as in do_wp_page(); however, optimize for fresh pages
+ * that are certainly not shared because we just allocated them without
+ * exposing them to the swapcache.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+ (page != swapcache || page_count(page) == 1)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
@@ -3694,10 +3729,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
- swap_free(entry);
- if (mem_cgroup_swap_full(page) ||
- (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
- try_to_free_swap(page);
unlock_page(page);
if (page != swapcache && swapcache) {
/*
--
2.34.1
next prev parent reply other threads:[~2022-01-31 16:33 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-31 16:29 [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
2022-01-31 16:29 ` [PATCH v3 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
2022-01-31 16:29 ` [PATCH v3 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
2022-03-09 17:53 ` Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
2022-03-09 18:03 ` Vlastimil Babka
2022-03-09 18:48 ` Yang Shi
2022-03-09 19:15 ` David Hildenbrand
2022-03-09 20:50 ` Andrew Morton
2022-01-31 16:29 ` David Hildenbrand [this message]
2022-03-10 9:41 ` [PATCH v3 4/9] mm: streamline COW " Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
2022-03-10 9:52 ` Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
2022-02-01 21:31 ` Yang Shi
2022-03-10 10:37 ` Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 7/9] mm/swapfile: remove stale reuse_swap_page() David Hildenbrand
2022-02-02 14:35 ` Christoph Hellwig
2022-02-02 17:01 ` David Hildenbrand
2022-03-10 10:44 ` Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
2022-03-10 10:50 ` Vlastimil Babka
2022-01-31 16:29 ` [PATCH v3 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
2022-03-10 11:02 ` Vlastimil Babka
2022-02-01 18:59 ` [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220131162940.210846-5-david@redhat.com \
--to=david@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=ddutile@redhat.com \
--cc=guro@fb.com \
--cc=hch@lst.de \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=jannh@google.com \
--cc=jgg@nvidia.com \
--cc=jhubbard@nvidia.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=mike.kravetz@oracle.com \
--cc=namit@vmware.com \
--cc=oleg@redhat.com \
--cc=peterx@redhat.com \
--cc=riel@surriel.com \
--cc=rientjes@google.com \
--cc=rppt@linux.ibm.com \
--cc=shakeelb@google.com \
--cc=shy828301@gmail.com \
--cc=torvalds@linux-foundation.org \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=zhangliang5@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).