* [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap
@ 2022-01-26  9:55 David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand, Nadav Amit

This series is the result of the discussion on the previous approach [1].
More information on the general COW issues can be found there.


This series attempts to optimize and streamline the COW logic for ordinary
anon pages and THP anon pages, fixing two remaining instances of
CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information can
leak from a parent process to a child process via anonymous pages shared
during fork().

This issue, including other related COW issues, has been summarized in [2]:
"
  1. Observing Memory Modifications of Private Pages From A Child Process
  
  Long story short: process-private memory might not be as private as you
  think once you fork(): successive modifications of private memory
  regions in the parent process can still be observed by the child
  process, for example, by smart use of vmsplice()+munmap().
  
  The core problem is that pinning pages readable in a child process, such
  as done via the vmsplice system call, can result in a child process
  observing memory modifications done in the parent process the child is
  not supposed to observe. [1] contains an excellent summary and [2]
  contains further details. This issue was assigned CVE-2020-29374 [9].
  
  For this to trigger, it's required to use a fork() without subsequent
  exec(), for example, as used under Android zygote. Without further
  details about an application that forks less-privileged child processes,
  one cannot really say what's actually affected and what's not -- see the
  details section at the end of this mail for a short sshd/openssh analysis.
  
  While commit 17839856fd58 ("gup: document and work around "COW can break
  either way" issue") fixed this issue and resulted in other problems
  (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
  simplification") re-introduced part of the problem unfortunately.
  
  The original reproducer can be modified quite easily to use THP [3] and
  make the issue appear again on upstream kernels. I modified it to use
  hugetlb [4] and it triggers as well. The problem is certainly less
  severe with hugetlb than with THP; it merely highlights that we still
  have plenty of open holes we should be closing/fixing.
  
  Regarding vmsplice(), the only known workaround is to disallow the
  vmsplice() system call ... or disable THP and hugetlb. But who knows
  what else could be used (RDMA? O_DIRECT?) to achieve the same goal -- in
  the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous pages during swapin, anonymous THP and hugetlb.
This series tackles anonymous pages during swapin and anonymous THP:
* do_swap_page() for handling COW on PTEs *during* swapin directly
* do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
  faults

With this series, we'll apply the same COW logic we have in do_wp_page()
to all swappable anon pages: don't reuse (map writable) the page in
case there are additional references (page_count() != 1). All users of
reuse_swap_page() are removed, and consequently reuse_swap_page() is
removed.
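
Stripped of the swapcache and LRU pagevec handling that the individual
patches add on top, the common reuse rule this series converges on looks
roughly like the following simplified sketch (not the literal code of any
single patch):

	if (!PageKsm(page) && page_count(page) == 1) {
		/* We hold the single remaining reference: reuse. */
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}
	/* Possibly shared or pinned: fall back to copying. */
	return wp_page_copy(vmf);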

In general, we're struggling with the following COW-related issues:
(1) "missed COW": we miss to copy on write and reuse the page (map it
    writable) although we must copy because there are pending references
    from another process to this page. The result is a security issue.
(2) "wrong COW": we copy on write although we wouldn't have to and
    shouldn't: if there are valid GUP references, they will become out of
    sync with the pages mapped into the page table. We fail to detect that
    such a page can be reused safely, especially if never more than a
    single process mapped the page. The result is an intra process
    memory corruption.
(3) "unnecessary COW": we copy on write although we wouldn't have to:
    performance degradation and a temporary increase in swap+memory
    consumption can be the result.

While this series fixes (1) for swappable anon pages, it first tries to
reduce reported cases of (3) as well and as easily as possible to limit the
impact when streamlining. The individual patches try to describe in which
cases we will run into (3).

This series certainly makes (2) worse for THP, because a THP will now get
PTE-mapped on write faults if there are additional references, even if
there was only ever a single process involved: once PTE-mapped, we'll copy
each and every subpage and won't reuse any subpage as long as the
underlying compound page wasn't split.

I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to
mark anon pages that are exclusive to a single process, allow GUP pins only
on such exclusive pages, and allow turning exclusive pages shared
(clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with
PageAnonExclusive set never have to be copied during write faults, but
eventually during fork() if they cannot be turned shared. The improved
reuse logic in this series will essentially also be the logic to reset
PageAnonExclusive. This work will certainly take a while, but I'm planning
on sharing details before having code fully ready.


#1-#5 can be applied independently of the rest. #6-#9 are mostly only
cleanups related to reuse_swap_page().


Notes:
* For now, I'll leave hugetlb code untouched: "unnecessary COW" might
  easily break existing setups because hugetlb pages are a scarce resource
  and we could just end up having to crash the application when we run out
  of hugetlb pages. We have to be very careful and the security aspect with
  hugetlb is most certainly less relevant than for unprivileged anon pages.
* Instead of lru_add_drain() we might actually just drain the lru_add list
  or even just remove the single page of interest from the lru_add list.
  This would require a new helper function (see the sketch after these
  notes), and could be added if the conditional lru_add_drain() turns out
  to be a problem.
* I am still fairly confused about the swapcache handling in the
  khugepaged code. This certainly needs an additional pair of eyes.
* I extended the test case already included in [1] to also test for the
  newly found do_swap_page() case. I'll send that out separately once/if
  this part has been reviewed.
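
For reference, the hypothetical helper mentioned in the first note could
look roughly like the sketch below. The name is made up and the body
assumes the current per-CPU lru_pvecs layout in mm/swap.c; it is purely
illustrative and not part of this series.

/* Drain only this CPU's lru_add pagevec instead of all local pagevecs. */
static void lru_add_drain_lru_add(void)
{
	struct pagevec *pvec;

	local_lock(&lru_pvecs.lock);
	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
	if (pagevec_count(pvec))
		__pagevec_lru_add(pvec);
	local_unlock(&lru_pvecs.lock);
}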

[1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

David Hildenbrand (9):
  mm: optimize do_wp_page() for exclusive pages in the swapcache
  mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  mm: slightly clarify KSM logic in do_swap_page()
  mm: streamline COW logic in do_swap_page()
  mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  mm/khugepaged: remove reuse_swap_page() usage
  mm/swapfile: remove reuse_swap_page()
  mm/huge_memory: remove stale page_trans_huge_mapcount()
  mm/huge_memory: remove stale locking logic from __split_huge_pmd()

 include/linux/mm.h   |   5 --
 include/linux/swap.h |   4 --
 mm/huge_memory.c     |  99 ++++++----------------------------
 mm/khugepaged.c      |  16 ++++--
 mm/memory.c          | 123 +++++++++++++++++++++++++++++++------------
 mm/swapfile.c        | 104 ------------------------------------
 6 files changed, 119 insertions(+), 232 deletions(-)


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
-- 
2.34.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 14:25   ` Matthew Wilcox
  2022-01-28 12:53   ` Vlastimil Babka
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand, Nadav Amit

Liang Zhang reported [1] that the current COW logic in do_wp_page() is
sub-optimal when it comes to swap+read fault+write fault of anonymous
pages that have a single user, visible via a performance degradation in
the redis benchmark. Something similar was previously reported [2] by
Nadav with a simple reproducer.

Let's optimize for pages that have been added to the swapcache but only
have an exclusive owner. Try removing the swapcache reference if there is
hope that we're the exclusive user.

We will fail removing the swapcache reference in two scenarios:
(1) There are additional swap entries referencing the page: copying
    instead of reusing is the right thing to do.
(2) The page is under writeback: theoretically we might be able to reuse
    in some cases, however, we cannot remove the additional reference
    and will have to copy.

Further, we might have additional references from the LRU pagevecs,
which will force us to copy instead of being able to reuse. We'll try
handling such references for some scenarios next. Concurrent writeback
cannot be handled easily and we'll always have to copy.

While at it, remove the superfluous page_mapcount() check: it's
implicitly covered by the page_count() check for ordinary anon pages.

[1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
[2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

Reported-by: Liang Zhang <zhangliang5@huawei.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..bcd3b7c50891 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,19 +3291,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	if (PageAnon(vmf->page)) {
 		struct page *page = vmf->page;
 
-		/* PageKsm() doesn't necessarily raise the page refcount */
-		if (PageKsm(page) || page_count(page) != 1)
+		/*
+		 * We have to verify under page lock: these early checks are
+		 * just an optimization to avoid locking the page and freeing
+		 * the swapcache if there is little hope that we can reuse.
+		 *
+		 * PageKsm() doesn't necessarily raise the page refcount.
+		 */
+		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
 			goto copy;
 		if (!trylock_page(page))
 			goto copy;
-		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+		if (PageKsm(page) || page_count(page) != 1) {
 			unlock_page(page);
 			goto copy;
 		}
 		/*
-		 * Ok, we've got the only map reference, and the only
-		 * page count reference, and the page is locked,
-		 * it's dark out, and we're wearing sunglasses. Hit it.
+		 * Ok, we've got the only page reference from our mapping
+		 * and the page is locked, it's dark out, and we're wearing
+		 * sunglasses. Hit it.
 		 */
 		unlock_page(page);
 		wp_page_reuse(vmf);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 14:31   ` Matthew Wilcox
  2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

For example, if a page just got swapped in via a read fault, the LRU
pagevecs might still hold a reference to the page. If we trigger a
write fault on such a page, the additional reference from the LRU
pagevecs will prohibit reusing the page.

Let's conditionally drain the local LRU pagevecs when we stumble over a
!PageLRU() page. We cannot easily drain remote LRU pagevecs and it might
not be desirable performance-wise. Consequently, this will only avoid
copying in some cases.

We cannot easily handle the following cases and we will always have to
copy:

(1) The page is referenced in the LRU pagevecs of other CPUs. We really
    would have to drain the LRU pagevecs of all CPUs -- most probably
    copying is much cheaper.

(2) The page is already PageLRU() but is getting moved between LRU
    lists, for example, for activation (e.g., mark_page_accessed()),
    deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
    drain mostly unconditionally, which might be bad performance-wise.
    Most probably this won't happen too often in practice.

Note that there are other reasons why an anon page might temporarily not
be PageLRU(): for example, compaction and migration have to isolate LRU
pages from the LRU lists first (isolate_lru_page()), moving them to
temporary local lists and clearing PageLRU() and holding an additional
reference on the page. In that case, we'll always copy.

This change seems to be fairly effective with the reproducer [1] shared
by Nadav, as long as writeback is done synchronously, for example, using
zram. However, with asynchronous writeback, we'll usually fail to free the
swapcache because the page is still under writeback: something we cannot
easily optimize for, and maybe it's not really relevant in practice.

[1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index bcd3b7c50891..61d67ceef734 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 *
 		 * PageKsm() doesn't necessarily raise the page refcount.
 		 */
-		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
+		if (PageKsm(page))
+			goto copy;
+		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
+			goto copy;
+		if (!PageLRU(page))
+			/*
+			 * Note: We cannot easily detect+handle references from
+			 * remote LRU pagevecs or references to PageLRU() pages.
+			 */
+			lru_add_drain();
+		if (page_count(page) > 1 + PageSwapCache(page))
 			goto copy;
 		if (!trylock_page(page))
 			goto copy;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Let's make it clearer that KSM might only have to copy a page
if the page is in the swapcache, not if we allocated a fresh
page and bypassed the swapcache. While at it, add a comment explaining why
this is usually necessary and merge the two swapcache conditions.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 61d67ceef734..ab3153252cfe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3617,21 +3617,29 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	/*
-	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
-	 * release the swapcache from under us.  The page pin, and pte_same
-	 * test below, are not enough to exclude that.  Even if it is still
-	 * swapcache, we need to check that the page's swap has not changed.
-	 */
-	if (unlikely((!PageSwapCache(page) ||
-			page_private(page) != entry.val)) && swapcache)
-		goto out_page;
-
-	page = ksm_might_need_to_copy(page, vma, vmf->address);
-	if (unlikely(!page)) {
-		ret = VM_FAULT_OOM;
-		page = swapcache;
-		goto out_page;
+	if (swapcache) {
+		/*
+		 * Make sure try_to_free_swap or reuse_swap_page or swapoff did
+		 * not release the swapcache from under us.  The page pin, and
+		 * pte_same test below, are not enough to exclude that.  Even if
+		 * it is still swapcache, we need to check that the page's swap
+		 * has not changed.
+		 */
+		if (unlikely(!PageSwapCache(page) ||
+			     page_private(page) != entry.val))
+			goto out_page;
+
+		/*
+		 * KSM sometimes has to copy on read faults, for example, if
+		 * page->index of !PageKSM() pages would be nonlinear inside the
+		 * anon VMA -- PageKSM() is lost on actual swapout.
+		 */
+		page = ksm_might_need_to_copy(page, vma, vmf->address);
+		if (unlikely(!page)) {
+			ret = VM_FAULT_OOM;
+			page = swapcache;
+			goto out_page;
+		}
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 4/9] mm: streamline COW logic in do_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (2 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Currently we have a different COW logic when:
* triggering a read-fault to swapin first and then trigger a write-fault
  -> do_swap_page() + do_wp_page()
* triggering a write-fault to swapin
  -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()

The COW logic in do_swap_page() is different from our reuse logic in
do_wp_page(). The COW logic in do_wp_page() -- page_count() == 1 --
currently makes sure that we certainly don't have a remaining reference, e.g.,
via GUP, on the target page we want to reuse: if there is any unexpected
reference, we have to copy to avoid information leaks.

As do_swap_page() behaves differently, in environments with swap enabled we
can currently have an unintended information leak from the parent to the
child, similar to the one known from CVE-2020-29374:

	1. Parent writes to anonymous page
	-> Page is mapped writable and modified
	2. Page is swapped out
	-> Page is unmapped and replaced by swap entry
	3. fork()
	-> Swap entries are copied to child
	4. Child pins page R/O
	-> Page is mapped R/O into child
	5. Child unmaps page
	-> Child still holds GUP reference
	6. Parent writes to page
	-> Page is reused in do_swap_page()
	-> Child can observe changes

Exchanging 2. and 3. should have the same effect.
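
For illustration, steps 1.-3. can be forced from userspace roughly as in
the following fragment, reusing the vmsplice() sketch from the cover
letter. This assumes MADV_PAGEOUT is available (kernel >= 5.4 and a libc
exposing it) and that swap is configured; otherwise the swapout step is a
no-op.

	memset(mem, 'A', 4096);			/* 1. parent writes     */
	madvise(mem, 4096, MADV_PAGEOUT);	/* 2. force swapout     */
	if (fork() == 0) {			/* 3. swap entry copied */
		/*
		 * 4.-6. proceed as in the cover letter sketch: pin R/O
		 * via vmsplice(), munmap(), then read from the pipe
		 * after the parent's write.
		 */
	}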

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page. We can change the order now that
we use try_to_free_swap(), which doesn't care about the mapcount,
instead of reuse_swap_page().

To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required; to keep it simple for now, don't consider the
page_count() when deciding whether to drain.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ab3153252cfe..ba23d13b8410 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3499,6 +3499,25 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * owner. Try freeing the swapcache to get rid of the swapcache
+	 * reference in case it's likely that we will succeed.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+		page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3640,6 +3659,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			page = swapcache;
 			goto out_page;
 		}
+
+		/*
+		 * If we want to map a page that's in the swapcache writable, we
+		 * have to detect via the refcount if we're really the exclusive
+		 * owner. Try removing the extra reference from the local LRU
+		 * pagevecs if required.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+		    !PageKsm(page) && !PageLRU(page))
+			lru_add_drain();
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3658,19 +3687,25 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
 	 */
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/*
+	 * Same logic as in do_wp_page(); however, optimize for fresh pages
+	 * that are certainly not shared because we just allocated them without
+	 * exposing them to the swapcache.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	    (page != swapcache || page_count(page) == 1)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3696,10 +3731,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (3 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26 20:36   ` Yang Shi
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

We currently have a different COW logic for anon THP than we have for
ordinary anon pages in do_wp_page(): the effect is that the issue reported
in CVE-2020-29374 is currently still possible for anon THP: an unintended
information leak from the parent to the child.

Let's apply the same logic (page_count() == 1), with similar
optimizations to remove additional references first as we really want to
avoid PTE-mapping the THP and copying individual pages as best we can.

If we end up with a page that has page_count() != 1, we'll have to PTE-map
the THP and fallback to do_wp_page(), which will always copy the page.

Note that KSM does not apply to THP.

I. Interaction with the swapcache and writeback

While a THP is in the swapcache, the swapcache holds one reference on each
subpage of the THP. So with PageSwapCache() set, we expect as many
additional references as we have subpages. If we manage to remove the
THP from the swapcache, all these references will be gone.

Usually, a THP is not split when entered into the swapcache and stays a
compound page. However, try_to_unmap() will PTE-map the THP and use PTE
swap entries. There are no PMD swap entries for that purpose; consequently,
we only ever swap individual subpages in via PTEs.

Removing a page from the swapcache can fail either when there are remaining
swap entries (in which case COW is the right thing to do) or if the page is
currently under writeback.

Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
possible only in corner cases, for example, if try_to_unmap() failed
after adding the page to the swapcache. However, it's comparatively easy to
handle.

As we have to fully unmap a THP before starting writeback, and swapin is
always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
swapcache that is under writeback. This should at least leave writeback
out of the picture.

II. Interaction with GUP references

Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
will result in PTE-mapping the THP on a write fault. Similar to ordinary
anon pages, do_wp_page() will have to copy sub-pages and result in a
disconnect between the GUP references and the pages actually mapped into
the page tables. To improve the situation in the future, we'll need
additional handling to mark anonymous pages as definitely exclusive to a
single process, only allow GUP pins on exclusive anon pages, and
disallow sharing of exclusive anon pages with GUP pins e.g., during
fork().

III. Interaction with references from LRU pagevecs

Similar to ordinary anon pages, we can have LRU pagevecs referencing our
THP. Reliably removing such references requires draining LRU pagevecs on
all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
sleep. For now, similar to do_wp_page(), let's conditionally drain the
local LRU pagevecs only if we detect !PageLRU().

IV. Interaction with speculative/temporary references

Similar to ordinary anon pages, other speculative/temporary references on
the THP, for example, from the pagecache or page migration code, will
disallow exclusive reuse of the page. We'll have to PTE-map the THP.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..b6ba88a98266 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	pmd_t orig_pmd = vmf->orig_pmd;
+	int swapcache_refs = 0;
 
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	/* Lock page for reuse_swap_page() */
 	if (!trylock_page(page)) {
 		get_page(page);
 		spin_unlock(vmf->ptl);
@@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part.
+	 * See do_wp_page(): we can only map the page writable if there are
+	 * no additional references.
 	 */
-	if (reuse_swap_page(page)) {
+	if (PageSwapCache(page))
+		swapcache_refs = thp_nr_pages(page);
+	if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
+		goto unlock_fallback;
+	if (!PageLRU(page))
+		lru_add_drain();
+	if (page_count(page) > 1 + swapcache_refs)
+		goto unlock_fallback;
+	if (swapcache_refs)
+		try_to_free_swap(page);
+	if (page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		return VM_FAULT_WRITE;
 	}
 
+unlock_fallback:
 	unlock_page(page);
 	spin_unlock(vmf->ptl);
 fallback:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (4 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-27 21:23   ` Yang Shi
  2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

reuse_swap_page() currently indicates if we can write to an anon page
without COW. A COW is required if the page is shared by multiple
processes (either already mapped or via swap entries) or if there is
concurrent writeback that cannot tolerate concurrent page modifications.

reuse_swap_page() doesn't check for pending references from other
processes that already unmapped the page, however,
is_refcount_suitable() essentially does the same thing in the context of
khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
we want to remove that function.

In the context of khugepaged, we are not actually going to write to the
page and we don't really care about other processes mapping the page:
for example, without swap, we don't care about shared pages at all.

The current logic seems to be:
* Writable: -> Not shared, but might be in the swapcache. Nobody can
  fault it in from the swapcache as there are no other swap entries.
* Readable and not in the swapcache: Might be shared (but nobody can
  fault it in from the swapcache).
* Readable and in the swapcache: Might be shared and someone might be
  able to fault it in from the swapcache. Make sure we're the exclusive
  owner via reuse_swap_page().

Having to guess due to lack of comments and documentation, the current
logic really only wants to make sure that a page that might be shared
cannot be faulted in from the swapcache while khugepaged is active.
It's hard to guess why that is the case and whether it's really still required,
but let's try keeping that logic unmodified.

Instead of relying on reuse_swap_page(), let's unconditionally call
try_to_free_swap(), special-casing PageKsm(). try_to_free_swap() will fail
if there are still swap entries targeting the page or if the page is under
writeback.

After a successful try_to_free_swap(), the page cannot be re-added to the
swapcache because we keep the page locked and removed from the LRU
until we actually perform the copy. So once we have succeeded in removing a
page from the swapcache, it cannot be re-added until we're done copying.
Add a comment stating that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/khugepaged.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..bc0ff598e98f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		if (!pte_write(pteval) && PageSwapCache(page) &&
-				!reuse_swap_page(page)) {
+		    (PageKsm(page) || !try_to_free_swap(page))) {
 			/*
-			 * Page is in the swap cache and cannot be re-used.
-			 * It cannot be collapsed into a THP.
+			 * Possibly shared page cannot be removed from the
+			 * swapcache. It cannot be collapsed into a THP.
 			 */
 			unlock_page(page);
 			result = SCAN_SWAP_CACHE_PAGE;
@@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
+
+		/*
+		 * We're holding the page lock and removed the page from the
+		 * LRU. Once done copying, we'll unlock and readd to the
+		 * LRU via release_pte_page(). If the page is still in the
+		 * swapcache, we're the exclusive owner. Due to the page lock
+		 * the page cannot be added to the swapcache until we're done
+		 * and consequently it cannot be faulted in from the swapcache
+		 * into another process.
+		 */
 		mod_node_page_state(page_pgdat(page),
 				NR_ISOLATED_ANON + page_is_file_lru(page),
 				compound_nr(page));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (5 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

All users are gone, let's remove it. We'll let SWP_STABLE_WRITES stick
around for now, as it might come in handy in the near future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h |   4 --
 mm/swapfile.c        | 104 -------------------------------------------
 2 files changed, 108 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1d38d9475c4d..b546e4bd5c5a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -514,7 +514,6 @@ extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
-extern bool reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
@@ -680,9 +679,6 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(page_trans_huge_mapcount(page) == 1)
-
 static inline int try_to_free_swap(struct page *page)
 {
 	return 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf0df7aa7158..a5183315dc58 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1167,16 +1167,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-	if (p)
-		spin_lock(&p->lock);
-	return p;
-}
-
 static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
 					struct swap_info_struct *q)
 {
@@ -1601,100 +1591,6 @@ static bool page_swapped(struct page *page)
 	return false;
 }
 
-static int page_trans_huge_map_swapcount(struct page *page,
-					 int *total_swapcount)
-{
-	int i, map_swapcount, _total_swapcount;
-	unsigned long offset = 0;
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci = NULL;
-	unsigned char *map = NULL;
-	int swapcount = 0;
-
-	/* hugetlbfs shouldn't call it */
-	VM_BUG_ON_PAGE(PageHuge(page), page);
-
-	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) {
-		if (PageSwapCache(page))
-			swapcount = page_swapcount(page);
-		if (total_swapcount)
-			*total_swapcount = swapcount;
-		return swapcount + page_trans_huge_mapcount(page);
-	}
-
-	page = compound_head(page);
-
-	_total_swapcount = map_swapcount = 0;
-	if (PageSwapCache(page)) {
-		swp_entry_t entry;
-
-		entry.val = page_private(page);
-		si = _swap_info_get(entry);
-		if (si) {
-			map = si->swap_map;
-			offset = swp_offset(entry);
-		}
-	}
-	if (map)
-		ci = lock_cluster(si, offset);
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		int mapcount = atomic_read(&page[i]._mapcount) + 1;
-		if (map) {
-			swapcount = swap_count(map[offset + i]);
-			_total_swapcount += swapcount;
-		}
-		map_swapcount = max(map_swapcount, mapcount + swapcount);
-	}
-	unlock_cluster(ci);
-
-	if (PageDoubleMap(page))
-		map_swapcount -= 1;
-
-	if (total_swapcount)
-		*total_swapcount = _total_swapcount;
-
-	return map_swapcount + compound_mapcount(page);
-}
-
-/*
- * We can write to an anon page without COW if there are no other references
- * to it.  And as a side-effect, free up its swap: because the old content
- * on disk will never be read, and seeking back there to write new content
- * later would only waste time away from clustering.
- */
-bool reuse_swap_page(struct page *page)
-{
-	int count, total_swapcount;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	if (unlikely(PageKsm(page)))
-		return false;
-	count = page_trans_huge_map_swapcount(page, &total_swapcount);
-	if (count == 1 && PageSwapCache(page) &&
-	    (likely(!PageTransCompound(page)) ||
-	     /* The remaining swap count will be freed soon */
-	     total_swapcount == page_swapcount(page))) {
-		if (!PageWriteback(page)) {
-			page = compound_head(page);
-			delete_from_swap_cache(page);
-			SetPageDirty(page);
-		} else {
-			swp_entry_t entry;
-			struct swap_info_struct *p;
-
-			entry.val = page_private(page);
-			p = swap_info_get(entry);
-			if (p->flags & SWP_STABLE_WRITES) {
-				spin_unlock(&p->lock);
-				return false;
-			}
-			spin_unlock(&p->lock);
-		}
-	}
-
-	return count <= 1;
-}
-
 /*
  * If swap is getting full, or if there are no more mappings of this page,
  * then try_to_free_swap is called to free its swap space.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (6 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

All users are gone, let's remove it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  5 -----
 mm/huge_memory.c   | 48 ----------------------------------------------
 2 files changed, 53 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..df6b34d66a9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -820,16 +820,11 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
-int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
-static inline int page_trans_huge_mapcount(struct page *page)
-{
-	return page_mapcount(page);
-}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b6ba88a98266..863c933b3b1e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,54 +2529,6 @@ int total_mapcount(struct page *page)
 	return ret;
 }
 
-/*
- * This calculates accurately how many mappings a transparent hugepage
- * has (unlike page_mapcount() which isn't fully accurate). This full
- * accuracy is primarily needed to know if copy-on-write faults can
- * reuse the page and change the mapping to read-write instead of
- * copying them. At the same time this returns the total_mapcount too.
- *
- * The function returns the highest mapcount any one of the subpages
- * has. If the return value is one, even if different processes are
- * mapping different subpages of the transparent hugepage, they can
- * all reuse it, because each process is reusing a different subpage.
- *
- * The total_mapcount is instead counting all virtual mappings of the
- * subpages. If the total_mapcount is equal to "one", it tells the
- * caller all mappings belong to the same "mm" and in turn the
- * anon_vma of the transparent hugepage can become the vma->anon_vma
- * local one as no other process may be mapping any of the subpages.
- *
- * It would be more accurate to replace page_mapcount() with
- * page_trans_huge_mapcount(), however we only use
- * page_trans_huge_mapcount() in the copy-on-write faults where we
- * need full accuracy to avoid breaking page pinning, because
- * page_trans_huge_mapcount() is slower than page_mapcount().
- */
-int page_trans_huge_mapcount(struct page *page)
-{
-	int i, ret;
-
-	/* hugetlbfs shouldn't call it */
-	VM_BUG_ON_PAGE(PageHuge(page), page);
-
-	if (likely(!PageTransCompound(page)))
-		return atomic_read(&page->_mapcount) + 1;
-
-	page = compound_head(page);
-
-	ret = 0;
-	for (i = 0; i < thp_nr_pages(page); i++) {
-		int mapcount = atomic_read(&page[i]._mapcount) + 1;
-		ret = max(ret, mapcount);
-	}
-
-	if (PageDoubleMap(page))
-		ret -= 1;
-
-	return ret + compound_mapcount(page);
-}
-
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page, int *pextra_pins)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd()
  2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
                   ` (7 preceding siblings ...)
  2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
@ 2022-01-26  9:55 ` David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26  9:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, David Hildenbrand

Let's remove the stale logic that was required for reuse_swap_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 32 +-------------------------------
 1 file changed, 1 insertion(+), 31 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 863c933b3b1e..5cc438f92548 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2158,8 +2158,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	spinlock_t *ptl;
 	struct mmu_notifier_range range;
-	bool do_unlock_page = false;
-	pmd_t _pmd;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address & HPAGE_PMD_MASK,
@@ -2178,35 +2176,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			goto out;
 	}
 
-repeat:
 	if (pmd_trans_huge(*pmd)) {
-		if (!page) {
+		if (!page)
 			page = pmd_page(*pmd);
-			/*
-			 * An anonymous page must be locked, to ensure that a
-			 * concurrent reuse_swap_page() sees stable mapcount;
-			 * but reuse_swap_page() is not used on shmem or file,
-			 * and page lock must not be taken when zap_pmd_range()
-			 * calls __split_huge_pmd() while i_mmap_lock is held.
-			 */
-			if (PageAnon(page)) {
-				if (unlikely(!trylock_page(page))) {
-					get_page(page);
-					_pmd = *pmd;
-					spin_unlock(ptl);
-					lock_page(page);
-					spin_lock(ptl);
-					if (unlikely(!pmd_same(*pmd, _pmd))) {
-						unlock_page(page);
-						put_page(page);
-						page = NULL;
-						goto repeat;
-					}
-					put_page(page);
-				}
-				do_unlock_page = true;
-			}
-		}
 		if (PageMlocked(page))
 			clear_page_mlock(page);
 	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
@@ -2214,8 +2186,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
 out:
 	spin_unlock(ptl);
-	if (do_unlock_page)
-		unlock_page(page);
 	/*
 	 * No need to double call mmu_notifier->invalidate_range() callback.
 	 * They are 3 cases to consider inside __split_huge_pmd_locked():
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
@ 2022-01-26 14:25   ` Matthew Wilcox
  2022-01-28 12:53   ` Vlastimil Babka
  1 sibling, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-26 14:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm, Nadav Amit

On Wed, Jan 26, 2022 at 10:55:49AM +0100, David Hildenbrand wrote:
> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
> sub-optimal when it comes to swap+read fault+write fault of anonymous
> pages that have a single user, visible via a performance degradation in
> the redis benchmark. Something similar was previously reported [2] by
> Nadav with a simple reproducer.
> 
> Let's optimize for pages that have been added to the swapcache but only
> have an exclusive owner. Try removing the swapcache reference if there is
> hope that we're the exclusive user.
> 
> We will fail removing the swapcache reference in two scenarios:
> (1) There are additional swap entries referencing the page: copying
>     instead of reusing is the right thing to do.
> (2) The page is under writeback: theoretically we might be able to reuse
>     in some cases, however, we cannot remove the additional reference
>     and will have to copy.
> 
> Further, we might have additional references from the LRU pagevecs,
> which will force us to copy instead of being able to reuse. We'll try
> handling such references for some scenarios next. Concurrent writeback
> cannot be handled easily and we'll always have to copy.
> 
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
> 
> [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
> [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
> 
> Reported-by: Liang Zhang <zhangliang5@huawei.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
@ 2022-01-26 14:31   ` Matthew Wilcox
  2022-01-26 14:36     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2022-01-26 14:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm

On Wed, Jan 26, 2022 at 10:55:50AM +0100, David Hildenbrand wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index bcd3b7c50891..61d67ceef734 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  		 *
>  		 * PageKsm() doesn't necessarily raise the page refcount.
>  		 */
> -		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
> +		if (PageKsm(page))
> +			goto copy;
> +		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
> +			goto copy;
> +		if (!PageLRU(page))
> +			/*
> +			 * Note: We cannot easily detect+handle references from
> +			 * remote LRU pagevecs or references to PageLRU() pages.
> +			 */
> +			lru_add_drain();
> +		if (page_count(page) > 1 + PageSwapCache(page))
>  			goto copy;

I worry we're starting to get too accurate here.  How about:

		if (PageKsm(page) || page_count(page) > 3)
			goto copy;
		if (!PageLRU(page))
			lru_add_drain();
		if (!trylock_page(page))
			goto copy;
...

>  		if (!trylock_page(page))
>  			goto copy;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
  2022-01-26 14:31   ` Matthew Wilcox
@ 2022-01-26 14:36     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-26 14:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, linux-mm

On 26.01.22 15:31, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 10:55:50AM +0100, David Hildenbrand wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bcd3b7c50891..61d67ceef734 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3298,7 +3298,17 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>  		 *
>>  		 * PageKsm() doesn't necessarily raise the page refcount.
>>  		 */
>> -		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>> +		if (PageKsm(page))
>> +			goto copy;
>> +		if (page_count(page) > 1 + PageSwapCache(page) + !PageLRU(page))
>> +			goto copy;
>> +		if (!PageLRU(page))
>> +			/*
>> +			 * Note: We cannot easily detect+handle references from
>> +			 * remote LRU pagevecs or references to PageLRU() pages.
>> +			 */
>> +			lru_add_drain();
>> +		if (page_count(page) > 1 + PageSwapCache(page))
>>  			goto copy;
> 
> I worry we're starting to get too accurate here.  How about:
> 

In that simplified case we'll essentially trylock_page() even for most
pages that are ordinarily shared by, e.g., 2 processes.

> 		if (PageKsm(page) || page_count(page) > 3)
> 			goto copy;
> 		if (!PageLRU(page))
> 			lru_add_drain();
> 		if (!trylock_page(page))
> 			goto copy;
> ...
> 

I think we might at least want the

if (page_count(page) > 1 + PageSwapCache(page))
	goto copy;

check here.

>>  		if (!trylock_page(page))
>>  			goto copy;
>> -- 
>> 2.34.1
>>
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
@ 2022-01-26 20:36   ` Yang Shi
  2022-01-27  8:14     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-01-26 20:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> We currently have a different COW logic for anon THP than we have for
> ordinary anon pages in do_wp_page(): the effect is that the issue reported
> in CVE-2020-29374 is currently still possible for anon THP: an unintended
> information leak from the parent to the child.
>
> Let's apply the same logic (page_count() == 1), with similar
> optimizations to remove additional references first, as we really want to
> avoid PTE-mapping the THP and copying individual pages as best we can.
>
> If we end up with a page that has page_count() != 1, we'll have to PTE-map
> the THP and fall back to do_wp_page(), which will always copy the page.
>
> Note that KSM does not apply to THP.
>
> I. Interaction with the swapcache and writeback
>
> While a THP is in the swapcache, the swapcache holds one reference on each
> subpage of the THP. So with PageSwapCache() set, we expect as many
> additional references as we have subpages. If we manage to remove the
> THP from the swapcache, all these references will be gone.
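>
> (For example, with 4k base pages: an exclusively owned, PMD-mapped 2 MiB
> THP that is in the swapcache has page_count() == 1 + 512 == 513.)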
>
> Usually, a THP is not split when entered into the swapcache and stays a
> compound page. However, try_to_unmap() will PTE-map the THP and use PTE
> swap entries. There are no PMD swap entries for that purpose, consequently,
> we always only swapin subpages into PTEs.
>
> Removing a page from the swapcache can fail either when there are remaining
> swap entries (in which case COW is the right thing to do) or if the page is
> currently under writeback.
>
> Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
> possible only in corner cases, for example, if try_to_unmap() failed
> after adding the page to the swapcache. However, it's comparatively easy to
> handle.
>
> As we have to fully unmap a THP before starting writeback, and swapin is
> always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
> swapcache that is under writeback. This should at least leave writeback
> out of the picture.
>
> II. Interaction with GUP references
>
> Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
> will result in PTE-mapping the THP on a write fault. Similar to ordinary
> anon pages, do_wp_page() will have to copy sub-pages and result in a
> disconnect between the GUP references and the pages actually mapped into
> the page tables. To improve the situation in the future, we'll need
> additional handling to mark anonymous pages as definitely exclusive to a
> single process, only allow GUP pins on exclusive anon pages, and
> disallow sharing of exclusive anon pages with GUP pins e.g., during
> fork().
>
> III. Interaction with references from LRU pagevecs
>
> Similar to ordinary anon pages, we can have LRU pagevecs referencing our
> THP. Reliably removing such references requires draining LRU pagevecs on
> all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
> sleep. For now, similar to do_wp_page(), let's conditionally drain the
> local LRU pagevecs only if we detect !PageLRU().
>
> IV. Interaction with speculative/temporary references
>
> Similar to ordinary anon pages, other speculative/temporary references on
> the THP, for example, from the pagecache or page migration code, will
> disallow exclusive reuse of the page. We'll have to PTE-map the THP.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..b6ba88a98266 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         struct page *page;
>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>         pmd_t orig_pmd = vmf->orig_pmd;
> +       int swapcache_refs = 0;
>
>         vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>         VM_BUG_ON_VMA(!vma->anon_vma, vma);
> @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         page = pmd_page(orig_pmd);
>         VM_BUG_ON_PAGE(!PageHead(page), page);
>
> -       /* Lock page for reuse_swap_page() */
>         if (!trylock_page(page)) {
>                 get_page(page);
>                 spin_unlock(vmf->ptl);
> @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>         }
>
>         /*
> -        * We can only reuse the page if nobody else maps the huge page or it's
> -        * part.
> +        * See do_wp_page(): we can only map the page writable if there are
> +        * no additional references.
>          */
> -       if (reuse_swap_page(page)) {
> +       if (PageSwapCache(page))
> +               swapcache_refs = thp_nr_pages(page);
> +       if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
> +               goto unlock_fallback;
> +       if (!PageLRU(page))
> +               lru_add_drain();

IMHO, draining the LRU doesn't help much for THP, since a THP is
drained to the LRU immediately once it is added to a pagevec.

> +       if (page_count(page) > 1 + swapcache_refs)
> +               goto unlock_fallback;
> +       if (swapcache_refs)
> +               try_to_free_swap(page);
> +       if (page_count(page) == 1) {
>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
>                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -1333,6 +1343,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>                 return VM_FAULT_WRITE;
>         }
>
> +unlock_fallback:
>         unlock_page(page);
>         spin_unlock(vmf->ptl);
>  fallback:
> --
> 2.34.1
>


* Re: [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
  2022-01-26 20:36   ` Yang Shi
@ 2022-01-27  8:14     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-27  8:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On 26.01.22 21:36, Yang Shi wrote:
> On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> We currently have a different COW logic for anon THP than we have for
>> ordinary anon pages in do_wp_page(): the effect is that the issue reported
>> in CVE-2020-29374 is currently still possible for anon THP: an unintended
>> information leak from the parent to the child.
>>
>> Let's apply the same logic (page_count() == 1), with similar
>> optimizations to remove additional references first, as we really want to
>> avoid PTE-mapping the THP and copying individual pages as best we can.
>>
>> If we end up with a page that has page_count() != 1, we'll have to PTE-map
>> the THP and fall back to do_wp_page(), which will always copy the page.
>>
>> Note that KSM does not apply to THP.
>>
>> I. Interaction with the swapcache and writeback
>>
>> While a THP is in the swapcache, the swapcache holds one reference on each
>> subpage of the THP. So with PageSwapCache() set, we expect as many
>> additional references as we have subpages. If we manage to remove the
>> THP from the swapcache, all these references will be gone.
>>
>> Usually, a THP is not split when entered into the swapcache and stays a
>> compound page. However, try_to_unmap() will PTE-map the THP and use PTE
>> swap entries. There are no PMD swap entries for that purpose, consequently,
>> we always only swapin subpages into PTEs.
>>
>> Removing a page from the swapcache can fail either when there are remaining
>> swap entries (in which case COW is the right thing to do) or if the page is
>> currently under writeback.
>>
>> Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
>> possible only in corner cases, for example, if try_to_unmap() failed
>> after adding the page to the swapcache. However, it's comparatively easy to
>> handle.
>>
>> As we have to fully unmap a THP before starting writeback, and swapin is
>> always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in the
>> swapcache that is under writeback. This should at least leave writeback
>> out of the picture.
>>
>> II. Interaction with GUP references
>>
>> Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
>> will result in PTE-mapping the THP on a write fault. Similar to ordinary
>> anon pages, do_wp_page() will have to copy sub-pages and result in a
>> disconnect between the GUP references and the pages actually mapped into
>> the page tables. To improve the situation in the future, we'll need
>> additional handling to mark anonymous pages as definitely exclusive to a
>> single process, only allow GUP pins on exclusive anon pages, and
>> disallow sharing of exclusive anon pages with GUP pins e.g., during
>> fork().
>>
>> III. Interaction with references from LRU pagevecs
>>
>> Similar to ordinary anon pages, we can have LRU pagevecs referencing our
>> THP. Reliably removing such references requires draining LRU pagevecs on
>> all CPUs -- lru_add_drain_all() -- a possibly expensive operation that can
>> sleep. For now, similar to do_wp_page(), let's conditionally drain the
>> local LRU pagevecs only if we detect !PageLRU().
>>
>> IV. Interaction with speculative/temporary references
>>
>> Similar to ordinary anon pages, other speculative/temporary references on
>> the THP, for example, from the pagecache or page migration code, will
>> disallow exclusive reuse of the page. We'll have to PTE-map the THP.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/huge_memory.c | 19 +++++++++++++++----
>>  1 file changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 406a3c28c026..b6ba88a98266 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1286,6 +1286,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         struct page *page;
>>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>         pmd_t orig_pmd = vmf->orig_pmd;
>> +       int swapcache_refs = 0;
>>
>>         vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
>>         VM_BUG_ON_VMA(!vma->anon_vma, vma);
>> @@ -1303,7 +1304,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         page = pmd_page(orig_pmd);
>>         VM_BUG_ON_PAGE(!PageHead(page), page);
>>
>> -       /* Lock page for reuse_swap_page() */
>>         if (!trylock_page(page)) {
>>                 get_page(page);
>>                 spin_unlock(vmf->ptl);
>> @@ -1319,10 +1319,20 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
>>         }
>>
>>         /*
>> -        * We can only reuse the page if nobody else maps the huge page or it's
>> -        * part.
>> +        * See do_wp_page(): we can only map the page writable if there are
>> +        * no additional references.
>>          */
>> -       if (reuse_swap_page(page)) {
>> +       if (PageSwapCache(page))
>> +               swapcache_refs = thp_nr_pages(page);
>> +       if (page_count(page) > 1 + swapcache_refs + !PageLRU(page))
>> +               goto unlock_fallback;
>> +       if (!PageLRU(page))
>> +               lru_add_drain();
> 
> IMHO, draining the LRU doesn't help much for THP, since a THP is
> drained to the LRU immediately once it is added to a pagevec.

Oh, thanks, I think you're right. The interesting bit is

static bool pagevec_add_and_need_flush(struct pagevec *pvec,
				       struct page *page)
{
	bool ret = false;

	if (!pagevec_add(pvec, page) || PageCompound(page) ||
			lru_cache_disabled())
		ret = true;

	return ret;
}

Which indeed drains after adding it to the pagevec.
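
So the !PageLRU() special-casing can go away for THP, and the reuse
check might then boil down to (untested sketch):

	/* LRU pagevecs are drained right after a THP is added to them. */
	if (page_count(page) > 1 + swapcache_refs)
		goto unlock_fallback;
	if (swapcache_refs)
		try_to_free_swap(page);
	if (page_count(page) == 1) {
		/* No other references: we can reuse and map writable. */
		...
	}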

Will adjust the patch and update the description/comment accordingly,
thanks!

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
@ 2022-01-27 21:23   ` Yang Shi
  2022-01-28  8:41     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Yang Shi @ 2022-01-27 21:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> reuse_swap_page() currently indicates if we can write to an anon page
> without COW. A COW is required if the page is shared by multiple
> processes (either already mapped or via swap entries) or if there is
> concurrent writeback that cannot tolerate concurrent page modifications.
>
> reuse_swap_page() doesn't check for pending references from other
> processes that already unmapped the page, however,
> is_refcount_suitable() essentially does the same thing in the context of
> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> we want to remove that function.
>
> In the context of khugepaged, we are not actually going to write to the
> page and we don't really care about other processes mapping the page:
> for example, without swap, we don't care about shared pages at all.
>
> The current logic seems to be:
> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>   fault it in from the swapcache as there are no other swap entries.
> * Readable and not in the swapcache: Might be shared (but nobody can
>   fault it in from the swapcache).
> * Readable and in the swapcache: Might be shared and someone might be
>   able to fault it in from the swapcache. Make sure we're the exclusive
>   owner via reuse_swap_page().
>
> Having to guess due to lack of comments and documentation, the current
> logic really only wants to make sure that a page that might be shared
> cannot be faulted in from the swapcache while khugepaged is active.
> It's hard to guess why that is the case and if it's really still required,
> but let's try keeping that logic unmodified.

I don't think it could be faulted in while khugepaged is active since
khugepaged does hold mmap_lock in write mode IIUC. So page fault is
serialized against khugepaged.

My wild guess is that collapsing shared pages was not supported before
v5.8, so we needed reuse_swap_page() to tell us if the page in the swap
cache is shared or not. But that is not true anymore. And khugepaged
just allocates a THP, copies the data from the base pages to the huge
page, then replaces the PTEs with a PMD; it doesn't change the content
of the page, so I fail to see a problem with collapsing a shared page
in the swap cache. But I'm really not entirely sure, I may miss something...



>
> Instead of relying on reuse_swap_page(), let's unconditionally
> try_to_free_swap(), special casing PageKsm(). try_to_free_swap() will fail
> if there are still swap entries targeting the page or if the page is under
> writeback.
>
> After a successful try_to_free_swap() that page cannot be readded to the
> swapcache because we're keeping the page locked and removed from the LRU
> until we actually perform the copy. So once we succeeded removing a page
> from the swapcache, it cannot be re-added until we're done copying. Add a
> comment stating that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/khugepaged.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 35f14d0a00a6..bc0ff598e98f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -683,10 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 if (!pte_write(pteval) && PageSwapCache(page) &&
> -                               !reuse_swap_page(page)) {
> +                   (PageKsm(page) || !try_to_free_swap(page))) {
>                         /*
> -                        * Page is in the swap cache and cannot be re-used.
> -                        * It cannot be collapsed into a THP.
> +                        * Possibly shared page cannot be removed from the
> +                        * swapcache. It cannot be collapsed into a THP.
>                          */
>                         unlock_page(page);
>                         result = SCAN_SWAP_CACHE_PAGE;
> @@ -702,6 +702,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         result = SCAN_DEL_PAGE_LRU;
>                         goto out;
>                 }
> +
> +               /*
> +                * We're holding the page lock and removed the page from the
> +                * LRU. Once done copying, we'll unlock and readd to the
> +                * LRU via release_pte_page(). If the page is still in the
> +                * swapcache, we're the exclusive owner. Due to the page lock
> +                * the page cannot be added to the swapcache until we're done
> +                * and consequently it cannot be faulted in from the swapcache
> +                * into another process.
> +                */
>                 mod_node_page_state(page_pgdat(page),
>                                 NR_ISOLATED_ANON + page_is_file_lru(page),
>                                 compound_nr(page));
> --
> 2.34.1
>


* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-27 21:23   ` Yang Shi
@ 2022-01-28  8:41     ` David Hildenbrand
  2022-01-28 17:10       ` Yang Shi
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-01-28  8:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On 27.01.22 22:23, Yang Shi wrote:
> On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> reuse_swap_page() currently indicates if we can write to an anon page
>> without COW. A COW is required if the page is shared by multiple
>> processes (either already mapped or via swap entries) or if there is
>> concurrent writeback that cannot tolerate concurrent page modifications.
>>
>> reuse_swap_page() doesn't check for pending references from other
>> processes that already unmapped the page, however,
>> is_refcount_suitable() essentially does the same thing in the context of
>> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
>> we want to remove that function.
>>
>> In the context of khugepaged, we are not actually going to write to the
>> page and we don't really care about other processes mapping the page:
>> for example, without swap, we don't care about shared pages at all.
>>
>> The current logic seems to be:
>> * Writable: -> Not shared, but might be in the swapcache. Nobody can
>>   fault it in from the swapcache as there are no other swap entries.
>> * Readable and not in the swapcache: Might be shared (but nobody can
>>   fault it in from the swapcache).
>> * Readable and in the swapcache: Might be shared and someone might be
>>   able to fault it in from the swapcache. Make sure we're the exclusive
>>   owner via reuse_swap_page().
>>
>> Having to guess due to lack of comments and documentation, the current
>> logic really only wants to make sure that a page that might be shared
>> cannot be faulted in from the swapcache while khugepaged is active.
>> It's hard to guess why that is the case and if it's really still required,
>> but let's try keeping that logic unmodified.
> 
> I don't think it could be faulted in while khugepaged is active since
> khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> serialized against khugepaged.

It could get faulted in by another process sharing the page, because we 
only synchronize against the current process.

> 
> My wild guess is that collapsing shared pages was not supported before
> v5.8, so we need reuse_swap_page() to tell us if the page in swap
> cache is shared or not. But it is not true anymore. And khugepaged
> just allocates a THP then copy the data from base pages to huge page
> then replace PTEs to PMD, it doesn't change the content of the page,
> so I failed to see a problem by collapsing a shared page in swap
> cache. But I'm really not entirely sure, I may miss something...


Looking more closely where this logic originates from, it was introduced in:

commit 10359213d05acf804558bda7cc9b8422a828d1cd
Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Date:   Wed Feb 11 15:28:28 2015 -0800

    mm: incorporate read-only pages into transparent huge pages
    
    This patch aims to improve THP collapse rates, by allowing THP collapse in
    the presence of read-only ptes, like those left in place by do_swap_page
    after a read fault.
    
    Currently THP can collapse 4kB pages into a THP when there are up to
    khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
    the same limit for read-only ptes.


The change essentially results in a read-only mapped PTE page getting copied and
mapped writable via a new PMD-mapped THP.

It mentions do_swap_page(), so I assume it just tried to do what
do_swap_page() would do when immediately mapping a page swapped in from
the swapcache writable.

But we differ from do_swap_page() that we're not actually going to map the page
writable, we're going to copy the page (__collapse_huge_page_copy()) and map
the copy writable.

I assume we can remove that logic.
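
IOW (untested sketch), __collapse_huge_page_isolate() would simply lose
the swapcache special case:

-		if (!pte_write(pteval) && PageSwapCache(page) &&
-		    (PageKsm(page) || !try_to_free_swap(page))) {
-			unlock_page(page);
-			result = SCAN_SWAP_CACHE_PAGE;
-			goto out;
-		}

and rely on the page lock plus the existing is_refcount_suitable()
check, which already accounts for swapcache references:

		if (!is_refcount_suitable(page)) {
			unlock_page(page);
			result = SCAN_PAGE_COUNT;
			goto out;
		}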

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
  2022-01-26 14:25   ` Matthew Wilcox
@ 2022-01-28 12:53   ` Vlastimil Babka
  2022-01-28 13:44     ` David Hildenbrand
  1 sibling, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2022-01-28 12:53 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	linux-mm, Nadav Amit

On 1/26/22 10:55, David Hildenbrand wrote:
> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
> sub-optimal when it comes to swap+read fault+write fault of anonymous
> pages that have a single user, visible via a performance degradation in
> the redis benchmark. Something similar was previously reported [2] by
> Nadav with a simple reproducer.

Can we make the description more self-contained? I.e. describe that
sub-optimal COW means we copy when it's not necessary, and this can happen
if swap-out is followed by a swap-in for read and then a write fault
(IIUC), because the swap cache reference increases page_count()...
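
Something like this sequence (my understanding, for illustration):

	swap-out:        page unmapped, PTE becomes a swap entry
	swap-in (read):  page read into the swapcache and mapped R/O
	                 -> page_count() == 2 (1 mapping + 1 swapcache ref)
	write fault:     old "page_count(page) != 1" check -> copy,
	                 although there is only a single user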

> Let's optimize for pages that have been added to the swapcache but only
> have an exclusive owner. Try removing the swapcache reference if there is
> hope that we're the exclusive user.

Can we expect any downside for reclaim efficiency due to the more aggressive
removal from swapcache? Probably not, as we are doing the removal when the
page is about to get dirty, so we wouldn't be able to reuse any previously
swapped out content anyway. Maybe it's even beneficial?

> We will fail removing the swapcache reference in two scenarios:
> (1) There are additional swap entries referencing the page: copying
>     instead of reusing is the right thing to do.
> (2) The page is under writeback: theoretically we might be able to reuse
>     in some cases, however, we cannot remove the additional reference
>     and will have to copy.
> 
> Further, we might have additional references from the LRU pagevecs,
> which will force us to copy instead of being able to reuse. We'll try
> handling such references for some scenarios next. Concurrent writeback
> cannot be handled easily and we'll always have to copy.
> 
> While at it, remove the superfluous page_mapcount() check: it's
> implicitly covered by the page_count() for ordinary anon pages.
> 
> [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
> [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
> 
> Reported-by: Liang Zhang <zhangliang5@huawei.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..bcd3b7c50891 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3291,19 +3291,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  	if (PageAnon(vmf->page)) {
>  		struct page *page = vmf->page;
>  
> -		/* PageKsm() doesn't necessarily raise the page refcount */
> -		if (PageKsm(page) || page_count(page) != 1)
> +		/*
> +		 * We have to verify under page lock: these early checks are
> +		 * just an optimization to avoid locking the page and freeing
> +		 * the swapcache if there is little hope that we can reuse.
> +		 *
> +		 * PageKsm() doesn't necessarily raise the page refcount.
> +		 */
> +		if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
>  			goto copy;
>  		if (!trylock_page(page))
>  			goto copy;
> -		if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> +		if (PageSwapCache(page))
> +			try_to_free_swap(page);
> +		if (PageKsm(page) || page_count(page) != 1) {
>  			unlock_page(page);
>  			goto copy;
>  		}
>  		/*
> -		 * Ok, we've got the only map reference, and the only
> -		 * page count reference, and the page is locked,
> -		 * it's dark out, and we're wearing sunglasses. Hit it.
> +		 * Ok, we've got the only page reference from our mapping
> +		 * and the page is locked, it's dark out, and we're wearing
> +		 * sunglasses. Hit it.
>  		 */
>  		unlock_page(page);
>  		wp_page_reuse(vmf);



* Re: [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache
  2022-01-28 12:53   ` Vlastimil Babka
@ 2022-01-28 13:44     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-01-28 13:44 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	linux-mm, Nadav Amit

On 28.01.22 13:53, Vlastimil Babka wrote:
> On 1/26/22 10:55, David Hildenbrand wrote:
>> Liang Zhang reported [1] that the current COW logic in do_wp_page() is
>> sub-optimal when it comes to swap+read fault+write fault of anonymous
>> pages that have a single user, visible via a performance degradation in
>> the redis benchmark. Something similar was previously reported [2] by
>> Nadav with a simple reproducer.
> 
> Can we make the description more self-contained? I.e. describe that
> sub-optimal COW means we copy when it's not necessary, and this can happen
> if swap-out is followed by a swap-in for read and then a write fault
> (IIUC), because the swap cache reference increases page_count()...

Sure, I can add some more details.

> 
>> Let's optimize for pages that have been added to the swapcache but only
>> have an exclusive owner. Try removing the swapcache reference if there is
>> hope that we're the exclusive user.
> 
> Can we expect any downside for reclaim efficiency due to the more aggressive
> removal from swapcache? Probably not, as we are doing the removal when the
> page is about to get dirty, so we wouldn't be able to reuse any previously
> swapped out content anyway. Maybe it's even beneficial?

We only try removing the swapcache reference if it's likely that we'll
succeed in reusing the page once the swapcache reference has been
removed (IOW, no other swap entries, no writeback).

So the early "page_count(page) > 1 + PageSwapCache(page)" check is of
big value I think. It should be rare that we remove the reference and
still end up with "page_count(page) != 1".

And if we're reusing the page (re-dirtying it), removing it from the
swapcache is the right thing to do: reuse_swap_page() would have done
the same.

[...]

>>
>> Reported-by: Liang Zhang <zhangliang5@huawei.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!


-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage
  2022-01-28  8:41     ` David Hildenbrand
@ 2022-01-28 17:10       ` Yang Shi
  0 siblings, 0 replies; 20+ messages in thread
From: Yang Shi @ 2022-01-28 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Hugh Dickins,
	Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard,
	Jason Gunthorpe, Mike Kravetz, Mike Rapoport,
	Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Jann Horn,
	Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin,
	Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig,
	Oleg Nesterov, Jan Kara, Liang Zhang, Linux MM

On Fri, Jan 28, 2022 at 12:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.01.22 22:23, Yang Shi wrote:
> > On Wed, Jan 26, 2022 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> reuse_swap_page() currently indicates if we can write to an anon page
> >> without COW. A COW is required if the page is shared by multiple
> >> processes (either already mapped or via swap entries) or if there is
> >> concurrent writeback that cannot tolerate concurrent page modifications.
> >>
> >> reuse_swap_page() doesn't check for pending references from other
> >> processes that already unmapped the page, however,
> >> is_refcount_suitable() essentially does the same thing in the context of
> >> khugepaged. khugepaged is the last remaining user of reuse_swap_page() and
> >> we want to remove that function.
> >>
> >> In the context of khugepaged, we are not actually going to write to the
> >> page and we don't really care about other processes mapping the page:
> >> for example, without swap, we don't care about shared pages at all.
> >>
> >> The current logic seems to be:
> >> * Writable: -> Not shared, but might be in the swapcache. Nobody can
> >>   fault it in from the swapcache as there are no other swap entries.
> >> * Readable and not in the swapcache: Might be shared (but nobody can
> >>   fault it in from the swapcache).
> >> * Readable and in the swapcache: Might be shared and someone might be
> >>   able to fault it in from the swapcache. Make sure we're the exclusive
> >>   owner via reuse_swap_page().
> >>
> >> Having to guess due to lack of comments and documentation, the current
> >> logic really only wants to make sure that a page that might be shared
> >> cannot be faulted in from the swapcache while khugepaged is active.
> >> It's hard to guess why that is the case and if it's really still required,
> >> but let's try keeping that logic unmodified.
> >
> > I don't think it could be faulted in while khugepaged is active since
> > khugepaged does hold mmap_lock in write mode IIUC. So page fault is
> > serialized against khugepaged.
>
> It could get faulted in by another process sharing the page, because we
> only synchronize against the current process.

Yes, sure. I suppose you meant do_swap_page() called by another
process. But it is serialized by the page lock, so khugepaged won't see
anything in a limbo state IIUC.
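
i.e., roughly (a simplified locking sketch; do_swap_page() actually
uses lock_page_or_retry()):

	/* do_swap_page(), swapcache path: */
	lock_page(page);	/* may block on khugepaged's page lock */
	...

	/* __collapse_huge_page_isolate(): */
	if (!trylock_page(page)) {
		result = SCAN_PAGE_LOCK;
		goto out;
	}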

>
> >
> > My wild guess is that collapsing shared pages was not supported before
> > v5.8, so we needed reuse_swap_page() to tell us if the page in the swap
> > cache is shared or not. But that is not true anymore. And khugepaged
> > just allocates a THP, copies the data from the base pages to the huge
> > page, then replaces the PTEs with a PMD; it doesn't change the content
> > of the page, so I fail to see a problem with collapsing a shared page
> > in the swap cache. But I'm really not entirely sure, I may miss something...
>
>
> Looking more closely where this logic originates from, it was introduced in:
>
> commit 10359213d05acf804558bda7cc9b8422a828d1cd
> Author: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Date:   Wed Feb 11 15:28:28 2015 -0800
>
>     mm: incorporate read-only pages into transparent huge pages
>
>     This patch aims to improve THP collapse rates, by allowing THP collapse in
>     the presence of read-only ptes, like those left in place by do_swap_page
>     after a read fault.
>
>     Currently THP can collapse 4kB pages into a THP when there are up to
>     khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch applies
>     the same limit for read-only ptes.
>
>
> The change essentially results in a read-only mapped PTE page getting copied and
> mapped writable via a new PMD-mapped THP.
>
> It mentions do_swap_page(), so I assume it just tried to do what do_swap_page()
> would do when trying to map a page swapped in from the page cache writable
> immediately.
>
> But we differ from do_swap_page() that we're not actually going to map the page
> writable, we're going to copy the page (__collapse_huge_page_copy()) and map
> the copy writable.

Yeah, this is the point. Khugepaged or the process being collapsed
won't write to the original page. Just unshare it.

>
> I assume we can remove that logic.
>
> --
> Thanks,
>
> David / dhildenb
>


end of thread, other threads:[~2022-01-28 17:10 UTC | newest]

Thread overview: 20+ messages
2022-01-26  9:55 [PATCH RFC v2 0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache David Hildenbrand
2022-01-26 14:25   ` Matthew Wilcox
2022-01-28 12:53   ` Vlastimil Babka
2022-01-28 13:44     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs David Hildenbrand
2022-01-26 14:31   ` Matthew Wilcox
2022-01-26 14:36     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 3/9] mm: slightly clarify KSM logic in do_swap_page() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 4/9] mm: streamline COW " David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() David Hildenbrand
2022-01-26 20:36   ` Yang Shi
2022-01-27  8:14     ` David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 6/9] mm/khugepaged: remove reuse_swap_page() usage David Hildenbrand
2022-01-27 21:23   ` Yang Shi
2022-01-28  8:41     ` David Hildenbrand
2022-01-28 17:10       ` Yang Shi
2022-01-26  9:55 ` [PATCH RFC v2 7/9] mm/swapfile: remove reuse_swap_page() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() David Hildenbrand
2022-01-26  9:55 ` [PATCH RFC v2 9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd() David Hildenbrand
