* [PATCH v2 0/8] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages
From: David Hildenbrand @ 2022-03-29 16:43 UTC
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

More information on the general COW issues can be found at [2]. This series
is based on latest linus/master and [1]:
	[PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of
	anonymous pages

v2 is located at:
	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_3_v2


This series fixes memory corruptions that occur when a GUP R/W reference
(FOLL_WRITE | FOLL_GET) was taken on an anonymous page and the COW logic
fails to detect exclusivity of the page, consequently replacing the
anonymous page by a copy in the page table: the GUP reference loses
synchronicity with the pages mapped into the page tables. This series
focuses on x86, arm64, s390x and ppc64/book3s -- other architectures are
fairly easy to support by implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.

This primarily fixes the O_DIRECT memory corruptions that can happen on
concurrent swapout, whereby we lose DMA reads to a page (modifying the user
page by writing to it).

O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM)
DMA from/to a user page. In the long run, we want to convert it to properly
use FOLL_PIN, and John is working on it, but that might take a while and
might not be easy to backport. In the meantime, let's restore what used to
work before we started modifying our COW logic: make R/W FOLL_GET
references reliable as long as there is no fork() after GUP involved.

This is just the natural follow-up of part 2, which will also further
reduce "wrong COW" on the swapin path, for example, when we cannot remove
a page from the swapcache due to concurrent writeback, or if we have two
threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
nice side-product.

This issue, including other related COW issues, has been summarized in [3]
under 2):
"
  2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)

  It was discovered that we can create a memory corruption by reading a
  file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
  concurrently writing to an unrelated part (e.g., last byte) of the same
  page, and concurrently write-protecting the page via clear_refs
  SOFTDIRTY tracking [6].

  For the reproducer, the issue is that O_DIRECT grabs a reference of the
  target page (via FOLL_GET) and clear_refs write-protects the relevant
  page table entry. On successive write access to the page from the
  process itself, we wrongly COW the page when resolving the write fault,
  resulting in a loss of synchronicity and consequently a memory corruption.

  While some people might think that using clear_refs in this combination
  is a corner case, it unfortunately turns out to be a more generic problem.

  For example, it was just recently discovered that we can similarly
  create a memory corruption without clear_refs, simply by concurrently
  swapping out the buffer pages [7]. Note that we nowadays even use the
  swap infrastructure in Linux without an actual swap disk/partition: the
  prime example is zram which is enabled as default under Fedora [10].

  The root issue is that a write-fault on a page that has additional
  references results in a COW and thereby a loss of synchronicity
  and consequently a memory corruption if two parties believe they are
  referencing the same page.
"

We don't particularly care about R/O FOLL_GET references: they were never
reliable and O_DIRECT doesn't expect to observe modifications from a page
after DMA was started.
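
The root issue quoted above can be illustrated with a small userspace toy
model. This is only a sketch under stated assumptions: every toy_* name is
hypothetical and not kernel code, and the reuse decision is a heavy
simplification of what the write-fault path actually does. It contrasts a
refcount-based decision (extra references force a copy) with a decision
that trusts a per-page exclusive flag modeled on PG_anon_exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page plus the page contents. */
struct toy_page {
	int refcount;		/* page table reference + GUP references */
	bool anon_exclusive;	/* models PG_anon_exclusive */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for a page table entry. */
struct toy_pte {
	struct toy_page *page;
	bool writable;
};

/* Models FOLL_GET: take a reference and hand out the page. */
static struct toy_page *toy_gup_get(struct toy_pte *pte)
{
	pte->page->refcount++;
	return pte->page;
}

/*
 * Models resolving a write fault on a write-protected anonymous page.
 * trust_exclusive == false: any extra reference forces a copy ("wrong COW").
 * trust_exclusive == true: an exclusive page is reused in place.
 */
static void toy_write_fault(struct toy_pte *pte, bool trust_exclusive)
{
	struct toy_page *old = pte->page;
	bool reuse = trust_exclusive ? old->anon_exclusive
				     : (old->refcount == 1);

	if (!reuse) {
		struct toy_page *copy = malloc(sizeof(*copy));

		assert(copy);
		memcpy(copy->data, old->data, TOY_PAGE_SIZE);
		copy->refcount = 1;
		copy->anon_exclusive = true;
		old->refcount--;	/* the page table drops its reference */
		pte->page = copy;
	}
	pte->writable = true;
}

/*
 * Returns true if data written through the GUP reference ("DMA") is
 * visible through the page table after an intervening write fault.
 */
bool toy_dma_visible_after_fault(bool trust_exclusive)
{
	struct toy_page *page = calloc(1, sizeof(*page));
	struct toy_pte pte = { .page = page, .writable = true };
	struct toy_page *pinned;
	bool visible;

	assert(page);
	page->refcount = 1;
	page->anon_exclusive = true;

	pinned = toy_gup_get(&pte);		/* O_DIRECT takes its reference */
	pte.writable = false;			/* e.g. clear_refs write-protects */
	toy_write_fault(&pte, trust_exclusive);	/* process writes the last byte */
	pte.page->data[TOY_PAGE_SIZE - 1] = 0xff;
	memset(pinned->data, 0xaa, 512);	/* DMA fills the first 512 bytes */

	visible = pte.page->data[0] == 0xaa;

	if (pinned != pte.page)
		free(pinned);
	free(pte.page);
	return visible;
}
```

In this model, toy_dma_visible_after_fault(false) reports the DMA data as
lost because the process ends up on a copy, while
toy_dma_visible_after_fault(true) keeps both parties on the same page.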

Note that:
* this only fixes the issue on x86, arm64, s390x and ppc64/book3s
  ("enterprise architectures"). Other architectures have to implement
  __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
* this does *not* consider any kind of fork() after taking the reference:
  fork() after GUP never worked reliably with FOLL_GET.
* Not losing PG_anon_exclusive during swapout was the last remaining
  piece. KSM already makes sure that there are no other references on
  a page before considering it for sharing. Page migration maintains
  PG_anon_exclusive and simply fails when there are additional references
  (freezing the refcount fails). Only swapout code dropped the
  PG_anon_exclusive flag because it requires more work to remember +
  restore it.
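
The idea of remembering the exclusive marker across swapout can be sketched
with a toy swap PTE encoding. This is a userspace model under stated
assumptions: the 64-bit layout, the position of TOY_SWP_EXCLUSIVE, and all
toy_* names are hypothetical; a real architecture has to find a spare bit
in its own swap PTE format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical swap PTE layout: bit 0 = present (always 0 here),
 * bit 1 = exclusive marker, bits 2..6 = swap type, bits 7.. = offset. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_SWP_EXCLUSIVE	(UINT64_C(1) << 1)
#define TOY_SWP_TYPE_SHIFT	2
#define TOY_SWP_TYPE_BITS	5
#define TOY_SWP_TYPE_MASK	((UINT64_C(1) << TOY_SWP_TYPE_BITS) - 1)
#define TOY_SWP_OFFSET_SHIFT	(TOY_SWP_TYPE_SHIFT + TOY_SWP_TYPE_BITS)

static inline toy_pte_t toy_mk_swap_pte(unsigned int type, uint64_t offset)
{
	toy_pte_t pte;

	assert(type <= TOY_SWP_TYPE_MASK);
	assert(offset < (UINT64_C(1) << (64 - TOY_SWP_OFFSET_SHIFT)));
	pte.val = ((uint64_t)type << TOY_SWP_TYPE_SHIFT) |
		  (offset << TOY_SWP_OFFSET_SHIFT);
	return pte;
}

static inline unsigned int toy_swp_type(toy_pte_t pte)
{
	return (unsigned int)((pte.val >> TOY_SWP_TYPE_SHIFT) & TOY_SWP_TYPE_MASK);
}

static inline uint64_t toy_swp_offset(toy_pte_t pte)
{
	return pte.val >> TOY_SWP_OFFSET_SHIFT;
}

/* The three helpers an architecture would provide in this model. */
static inline toy_pte_t toy_pte_swp_mkexclusive(toy_pte_t pte)
{
	pte.val |= TOY_SWP_EXCLUSIVE;
	return pte;
}

static inline bool toy_pte_swp_exclusive(toy_pte_t pte)
{
	return pte.val & TOY_SWP_EXCLUSIVE;
}

static inline toy_pte_t toy_pte_swp_clear_exclusive(toy_pte_t pte)
{
	pte.val &= ~TOY_SWP_EXCLUSIVE;
	return pte;
}

/* Hypothetical stand-in for an anonymous page's exclusive flag. */
struct toy_anon_page {
	bool anon_exclusive;	/* models PG_anon_exclusive */
};

/* Swapout: carry the flag over into the swap PTE instead of dropping it. */
static inline toy_pte_t toy_swapout(const struct toy_anon_page *page,
				    unsigned int type, uint64_t offset)
{
	toy_pte_t pte = toy_mk_swap_pte(type, offset);

	if (page->anon_exclusive)
		pte = toy_pte_swp_mkexclusive(pte);
	return pte;
}

/* Swapin: restore the flag from the swap PTE. */
static inline void toy_swapin(struct toy_anon_page *page, toy_pte_t pte)
{
	page->anon_exclusive = toy_pte_swp_exclusive(pte);
}
```

The marker lives in a bit that neither the type nor the offset field uses,
so setting or clearing it leaves the swap entry itself untouched.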

With this series in place, most COW issues of [3] are fixed on said
architectures. Other architectures can implement
__HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.

[1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

v1 -> v2:
* Rebased and retested
* "arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE"
  -> Add RB and a comment to the patch description
* "s390/pgtable: cleanup description of swp pte layout"
  -> Added
* "s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE"
  -> Use new set_pte_bit()/clear_pte_bit()
  -> Fixups comments/patch description

David Hildenbrand (8):
  mm/swap: remember PG_anon_exclusive via a swp pte bit
  mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  s390/pgtable: cleanup description of swp pte layout
  s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s

 arch/arm64/include/asm/pgtable-prot.h        |  1 +
 arch/arm64/include/asm/pgtable.h             | 23 ++++++--
 arch/powerpc/include/asm/book3s/64/pgtable.h | 31 ++++++++---
 arch/s390/include/asm/pgtable.h              | 36 +++++++++----
 arch/x86/include/asm/pgtable.h               | 16 ++++++
 arch/x86/include/asm/pgtable_64.h            |  4 +-
 arch/x86/include/asm/pgtable_types.h         |  5 ++
 include/linux/pgtable.h                      | 29 +++++++++++
 include/linux/swapops.h                      |  2 +
 mm/debug_vm_pgtable.c                        | 15 ++++++
 mm/memory.c                                  | 55 ++++++++++++++++++--
 mm/rmap.c                                    | 19 ++++---
 mm/swapfile.c                                | 13 ++++-
 13 files changed, 216 insertions(+), 33 deletions(-)

-- 
2.35.1

