* [PATCH v2 0/8] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages
@ 2022-03-29 16:43 ` David Hildenbrand
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

More information on the general COW issues can be found at [2]. This series
is based on latest linus/master and [1]:
	[PATCH v3 00/16] mm: COW fixes part 2: reliable GUP pins of
	anonymous pages

v2 is located at:
	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_3_v2


This series fixes memory corruptions that occur when a GUP R/W reference
(FOLL_WRITE | FOLL_GET) is taken on an anonymous page and COW logic fails
to detect exclusivity of the page, subsequently replacing the anonymous
page by a copy in the page table: the GUP reference loses synchronicity
with the page actually mapped into the page tables. This series focuses on
x86, arm64, s390x and ppc64/book3s -- other architectures are fairly easy
to support by implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.

This primarily fixes the O_DIRECT memory corruptions that can happen on
concurrent swapout, whereby we lose DMA reads into a user page (i.e., the
DMA writes that modify the page get lost).

O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM)
DMA from/to a user page. In the long run, we want to convert it to properly
use FOLL_PIN, and John is working on it, but that might take a while and
might not be easy to backport. In the meantime, let's restore what used to
work before we started modifying our COW logic: make R/W FOLL_GET
references reliable as long as there is no fork() after GUP involved.
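
As a rough illustration of the difference (this is not the actual O_DIRECT
code path, which obtains its pages via the iov_iter/bio helpers; the wrapper
names below are made up for the sketch), a short-term R/W FOLL_GET reference
versus the FOLL_PIN reference we eventually want could look like:

#include <linux/mm.h>

/* Hypothetical wrapper: what O_DIRECT effectively does today (FOLL_GET). */
static int dma_get_user_page(unsigned long addr, struct page **page)
{
	return get_user_pages_fast(addr, 1, FOLL_WRITE, page);
}

static void dma_put_user_page(struct page *page)
{
	put_page(page);
}

/* Hypothetical wrapper: what we eventually want instead (FOLL_PIN). */
static int dma_pin_user_page(unsigned long addr, struct page **page)
{
	return pin_user_pages_fast(addr, 1, FOLL_WRITE, page);
}

static void dma_unpin_user_page(struct page *page)
{
	unpin_user_page(page);
}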

This is the natural follow-up to part 2 and will also further reduce
"wrong COW" on the swapin path, for example, when we cannot remove a page
from the swapcache due to concurrent writeback, or when two threads fault
on the same swapped-out page. Fixing O_DIRECT is just a nice side product.

This issue, including other related COW issues, has been summarized in [3]
under 2):
"
  2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)

  It was discovered that we can create a memory corruption by reading a
  file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
  concurrently writing to an unrelated part (e.g., last byte) of the same
  page, and concurrently write-protecting the page via clear_refs
  SOFTDIRTY tracking [6].

  For the reproducer, the issue is that O_DIRECT grabs a reference of the
  target page (via FOLL_GET) and clear_refs write-protects the relevant
  page table entry. On successive write access to the page from the
  process itself, we wrongly COW the page when resolving the write fault,
  resulting in a loss of synchronicity and consequently a memory corruption.

  While some people might think that using clear_refs in this combination
  is a corner case, it unfortunately turns out to be a more generic problem.

  For example, it was just recently discovered that we can similarly
  create a memory corruption without clear_refs, simply by concurrently
  swapping out the buffer pages [7]. Note that we nowadays even use the
  swap infrastructure in Linux without an actual swap disk/partition: the
  prime example is zram which is enabled as default under Fedora [10].

  The root issue is that a write-fault on a page that has additional
  references results in a COW and thereby a loss of synchronicity
  and consequently a memory corruption if two parties believe they are
  referencing the same page.
"

We don't particularly care about R/O FOLL_GET references: they were never
reliable, and O_DIRECT doesn't expect to observe modifications of a page
after DMA has started.

Note that:
* this only fixes the issue on x86, arm64, s390x and ppc64/book3s
  ("enterprise architectures"). Other architectures have to implement
  __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
* this does *not* consider any kind of fork() after taking the reference:
  fork() after GUP never worked reliably with FOLL_GET.
* Not losing PG_anon_exclusive during swapout was the last remaining
  piece. KSM already makes sure that there are no other references on
  a page before considering it for sharing. Page migration maintains
  PG_anon_exclusive and simply fails when there are additional references
  (freezing the refcount fails). Only swapout code dropped the
  PG_anon_exclusive flag because it requires more work to remember +
  restore it.

With this series in place, most COW issues of [3] are fixed on said
architectures. Other architectures can implement
__HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
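
For reference, the architecture-specific part essentially boils down to
defining __HAVE_ARCH_PTE_SWP_EXCLUSIVE and implementing the three helpers
introduced in patch #1 on top of a pte bit that is unused in swap ptes. A
hypothetical sketch, using x86-style pte flag helpers and an invented
_PAGE_SWP_EXCLUSIVE bit:

#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
}

static inline int pte_swp_exclusive(pte_t pte)
{
	return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
}

static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
}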

[1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com

v1 -> v2:
* Rebased and retested
* "arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE"
  -> Add RB and a comment to the patch description
* "s390/pgtable: cleanup description of swp pte layout"
  -> Added
* "s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE"
  -> Use new set_pte_bit()/clear_pte_bit()
  -> Fixed up comments/patch description

David Hildenbrand (8):
  mm/swap: remember PG_anon_exclusive via a swp pte bit
  mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  s390/pgtable: cleanup description of swp pte layout
  s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s

 arch/arm64/include/asm/pgtable-prot.h        |  1 +
 arch/arm64/include/asm/pgtable.h             | 23 ++++++--
 arch/powerpc/include/asm/book3s/64/pgtable.h | 31 ++++++++---
 arch/s390/include/asm/pgtable.h              | 36 +++++++++----
 arch/x86/include/asm/pgtable.h               | 16 ++++++
 arch/x86/include/asm/pgtable_64.h            |  4 +-
 arch/x86/include/asm/pgtable_types.h         |  5 ++
 include/linux/pgtable.h                      | 29 +++++++++++
 include/linux/swapops.h                      |  2 +
 mm/debug_vm_pgtable.c                        | 15 ++++++
 mm/memory.c                                  | 55 ++++++++++++++++++--
 mm/rmap.c                                    | 19 ++++---
 mm/swapfile.c                                | 13 ++++-
 13 files changed, 216 insertions(+), 33 deletions(-)

-- 
2.35.1


* [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-03-29 16:43 ` David Hildenbrand
@ 2022-03-29 16:43   ` David Hildenbrand
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
it. We do this to keep fork() logic on swap entries easy and efficient:
for example, if we didn't clear it when unmapping, we'd have to look up
the page in the swapcache for each and every swap entry during fork() and
clear PG_anon_exclusive if set.

Instead, we want to store that information directly in the swap pte,
protected by the page table lock, similarly to how we handle
SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
swap entries, we don't want to mess with the swap type (e.g., still one
bit) because it overcomplicates swap code.

In try_to_unmap(), we already refuse to unmap a page in case it might be
pinned, because we must never lose PG_anon_exclusive on pinned pages.
Checking reliably *before* completely unmapping a page whether there are
other unexpected references is unfortunately not really possible: THP
heavily overcomplicates the situation. Once fully unmapped, it's easier --
we, for example, make sure that there are no unexpected references
*after* unmapping a page before starting writeback on that page.

So, we currently might end up unmapping a page and clearing
PG_anon_exclusive if that page has additional references, for example,
due to a FOLL_GET.

do_swap_page() has to re-determine if a page is exclusive, which will
easily fail if there are other references on a page, most prominently
GUP references via FOLL_GET. This can currently result in memory
corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
when fork() is never involved: try_to_unmap() will succeed, and when
refaulting the page, it cannot be marked exclusive and will get replaced
by a copy in the page tables on the next write access, resulting in writes
via the GUP reference to the page being lost.

In an ideal world, every GUP user that wants to modify page content,
such as O_DIRECT, would properly use FOLL_PIN. However, that conversion
will take a while. It's easier to fix what used to work in the past
(FOLL_GET | FOLL_WRITE) by remembering PG_anon_exclusive. In addition,
by remembering PG_anon_exclusive we can further reduce unnecessary COW
in some cases, so it's the natural thing to do.

So let's transfer the PG_anon_exclusive information to the swap pte and
store it via an architecture-dependent pte bit; use that information when
restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
simply have to clear the pte bit and are done.

Of course, there is one corner case to handle: swap backends that don't
support concurrent page modifications while the page is under writeback.
Special case these, and drop the exclusive marker. Add a comment explaining
why that is just fine (also, reuse_swap_page() would have done the same in
the past).

In the future, we'll hopefully have all architectures support
__HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
stubs and the define completely. Then, we can also convert
SWP_MIGRATION_READ_EXCLUSIVE. Adding support in an architecture is fairly
easy: either simply use a pte bit that is not yet used for swap entries,
steal one from the arch type bits if they exceed 5, or steal one from the
offset bits.
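
As an invented example of the latter, an architecture might document its
swap pte layout and the exclusive marker roughly like this (the bit
positions are made up for illustration):

/*
 * Hypothetical swap pte layout (not a real architecture):
 *
 *   |63 ............ 11|10 ...... 6|  5 |4 ... 1|  0 |
 *   |   swap offset    | swap type |EXCL|sw bits|  0 |  <- !present
 *
 * Bit 5 is needed neither by the offset nor by the 5-bit type, so it can
 * hold the exclusive marker without shrinking either field.
 */
#define _PAGE_SWP_EXCLUSIVE	(1UL << 5)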

Note: R/O FOLL_GET references were never really reliable, especially
when taking one on a shared page and then writing to the page (e.g., GUP
after fork()). FOLL_GET, including R/W references, were never really
reliable once fork() was involved (e.g., GUP before fork(),
GUP during fork()). KSM steps back in case it stumbles over unexpected
references and is, therefore, fine.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h | 29 ++++++++++++++++++++++
 include/linux/swapops.h |  2 ++
 mm/memory.c             | 55 ++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c               | 19 ++++++++------
 mm/swapfile.c           | 13 +++++++++-
 5 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..53750224e176 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1003,6 +1003,35 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+/*
+ * When replacing an anonymous page by a real (!non) swap entry, we clear
+ * PG_anon_exclusive from the page and instead remember whether the flag was
+ * set in the swp pte. During fork(), we have to mark the entry as !exclusive
+ * (possibly shared). On swapin, we use that information to restore
+ * PG_anon_exclusive, which is very helpful in cases where we might have
+ * additional (e.g., FOLL_GET) references on a page and wouldn't be able to
+ * detect exclusivity.
+ *
+ * These functions don't apply to non-swap entries (e.g., migration, hwpoison,
+ * ...).
+ */
+#ifndef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte;
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return false;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return pte;
+}
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 06280fc1c99b..32d517a28969 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -26,6 +26,8 @@
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
+	if (pte_swp_exclusive(pte))
+		pte = pte_swp_clear_exclusive(pte);
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
 	if (pte_swp_uffd_wp(pte))
diff --git a/mm/memory.c b/mm/memory.c
index 14618f446139..9060cc7f2123 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 						&src_mm->mmlist);
 			spin_unlock(&mmlist_lock);
 		}
+		/* Mark the swap entry as shared. */
+		if (pte_swp_exclusive(*src_pte)) {
+			pte = pte_swp_clear_exclusive(*src_pte);
+			set_pte_at(src_mm, addr, src_pte, pte);
+		}
 		rss[MM_SWAPENTS]++;
 	} else if (is_migration_entry(entry)) {
 		page = pfn_swap_entry_to_page(entry);
@@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page = NULL, *swapcache;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
@@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
 	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
 
+	/*
+	 * Check under PT lock (to protect against concurrent fork() sharing
+	 * the swap entry concurrently) for certainly exclusive pages.
+	 */
+	if (!PageKsm(page)) {
+		/*
+		 * Note that pte_swp_exclusive() == false for architectures
+		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
+		 */
+		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		if (page != swapcache) {
+			/*
+			 * We have a fresh page that is not exposed to the
+			 * swapcache -> certainly exclusive.
+			 */
+			exclusive = true;
+		} else if (exclusive && PageWriteback(page) &&
+			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
+			/*
+			 * This is tricky: not all swap backends support
+			 * concurrent page modifications while under writeback.
+			 *
+			 * So if we stumble over such a page in the swapcache
+			 * we must not set the page exclusive, otherwise we can
+			 * map it writable without further checks and modify it
+			 * while still under writeback.
+			 *
+			 * For these problematic swap backends, simply drop the
+			 * exclusive marker: this is perfectly fine as we start
+			 * writeback only if we fully unmapped the page and
+			 * there are no unexpected references on the page after
+			 * unmapping succeeded. After fully unmapped, no
+			 * further GUP references (FOLL_GET and FOLL_PIN) can
+			 * appear, so dropping the exclusive marker and mapping
+			 * it only R/O is fine.
+			 */
+			exclusive = false;
+		}
+	}
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
@@ -3738,11 +3784,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
-	 * Same logic as in do_wp_page(); however, optimize for fresh pages
-	 * that are certainly not shared because we just allocated them without
-	 * exposing them to the swapcache.
+	 * Same logic as in do_wp_page(); however, optimize for pages that are
+	 * certainly not shared either because we just allocated them without
+	 * exposing them to the swapcache or because the swap entry indicates
+	 * exclusivity.
 	 */
-	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
+	if (!PageKsm(page) && (exclusive || page_count(page) == 1)) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4de07234cbcf..c8c257d94962 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,14 +1656,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 			/*
-			 * Note: We *don't* remember yet if the page was mapped
-			 * exclusively in the swap entry, so swapin code has
-			 * to re-determine that manually and might detect the
-			 * page as possibly shared, for example, if there are
-			 * other references on the page or if the page is under
-			 * writeback. We made sure that there are no GUP pins
-			 * on the page that would rely on it, so for GUP pins
-			 * this is fine.
+			 * Note: We *don't* remember if the page was mapped
+			 * exclusively in the swap pte if the architecture
+			 * doesn't support __HAVE_ARCH_PTE_SWP_EXCLUSIVE. In
+			 * that case, swapin code has to re-determine that
+			 * manually and might detect the page as possibly
+			 * shared, for example, if there are other references on
+			 * the page or if the page is under writeback. We made
+			 * sure that there are no GUP pins on the page that
+			 * would rely on it, so for GUP pins this is fine.
 			 */
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
@@ -1674,6 +1675,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, MM_ANONPAGES);
 			inc_mm_counter(mm, MM_SWAPENTS);
 			swp_pte = swp_entry_to_pte(entry);
+			if (anon_exclusive)
+				swp_pte = pte_swp_mkexclusive(swp_pte);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			if (pte_uffd_wp(pteval))
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7847324d476..7279b2d2d71d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1804,7 +1804,18 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
+		rmap_t rmap_flags = RMAP_NONE;
+
+		/*
+		 * See do_swap_page(): PageWriteback() would be problematic.
+		 * However, we do a wait_on_page_writeback() just before this
+		 * call and have the page locked.
+		 */
+		VM_BUG_ON_PAGE(PageWriteback(page), page);
+		if (pte_swp_exclusive(*pte))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
+		page_add_anon_rmap(page, vma, addr, rmap_flags);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
@ 2022-03-29 16:43   ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Jan Kara, David Hildenbrand, Catalin Marinas, Yang Shi,
	Dave Hansen, Peter Xu, Michal Hocko, linux-mm, Donald Dutile,
	Liang Zhang, Borislav Petkov, Alexander Gordeev, Will Deacon,
	Christoph Hellwig, Paul Mackerras, Andrea Arcangeli, linux-s390,
	Vasily Gorbik, Rik van Riel, Hugh Dickins, Matthew Wilcox,
	Mike Rapoport, Ingo Molnar, linux-arm-kernel, Jason Gunthorpe,
	David Rientjes, Gerald Schaefer, Pedro Gomes, Jann Horn,
	John Hubbard, Heiko Carstens, Shakeel Butt, Thomas Gleixner,
	Vlastimil Babka, Oded Gabbay, linuxppc-dev, Oleg Nesterov,
	Nadav Amit, Andrew Morton, Linus Torvalds, Roman Gushchin,
	Kirill A . Shutemov, Mike Kravetz

Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
it. We do this, to keep fork() logic on swap entries easy and efficient:
for example, if we wouldn't clear it when unmapping, we'd have to lookup
the page in the swapcache for each and every swap entry during fork() and
clear PG_anon_exclusive if set.

Instead, we want to store that information directly in the swap pte,
protected by the page table lock, similarly to how we handle
SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
swap entries, we don't want to mess with the swap type (e.g., still one
bit) because it overcomplicates swap code.

In try_to_unmap(), we already reject to unmap in case the page might be
pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
Checking if there are other unexpected references reliably *before*
completely unmapping a page is unfortunately not really possible: THP
heavily overcomplicate the situation. Once fully unmapped it's easier --
we, for example, make sure that there are no unexpected references
*after* unmapping a page before starting writeback on that page.

So, we currently might end up unmapping a page and clearing
PG_anon_exclusive if that page has additional references, for example,
due to a FOLL_GET.

do_swap_page() has to re-determine if a page is exclusive, which will
easily fail if there are other references on a page, most prominently
GUP references via FOLL_GET. This can currently result in memory
corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
when fork() is never involved: try_to_unmap() will succeed, and when
refaulting the page, it cannot be marked exclusive and will get replaced
by a copy in the page tables on the next write access, resulting in writes
via the GUP reference to the page being lost.

In an ideal world, everybody that uses GUP and wants to modify page
content, such as O_DIRECT, would properly use FOLL_PIN. However, that
conversion will take a while. It's easier to fix what used to work in the
past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
by remembering PG_anon_exclusive we can further reduce unnecessary COW
in some cases, so it's the natural thing to do.

So let's transfer the PG_anon_exclusive information to the swap pte and
store it via an architecture-dependant pte bit; use that information when
restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
simply have to clear the pte bit and are done.

Of course, there is one corner case to handle: swap backends that don't
support concurrent page modifications while the page is under writeback.
Special case these, and drop the exclusive marker. Add a comment why that
is just fine (also, reuse_swap_page() would have done the same in the
past).

In the future, we'll hopefully have all architectures support
__HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
stubs and the define completely. Then, we can also convert
SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
support: either simply use a yet unused pte bit that can be used for swap
entries, steal one from the arch type bits if they exceed 5, or steal one
from the offset bits.

Note: R/O FOLL_GET references were never really reliable, especially
when taking one on a shared page and then writing to the page (e.g., GUP
after fork()). FOLL_GET, including R/W references, were never really
reliable once fork was involved (e.g., GUP before fork(),
GUP during fork()). KSM steps back in case it stumbles over unexpected
references and is, therefore, fine.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h | 29 ++++++++++++++++++++++
 include/linux/swapops.h |  2 ++
 mm/memory.c             | 55 ++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c               | 19 ++++++++------
 mm/swapfile.c           | 13 +++++++++-
 5 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..53750224e176 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1003,6 +1003,35 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+/*
+ * When replacing an anonymous page by a real (!non) swap entry, we clear
+ * PG_anon_exclusive from the page and instead remember whether the flag was
+ * set in the swp pte. During fork(), we have to mark the entry as !exclusive
+ * (possibly shared). On swapin, we use that information to restore
+ * PG_anon_exclusive, which is very helpful in cases where we might have
+ * additional (e.g., FOLL_GET) references on a page and wouldn't be able to
+ * detect exclusivity.
+ *
+ * These functions don't apply to non-swap entries (e.g., migration, hwpoison,
+ * ...).
+ */
+#ifndef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte;
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return false;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return pte;
+}
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 06280fc1c99b..32d517a28969 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -26,6 +26,8 @@
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
+	if (pte_swp_exclusive(pte))
+		pte = pte_swp_clear_exclusive(pte);
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
 	if (pte_swp_uffd_wp(pte))
diff --git a/mm/memory.c b/mm/memory.c
index 14618f446139..9060cc7f2123 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 						&src_mm->mmlist);
 			spin_unlock(&mmlist_lock);
 		}
+		/* Mark the swap entry as shared. */
+		if (pte_swp_exclusive(*src_pte)) {
+			pte = pte_swp_clear_exclusive(*src_pte);
+			set_pte_at(src_mm, addr, src_pte, pte);
+		}
 		rss[MM_SWAPENTS]++;
 	} else if (is_migration_entry(entry)) {
 		page = pfn_swap_entry_to_page(entry);
@@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page = NULL, *swapcache;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
@@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
 	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
 
+	/*
+	 * Check under PT lock (to protect against concurrent fork() sharing
+	 * the swap entry concurrently) for certainly exclusive pages.
+	 */
+	if (!PageKsm(page)) {
+		/*
+		 * Note that pte_swp_exclusive() == false for architectures
+		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
+		 */
+		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		if (page != swapcache) {
+			/*
+			 * We have a fresh page that is not exposed to the
+			 * swapcache -> certainly exclusive.
+			 */
+			exclusive = true;
+		} else if (exclusive && PageWriteback(page) &&
+			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
+			/*
+			 * This is tricky: not all swap backends support
+			 * concurrent page modifications while under writeback.
+			 *
+			 * So if we stumble over such a page in the swapcache
+			 * we must not set the page exclusive, otherwise we can
+			 * map it writable without further checks and modify it
+			 * while still under writeback.
+			 *
+			 * For these problematic swap backends, simply drop the
+			 * exclusive marker: this is perfectly fine as we start
+			 * writeback only if we fully unmapped the page and
+			 * there are no unexpected references on the page after
+			 * unmapping succeeded. After fully unmapped, no
+			 * further GUP references (FOLL_GET and FOLL_PIN) can
+			 * appear, so dropping the exclusive marker and mapping
+			 * it only R/O is fine.
+			 */
+			exclusive = false;
+		}
+	}
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
@@ -3738,11 +3784,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
-	 * Same logic as in do_wp_page(); however, optimize for fresh pages
-	 * that are certainly not shared because we just allocated them without
-	 * exposing them to the swapcache.
+	 * Same logic as in do_wp_page(); however, optimize for pages that are
+	 * certainly not shared either because we just allocated them without
+	 * exposing them to the swapcache or because the swap entry indicates
+	 * exclusivity.
 	 */
-	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
+	if (!PageKsm(page) && (exclusive || page_count(page) == 1)) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4de07234cbcf..c8c257d94962 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,14 +1656,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 			/*
-			 * Note: We *don't* remember yet if the page was mapped
-			 * exclusively in the swap entry, so swapin code has
-			 * to re-determine that manually and might detect the
-			 * page as possibly shared, for example, if there are
-			 * other references on the page or if the page is under
-			 * writeback. We made sure that there are no GUP pins
-			 * on the page that would rely on it, so for GUP pins
-			 * this is fine.
+			 * Note: We *don't* remember if the page was mapped
+			 * exclusively in the swap pte if the architecture
+			 * doesn't support __HAVE_ARCH_PTE_SWP_EXCLUSIVE. In
+			 * that case, swapin code has to re-determine that
+			 * manually and might detect the page as possibly
+			 * shared, for example, if there are other references on
+			 * the page or if the page is under writeback. We made
+			 * sure that there are no GUP pins on the page that
+			 * would rely on it, so for GUP pins this is fine.
 			 */
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
@@ -1674,6 +1675,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, MM_ANONPAGES);
 			inc_mm_counter(mm, MM_SWAPENTS);
 			swp_pte = swp_entry_to_pte(entry);
+			if (anon_exclusive)
+				swp_pte = pte_swp_mkexclusive(swp_pte);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			if (pte_uffd_wp(pteval))
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7847324d476..7279b2d2d71d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1804,7 +1804,18 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
+		rmap_t rmap_flags = RMAP_NONE;
+
+		/*
+		 * See do_swap_page(): PageWriteback() would be problematic.
+		 * However, we do a wait_on_page_writeback() just before this
+		 * call and have the page locked.
+		 */
+		VM_BUG_ON_PAGE(PageWriteback(page), page);
+		if (pte_swp_exclusive(*pte))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
+		page_add_anon_rmap(page, vma, addr, rmap_flags);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
@ 2022-03-29 16:43   ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
it. We do this, to keep fork() logic on swap entries easy and efficient:
for example, if we wouldn't clear it when unmapping, we'd have to lookup
the page in the swapcache for each and every swap entry during fork() and
clear PG_anon_exclusive if set.

Instead, we want to store that information directly in the swap pte,
protected by the page table lock, similarly to how we handle
SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
swap entries, we don't want to mess with the swap type (e.g., still one
bit) because it overcomplicates swap code.

In try_to_unmap(), we already reject to unmap in case the page might be
pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
Checking if there are other unexpected references reliably *before*
completely unmapping a page is unfortunately not really possible: THP
heavily overcomplicate the situation. Once fully unmapped it's easier --
we, for example, make sure that there are no unexpected references
*after* unmapping a page before starting writeback on that page.

So, we currently might end up unmapping a page and clearing
PG_anon_exclusive if that page has additional references, for example,
due to a FOLL_GET.

do_swap_page() has to re-determine if a page is exclusive, which will
easily fail if there are other references on a page, most prominently
GUP references via FOLL_GET. This can currently result in memory
corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
when fork() is never involved: try_to_unmap() will succeed, and when
refaulting the page, it cannot be marked exclusive and will get replaced
by a copy in the page tables on the next write access, resulting in writes
via the GUP reference to the page being lost.

In an ideal world, everybody that uses GUP and wants to modify page
content, such as O_DIRECT, would properly use FOLL_PIN. However, that
conversion will take a while. It's easier to fix what used to work in the
past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
by remembering PG_anon_exclusive we can further reduce unnecessary COW
in some cases, so it's the natural thing to do.

So let's transfer the PG_anon_exclusive information to the swap pte and
store it via an architecture-dependant pte bit; use that information when
restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
simply have to clear the pte bit and are done.

Of course, there is one corner case to handle: swap backends that don't
support concurrent page modifications while the page is under writeback.
Special case these, and drop the exclusive marker. Add a comment why that
is just fine (also, reuse_swap_page() would have done the same in the
past).

In the future, we'll hopefully have all architectures support
__HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
stubs and the define completely. Then, we can also convert
SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
support: either simply use a yet unused pte bit that can be used for swap
entries, steal one from the arch type bits if they exceed 5, or steal one
from the offset bits.

Note: R/O FOLL_GET references were never really reliable, especially
when taking one on a shared page and then writing to the page (e.g., GUP
after fork()). FOLL_GET, including R/W references, were never really
reliable once fork was involved (e.g., GUP before fork(),
GUP during fork()). KSM steps back in case it stumbles over unexpected
references and is, therefore, fine.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h | 29 ++++++++++++++++++++++
 include/linux/swapops.h |  2 ++
 mm/memory.c             | 55 ++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c               | 19 ++++++++------
 mm/swapfile.c           | 13 +++++++++-
 5 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..53750224e176 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1003,6 +1003,35 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+/*
+ * When replacing an anonymous page by a real (!non) swap entry, we clear
+ * PG_anon_exclusive from the page and instead remember whether the flag was
+ * set in the swp pte. During fork(), we have to mark the entry as !exclusive
+ * (possibly shared). On swapin, we use that information to restore
+ * PG_anon_exclusive, which is very helpful in cases where we might have
+ * additional (e.g., FOLL_GET) references on a page and wouldn't be able to
+ * detect exclusivity.
+ *
+ * These functions don't apply to non-swap entries (e.g., migration, hwpoison,
+ * ...).
+ */
+#ifndef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte;
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return false;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return pte;
+}
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 06280fc1c99b..32d517a28969 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -26,6 +26,8 @@
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
+	if (pte_swp_exclusive(pte))
+		pte = pte_swp_clear_exclusive(pte);
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
 	if (pte_swp_uffd_wp(pte))
diff --git a/mm/memory.c b/mm/memory.c
index 14618f446139..9060cc7f2123 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 						&src_mm->mmlist);
 			spin_unlock(&mmlist_lock);
 		}
+		/* Mark the swap entry as shared. */
+		if (pte_swp_exclusive(*src_pte)) {
+			pte = pte_swp_clear_exclusive(*src_pte);
+			set_pte_at(src_mm, addr, src_pte, pte);
+		}
 		rss[MM_SWAPENTS]++;
 	} else if (is_migration_entry(entry)) {
 		page = pfn_swap_entry_to_page(entry);
@@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page = NULL, *swapcache;
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
@@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
 	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
 
+	/*
+	 * Check under PT lock (to protect against concurrent fork() sharing
+	 * the swap entry concurrently) for certainly exclusive pages.
+	 */
+	if (!PageKsm(page)) {
+		/*
+		 * Note that pte_swp_exclusive() == false for architectures
+		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
+		 */
+		exclusive = pte_swp_exclusive(vmf->orig_pte);
+		if (page != swapcache) {
+			/*
+			 * We have a fresh page that is not exposed to the
+			 * swapcache -> certainly exclusive.
+			 */
+			exclusive = true;
+		} else if (exclusive && PageWriteback(page) &&
+			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
+			/*
+			 * This is tricky: not all swap backends support
+			 * concurrent page modifications while under writeback.
+			 *
+			 * So if we stumble over such a page in the swapcache
+			 * we must not set the page exclusive, otherwise we can
+			 * map it writable without further checks and modify it
+			 * while still under writeback.
+			 *
+			 * For these problematic swap backends, simply drop the
+			 * exclusive marker: this is perfectly fine as we start
+			 * writeback only if we fully unmapped the page and
+			 * there are no unexpected references on the page after
+			 * unmapping succeeded. After fully unmapped, no
+			 * further GUP references (FOLL_GET and FOLL_PIN) can
+			 * appear, so dropping the exclusive marker and mapping
+			 * it only R/O is fine.
+			 */
+			exclusive = false;
+		}
+	}
+
 	/*
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * We're already holding a reference on the page but haven't mapped it
@@ -3738,11 +3784,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
-	 * Same logic as in do_wp_page(); however, optimize for fresh pages
-	 * that are certainly not shared because we just allocated them without
-	 * exposing them to the swapcache.
+	 * Same logic as in do_wp_page(); however, optimize for pages that are
+	 * certainly not shared either because we just allocated them without
+	 * exposing them to the swapcache or because the swap entry indicates
+	 * exclusivity.
 	 */
-	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
+	if (!PageKsm(page) && (exclusive || page_count(page) == 1)) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4de07234cbcf..c8c257d94962 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,14 +1656,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 			/*
-			 * Note: We *don't* remember yet if the page was mapped
-			 * exclusively in the swap entry, so swapin code has
-			 * to re-determine that manually and might detect the
-			 * page as possibly shared, for example, if there are
-			 * other references on the page or if the page is under
-			 * writeback. We made sure that there are no GUP pins
-			 * on the page that would rely on it, so for GUP pins
-			 * this is fine.
+			 * Note: We *don't* remember if the page was mapped
+			 * exclusively in the swap pte if the architecture
+			 * doesn't support __HAVE_ARCH_PTE_SWP_EXCLUSIVE. In
+			 * that case, swapin code has to re-determine that
+			 * manually and might detect the page as possibly
+			 * shared, for example, if there are other references on
+			 * the page or if the page is under writeback. We made
+			 * sure that there are no GUP pins on the page that
+			 * would rely on it, so for GUP pins this is fine.
 			 */
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
@@ -1674,6 +1675,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			dec_mm_counter(mm, MM_ANONPAGES);
 			inc_mm_counter(mm, MM_SWAPENTS);
 			swp_pte = swp_entry_to_pte(entry);
+			if (anon_exclusive)
+				swp_pte = pte_swp_mkexclusive(swp_pte);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			if (pte_uffd_wp(pteval))
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7847324d476..7279b2d2d71d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1804,7 +1804,18 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
+		rmap_t rmap_flags = RMAP_NONE;
+
+		/*
+		 * See do_swap_page(): PageWriteback() would be problematic.
+		 * However, we do a wait_on_page_writeback() just before this
+		 * call and have the page locked.
+		 */
+		VM_BUG_ON_PAGE(PageWriteback(page), page);
+		if (pte_swp_exclusive(*pte))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
+		page_add_anon_rmap(page, vma, addr, rmap_flags);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-- 
2.35.1
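
To make the lifecycle above concrete, here is a minimal userspace model
(an illustrative sketch only, not kernel code; the bit position and the
helper names merely mimic the pte_swp_mkexclusive()/
pte_swp_clear_exclusive()/pte_swp_exclusive() calls in the diff): the
marker is set when try_to_unmap_one() installs the swap pte for an
exclusive anonymous page, dropped when copy_nonpresent_pte() shares the
entry during fork(), and consulted by do_swap_page() to decide whether
the page may be mapped writable right away.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SWP_EXCLUSIVE	(1ull << 2)	/* assumed software bit; arch-specific in reality */

typedef uint64_t swp_pte_t;

static swp_pte_t swp_mkexclusive(swp_pte_t pte)     { return pte | SWP_EXCLUSIVE; }
static swp_pte_t swp_clear_exclusive(swp_pte_t pte) { return pte & ~SWP_EXCLUSIVE; }
static bool swp_exclusive(swp_pte_t pte)            { return pte & SWP_EXCLUSIVE; }

int main(void)
{
	swp_pte_t parent = 0x1000;		/* some swap entry */

	/* try_to_unmap_one(): the page was exclusive when it was unmapped */
	parent = swp_mkexclusive(parent);
	assert(swp_exclusive(parent));

	/* copy_nonpresent_pte() on fork(): the entry is now shared */
	swp_pte_t child = swp_clear_exclusive(parent);
	parent = swp_clear_exclusive(parent);

	/* do_swap_page(): neither side may assume exclusivity anymore */
	assert(!swp_exclusive(parent) && !swp_exclusive(child));
	return 0;
}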


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 2/8] mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/debug_vm_pgtable.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index db2abd9e415b..55f1a8dc716f 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -837,6 +837,19 @@ static void __init pmd_soft_dirty_tests(struct pgtable_debug_args *args) { }
 static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) { }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)
+{
+#ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
+
+	pr_debug("Validating PTE swap exclusive\n");
+	pte = pte_swp_mkexclusive(pte);
+	WARN_ON(!pte_swp_exclusive(pte));
+	pte = pte_swp_clear_exclusive(pte);
+	WARN_ON(pte_swp_exclusive(pte));
+#endif /* __HAVE_ARCH_PTE_SWP_EXCLUSIVE */
+}
+
 static void __init pte_swap_tests(struct pgtable_debug_args *args)
 {
 	swp_entry_t swp;
@@ -1288,6 +1301,8 @@ static int __init debug_vm_pgtable(void)
 	pte_swap_soft_dirty_tests(&args);
 	pmd_swap_soft_dirty_tests(&args);
 
+	pte_swap_exclusive_tests(&args);
+
 	pte_swap_tests(&args);
 	pmd_swap_tests(&args);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 3/8] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Let's use bit 3 to remember PG_anon_exclusive in swap ptes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/pgtable.h       | 16 ++++++++++++++++
 arch/x86/include/asm/pgtable_64.h    |  4 +++-
 arch/x86/include/asm/pgtable_types.h |  5 +++++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..e42e668153e9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1292,6 +1292,22 @@ static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
 {
 }
 
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 56d0399a0cd1..e479491da8d5 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -186,7 +186,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| E|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -203,6 +203,8 @@ static inline void native_pgd_clear(pgd_t *pgd)
  * F (2) in swp entry is used to record when a pagetable is
  * writeprotected by userfaultfd WP support.
  *
+ * E (3) in swp entry is used to remember PG_anon_exclusive.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..54a8f370046d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -83,6 +83,11 @@
 #define _PAGE_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * We borrow bit 3 to remember PG_anon_exclusive.
+ */
+#define _PAGE_SWP_EXCLUSIVE	_PAGE_PWT
+
 /*
  * Tracking soft dirty bit when a page goes to a swap is tricky.
  * We need a bit which can be stored in pte _and_ not conflict
-- 
2.35.1
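
To double-check that borrowing bit 3 cannot clobber anything, the masks
from the swp-entry diagram above can be written down and tested in a
standalone sketch (the mask names are illustrative; only the bit
positions are taken from the diagram, nothing here is the kernel's
actual definition):

#include <assert.h>
#include <stdint.h>

#define SWP_SD		(1ull << 1)			/* soft-dirty */
#define SWP_F		(1ull << 2)			/* uffd-wp */
#define SWP_E		(1ull << 3)			/* PG_anon_exclusive (new) */
#define SWP_OFFSET	(((1ull << 50) - 1) << 9)	/* ~offset, bits 9-58 */
#define SWP_TYPE	(0x1full << 59)			/* type, bits 59-63 */

int main(void)
{
	/* the new E bit must not overlap any field that is already in use */
	assert((SWP_E & (SWP_SD | SWP_F | SWP_OFFSET | SWP_TYPE)) == 0);
	return 0;
}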


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 4/8] arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Let's use one of the type bits: core-mm only supports 5, so there is no
need to consume 6.

Note that we might be able to reuse bit 1, but reusing bit 1 turned out
problematic in the past for PROT_NONE handling; so let's play safe and
use another bit.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/include/asm/pgtable-prot.h |  1 +
 arch/arm64/include/asm/pgtable.h      | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index b1e1b74d993c..62e0ebeed720 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -14,6 +14,7 @@
  * Software defined PTE bits definition.
  */
 #define PTE_WRITE		(PTE_DBM)		 /* same as DBM (51) */
+#define PTE_SWP_EXCLUSIVE	(_AT(pteval_t, 1) << 2)	 /* only for swp ptes */
 #define PTE_DIRTY		(_AT(pteval_t, 1) << 55)
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 94e147e5456c..ad9b221963d4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -402,6 +402,22 @@ static inline pgprot_t mk_pmd_sect_prot(pgprot_t prot)
 	return __pgprot((pgprot_val(prot) & ~PMD_TABLE_BIT) | PMD_TYPE_SECT);
 }
 
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return set_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return pte_val(pte) & PTE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * See the comment in include/linux/pgtable.h
@@ -909,12 +925,13 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 /*
  * Encode and decode a swap entry:
  *	bits 0-1:	present (must be zero)
- *	bits 2-7:	swap type
+ *	bit  2:		remember PG_anon_exclusive
+ *	bits 3-7:	swap type
  *	bits 8-57:	swap offset
  *	bit  58:	PTE_PROT_NONE (must be zero)
  */
-#define __SWP_TYPE_SHIFT	2
-#define __SWP_TYPE_BITS		6
+#define __SWP_TYPE_SHIFT	3
+#define __SWP_TYPE_BITS		5
 #define __SWP_OFFSET_BITS	50
 #define __SWP_TYPE_MASK		((1 << __SWP_TYPE_BITS) - 1)
 #define __SWP_OFFSET_SHIFT	(__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
-- 
2.35.1
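
The new field layout can be checked in isolation with a standalone
sketch; __SWP_TYPE_SHIFT, __SWP_TYPE_BITS and the bit-2 exclusive
marker are copied from the diff above, while the sample type/offset
values are arbitrary (this is an illustration, not kernel code):

#include <assert.h>
#include <stdint.h>

#define SWP_TYPE_SHIFT		3
#define SWP_TYPE_BITS		5
#define SWP_TYPE_MASK		((1ull << SWP_TYPE_BITS) - 1)
#define SWP_OFFSET_SHIFT	(SWP_TYPE_BITS + SWP_TYPE_SHIFT)	/* 8 */
#define SWP_EXCLUSIVE		(1ull << 2)

static uint64_t swp_entry(uint64_t type, uint64_t offset)
{
	return (type << SWP_TYPE_SHIFT) | (offset << SWP_OFFSET_SHIFT);
}

int main(void)
{
	uint64_t e = swp_entry(17, 0x12345) | SWP_EXCLUSIVE;

	/* type and offset survive the exclusive marker unchanged */
	assert(((e >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK) == 17);
	assert((e >> SWP_OFFSET_SHIFT) == 0x12345);
	assert(e & SWP_EXCLUSIVE);
	return 0;
}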


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 5/8] s390/pgtable: cleanup description of swp pte layout
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Bits 52 and 55 don't have to be zero: they only trigger a
translation-specification exception if the PTE is marked as valid, which
is not the case for swap ptes.

Document which bits are used for what, and which ones are unused.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/pgtable.h | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 9df679152620..3982575bb586 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1712,18 +1712,17 @@ static inline int has_transparent_hugepage(void)
 /*
  * 64 bit swap entry format:
  * A page-table entry has some bits we have to treat in a special way.
- * Bits 52 and bit 55 have to be zero, otherwise a specification
- * exception will occur instead of a page translation exception. The
- * specification exception has the bad habit not to store necessary
- * information in the lowcore.
- * Bits 54 and 63 are used to indicate the page type.
+ * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
+ * as invalid.
  * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
- * This leaves the bits 0-51 and bits 56-62 to store type and offset.
- * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
- * for the offset.
- * |			  offset			|01100|type |00|
+ * |			  offset			|X11XX|type |S0|
  * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
  * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
+ *
+ * Bits 0-51 store the offset.
+ * Bits 57-61 store the type.
+ * Bit 62 (S) is used for softdirty tracking.
+ * Bits 52, 55 and 56 (X) are unused.
  */
 
 #define __SWP_OFFSET_MASK	((1UL << 52) - 1)
-- 
2.35.1
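
Keep in mind that s390 numbers pte bits from the most significant end,
so "bit 54" in the comment corresponds to the value 1UL << 9. A small
standalone sketch of the indication rule, derived purely from the
comment above (illustrative only, not the kernel's definitions):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* s390 numbers bits 0 (MSB) .. 63 (LSB) */
#define S390_BIT(n)	(1ull << (63 - (n)))

static bool is_swap_pte(uint64_t pte)
{
	/* "A swap pte is indicated by bit pattern (pte & 0x201) == 0x200" */
	return (pte & 0x201) == 0x200;
}

int main(void)
{
	assert(S390_BIT(54) == 0x200);	/* page-type bit that must be set */
	assert(S390_BIT(63) == 0x001);	/* page-type bit that must be clear */

	uint64_t swp = S390_BIT(54) | S390_BIT(53);	/* type bit + invalid bit */
	assert(is_swap_pte(swp));
	return 0;
}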


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 6/8] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Let's use bit 52, which is unused.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/pgtable.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3982575bb586..a397b072a580 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
 #define _PAGE_SOFT_DIRTY 0x000
 #endif
 
+#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE	/* SW pte exclusive swap bit */
+
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_DIRTY | \
 				 _PAGE_YOUNG | _PAGE_SOFT_DIRTY)
@@ -826,6 +828,22 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return set_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
+}
+
 static inline int pte_soft_dirty(pte_t pte)
 {
 	return pte_val(pte) & _PAGE_SOFT_DIRTY;
@@ -1715,14 +1733,15 @@ static inline int has_transparent_hugepage(void)
  * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
  * as invalid.
  * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
- * |			  offset			|X11XX|type |S0|
+ * |			  offset			|E11XX|type |S0|
  * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
  * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
  *
  * Bits 0-51 store the offset.
+ * Bit 52 (E) is used to remember PG_anon_exclusive.
  * Bits 57-61 store the type.
  * Bit 62 (S) is used for softdirty tracking.
- * Bits 52, 55 and 56 (X) are unused.
+ * Bits 55 and 56 (X) are unused.
  */
 
 #define __SWP_OFFSET_MASK	((1UL << 52) - 1)
-- 
2.35.1
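
Purely as an illustration (a userspace sketch with masks derived from
the layout comment above, not the kernel's _PAGE_* defines), bit 52
sits between the offset field and the type/soft-dirty bits and overlaps
none of them:

#include <assert.h>
#include <stdint.h>

/* s390 numbers bits 0 (MSB) .. 63 (LSB) */
#define S390_BIT(n)	(1ull << (63 - (n)))

#define SWP_EXCLUSIVE	S390_BIT(52)			/* the previously unused bit */
#define SWP_OFFSET	(((1ull << 52) - 1) << 12)	/* offset, bits 0-51 */
#define SWP_TYPE	(0x1full << 2)			/* type, bits 57-61 */
#define SWP_SOFT_DIRTY	S390_BIT(62)

int main(void)
{
	/* the exclusive marker must not overlap any field of the swp pte layout */
	assert((SWP_EXCLUSIVE & (SWP_OFFSET | SWP_TYPE | SWP_SOFT_DIRTY)) == 0);
	return 0;
}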


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 7/8] powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

The swap type is simply stored in the low five bits (mask 0x1f) of the
swap pte. Let's simplify by just getting rid of _PAGE_BIT_SWAP_TYPE.
Not that we could simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly
fall into _RPAGE_RSV1, which isn't possible and would make the
BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.

While at it, make it clearer which bit we're actually using for
_PAGE_SWP_SOFT_DIRTY by using the proper define, and introduce and use
SWP_TYPE_MASK.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 875730d5af40..8e98375d5c4a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -13,7 +13,6 @@
 /*
  * Common bits between hash and Radix page table
  */
-#define _PAGE_BIT_SWAP_TYPE	0
 
 #define _PAGE_EXEC		0x00001 /* execute permission */
 #define _PAGE_WRITE		0x00002 /* write access allowed */
@@ -751,17 +750,16 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 * Don't have overlapping bits with _PAGE_HPTEFLAGS	\
 	 * We filter HPTEFLAGS on set_pte.			\
 	 */							\
-	BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
+	BUILD_BUG_ON(_PAGE_HPTEFLAGS & SWP_TYPE_MASK); \
 	BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY);	\
 	} while (0)
 
 #define SWP_TYPE_BITS 5
-#define __swp_type(x)		(((x).val >> _PAGE_BIT_SWAP_TYPE) \
-				& ((1UL << SWP_TYPE_BITS) - 1))
+#define SWP_TYPE_MASK		((1UL << SWP_TYPE_BITS) - 1)
+#define __swp_type(x)		((x).val & SWP_TYPE_MASK)
 #define __swp_offset(x)		(((x).val & PTE_RPN_MASK) >> PAGE_SHIFT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
-				((type) << _PAGE_BIT_SWAP_TYPE) \
-				| (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
+				(type) | (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
 /*
  * swp_entry_t must be independent of pte bits. We build a swp_entry_t from
  * swap type and offset we get from swap and convert that to pte to find a
@@ -774,7 +772,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 #define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_NON_IDEMPOTENT
 #else
 #define _PAGE_SWP_SOFT_DIRTY	0UL
 #endif /* CONFIG_MEM_SOFT_DIRTY */
-- 
2.35.1
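
Since _PAGE_BIT_SWAP_TYPE was 0, shifting by it was always a no-op, so
the simplified macros encode bit-identical swap entries. A standalone
sketch demonstrating this (PAGE_SHIFT and the RPN mask below are
illustrative placeholders, not the real book3s64 values):

#include <assert.h>
#include <stdint.h>

#define SWP_TYPE_BITS	5
#define SWP_TYPE_MASK	((1ull << SWP_TYPE_BITS) - 1)
#define PAGE_SHIFT	12			/* illustrative */
#define RPN_MASK	(~0ull << PAGE_SHIFT)	/* stand-in for PTE_RPN_MASK */

/* old encoding: type shifted by _PAGE_BIT_SWAP_TYPE (== 0) */
static uint64_t swp_entry_old(uint64_t type, uint64_t offset)
{
	return (type << 0) | ((offset << PAGE_SHIFT) & RPN_MASK);
}

/* new encoding: type stored directly in the low SWP_TYPE_BITS */
static uint64_t swp_entry_new(uint64_t type, uint64_t offset)
{
	return type | ((offset << PAGE_SHIFT) & RPN_MASK);
}

int main(void)
{
	for (uint64_t type = 0; type <= SWP_TYPE_MASK; type++) {
		uint64_t o = swp_entry_old(type, 0xabcd);
		uint64_t n = swp_entry_new(type, 0xabcd);

		assert(o == n);				/* identical bit pattern */
		assert((n & SWP_TYPE_MASK) == type);	/* __swp_type() */
	}
	return 0;
}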


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 7/8] powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
@ 2022-03-29 16:43   ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

The swap type is simply stored in bits 0x1f of the swap pte. Let's
simplify by just getting rid of _PAGE_BIT_SWAP_TYPE. It's not like that
we can simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly fall into
_RPAGE_RSV1, which isn't possible and would make the
BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.

While at it, make it clearer which bit we're actually using for
_PAGE_SWP_SOFT_DIRTY by just using the proper define and introduce and
use SWP_TYPE_MASK.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 875730d5af40..8e98375d5c4a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -13,7 +13,6 @@
 /*
  * Common bits between hash and Radix page table
  */
-#define _PAGE_BIT_SWAP_TYPE	0
 
 #define _PAGE_EXEC		0x00001 /* execute permission */
 #define _PAGE_WRITE		0x00002 /* write access allowed */
@@ -751,17 +750,16 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 * Don't have overlapping bits with _PAGE_HPTEFLAGS	\
 	 * We filter HPTEFLAGS on set_pte.			\
 	 */							\
-	BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
+	BUILD_BUG_ON(_PAGE_HPTEFLAGS & SWP_TYPE_MASK); \
 	BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY);	\
 	} while (0)
 
 #define SWP_TYPE_BITS 5
-#define __swp_type(x)		(((x).val >> _PAGE_BIT_SWAP_TYPE) \
-				& ((1UL << SWP_TYPE_BITS) - 1))
+#define SWP_TYPE_MASK		((1UL << SWP_TYPE_BITS) - 1)
+#define __swp_type(x)		((x).val & SWP_TYPE_MASK)
 #define __swp_offset(x)		(((x).val & PTE_RPN_MASK) >> PAGE_SHIFT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
-				((type) << _PAGE_BIT_SWAP_TYPE) \
-				| (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
+				(type) | (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
 /*
  * swp_entry_t must be independent of pte bits. We build a swp_entry_t from
  * swap type and offset we get from swap and convert that to pte to find a
@@ -774,7 +772,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 #define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_NON_IDEMPOTENT
 #else
 #define _PAGE_SWP_SOFT_DIRTY	0UL
 #endif /* CONFIG_MEM_SOFT_DIRTY */
-- 
2.35.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 8/8] powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s
  2022-03-29 16:43 ` David Hildenbrand
  (?)
@ 2022-03-29 16:43   ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-29 16:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390, David Hildenbrand

Right now, the last 5 bits (0x1f) of the swap entry are used for the
type and the bit above them (0x20) is used for _PAGE_SWP_SOFT_DIRTY. We
cannot use 0x40, as that collides with _RPAGE_RSV1 -- contained in
_PAGE_HPTEFLAGS. The next candidate would be _RPAGE_SW3 (0x200) -- which is
used for _PAGE_SOFT_DIRTY for !swp ptes.

So let's just use _PAGE_SOFT_DIRTY for _PAGE_SWP_SOFT_DIRTY (to make it
easier to grasp) and use 0x20 now for _PAGE_SWP_EXCLUSIVE.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 21 +++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8e98375d5c4a..eecff2036869 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -752,6 +752,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */							\
 	BUILD_BUG_ON(_PAGE_HPTEFLAGS & SWP_TYPE_MASK); \
 	BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY);	\
+	BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_EXCLUSIVE);	\
 	} while (0)
 
 #define SWP_TYPE_BITS 5
@@ -772,11 +773,13 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 #define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
-#define _PAGE_SWP_SOFT_DIRTY	_PAGE_NON_IDEMPOTENT
+#define _PAGE_SWP_SOFT_DIRTY	_PAGE_SOFT_DIRTY
 #else
 #define _PAGE_SWP_SOFT_DIRTY	0UL
 #endif /* CONFIG_MEM_SOFT_DIRTY */
 
+#define _PAGE_SWP_EXCLUSIVE	_PAGE_NON_IDEMPOTENT
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
 {
@@ -794,6 +797,22 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 }
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return !!(pte_raw(pte) & cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_SWP_EXCLUSIVE));
+}
+
 static inline bool check_pte_access(unsigned long access, unsigned long ptev)
 {
 	/*
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread
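
Summarizing the bit usage after this patch as a user-space model -- the real
helpers above operate on the big-endian raw pte via pte_raw()/__pte_raw(),
which is omitted here, and the numeric values follow from the commit message
(type mask 0x1f, _PAGE_NON_IDEMPOTENT 0x20, _RPAGE_SW3 0x200):

  #include <assert.h>
  #include <stdint.h>

  #define SWP_TYPE_MASK           0x1fUL  /* bits 0-4: swap type */
  #define _PAGE_SWP_EXCLUSIVE     0x20UL  /* _PAGE_NON_IDEMPOTENT: PG_anon_exclusive */
  #define _PAGE_SWP_SOFT_DIRTY    0x200UL /* _PAGE_SOFT_DIRTY (_RPAGE_SW3) */

  /* plain-integer stand-ins for pte_swp_mkexclusive() and friends */
  static uint64_t mkexclusive(uint64_t pte)     { return pte | _PAGE_SWP_EXCLUSIVE; }
  static int is_exclusive(uint64_t pte)         { return !!(pte & _PAGE_SWP_EXCLUSIVE); }
  static uint64_t clear_exclusive(uint64_t pte) { return pte & ~_PAGE_SWP_EXCLUSIVE; }

  int main(void)
  {
          /* the three markers live on disjoint bits, so they never clobber the type */
          assert(!(SWP_TYPE_MASK & _PAGE_SWP_EXCLUSIVE));
          assert(!(SWP_TYPE_MASK & _PAGE_SWP_SOFT_DIRTY));
          assert(!(_PAGE_SWP_EXCLUSIVE & _PAGE_SWP_SOFT_DIRTY));

          uint64_t pte = 0x1b;                    /* an arbitrary swap type */
          pte = mkexclusive(pte);
          assert(is_exclusive(pte));
          pte = clear_exclusive(pte);
          assert(!is_exclusive(pte) && (pte & SWP_TYPE_MASK) == 0x1b);
          return 0;
  }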

* Re: [PATCH v2 7/8] powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  2022-03-29 16:43   ` David Hildenbrand
  (?)
@ 2022-03-30  6:07     ` Christophe Leroy
  -1 siblings, 0 replies; 57+ messages in thread
From: Christophe Leroy @ 2022-03-30  6:07 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: x86, Jan Kara, Catalin Marinas, Yang Shi, Dave Hansen, Peter Xu,
	Michal Hocko, linux-mm, Donald Dutile, Liang Zhang,
	Borislav Petkov, Alexander Gordeev, Will Deacon,
	Christoph Hellwig, Paul Mackerras, Andrea Arcangeli, linux-s390,
	Vasily Gorbik, Rik van Riel, Hugh Dickins, Matthew Wilcox,
	Mike Rapoport, Ingo Molnar, linux-arm-kernel, Jason Gunthorpe,
	David Rientjes, Gerald Schaefer, Pedro Gomes, Jann Horn,
	John Hubbard, Heiko Carstens, Shakeel Butt, Thomas Gleixner,
	Vlastimil Babka, Oded Gabbay, linuxppc-dev, Oleg Nesterov,
	Nadav Amit, Andrew Morton, Linus Torvalds, Roman Gushchin,
	Kirill A . Shutemov, Mike Kravetz



On 29/03/2022 at 18:43, David Hildenbrand wrote:
> The swap type is simply stored in bits 0x1f of the swap pte. Let's
> simplify by just getting rid of _PAGE_BIT_SWAP_TYPE. It's not like that
> we can simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly fall into
> _RPAGE_RSV1, which isn't possible and would make the
> BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.
> 
> While at it, make it clearer which bit we're actually using for
> _PAGE_SWP_SOFT_DIRTY by just using the proper define and introduce and
> use SWP_TYPE_MASK.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++++-------

Why only BOOK3S? Why not BOOK3E as well?

Christophe

>   1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 875730d5af40..8e98375d5c4a 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -13,7 +13,6 @@
>   /*
>    * Common bits between hash and Radix page table
>    */
> -#define _PAGE_BIT_SWAP_TYPE	0
>   
>   #define _PAGE_EXEC		0x00001 /* execute permission */
>   #define _PAGE_WRITE		0x00002 /* write access allowed */
> @@ -751,17 +750,16 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>   	 * Don't have overlapping bits with _PAGE_HPTEFLAGS	\
>   	 * We filter HPTEFLAGS on set_pte.			\
>   	 */							\
> -	BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
> +	BUILD_BUG_ON(_PAGE_HPTEFLAGS & SWP_TYPE_MASK); \
>   	BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY);	\
>   	} while (0)
>   
>   #define SWP_TYPE_BITS 5
> -#define __swp_type(x)		(((x).val >> _PAGE_BIT_SWAP_TYPE) \
> -				& ((1UL << SWP_TYPE_BITS) - 1))
> +#define SWP_TYPE_MASK		((1UL << SWP_TYPE_BITS) - 1)
> +#define __swp_type(x)		((x).val & SWP_TYPE_MASK)
>   #define __swp_offset(x)		(((x).val & PTE_RPN_MASK) >> PAGE_SHIFT)
>   #define __swp_entry(type, offset)	((swp_entry_t) { \
> -				((type) << _PAGE_BIT_SWAP_TYPE) \
> -				| (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
> +				(type) | (((offset) << PAGE_SHIFT) & PTE_RPN_MASK)})
>   /*
>    * swp_entry_t must be independent of pte bits. We build a swp_entry_t from
>    * swap type and offset we get from swap and convert that to pte to find a
> @@ -774,7 +772,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>   #define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
>   
>   #ifdef CONFIG_MEM_SOFT_DIRTY
> -#define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
> +#define _PAGE_SWP_SOFT_DIRTY	_PAGE_NON_IDEMPOTENT
>   #else
>   #define _PAGE_SWP_SOFT_DIRTY	0UL
>   #endif /* CONFIG_MEM_SOFT_DIRTY */

^ permalink raw reply	[flat|nested] 57+ messages in thread
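
Independent of the book3s/book3e question, the simplification in the quoted hunk
is easy to sanity-check in isolation. A user-space sketch of the encode/decode
round trip; PAGE_SHIFT and PTE_RPN_MASK are placeholder values for illustration
only, not the kernel's definitions:

  #include <assert.h>
  #include <stdint.h>

  #define SWP_TYPE_BITS   5
  #define SWP_TYPE_MASK   ((1UL << SWP_TYPE_BITS) - 1)
  #define PAGE_SHIFT      12                      /* placeholder */
  #define PTE_RPN_MASK    0x01fffffffffff000UL    /* placeholder */

  /* mirrors the simplified __swp_entry(): type in the low 5 bits, no shift */
  static uint64_t swp_entry(uint64_t type, uint64_t offset)
  {
          return type | ((offset << PAGE_SHIFT) & PTE_RPN_MASK);
  }

  int main(void)
  {
          uint64_t val = swp_entry(0x1b, 0x1234);

          assert((val & SWP_TYPE_MASK) == 0x1b);                   /* __swp_type() */
          assert(((val & PTE_RPN_MASK) >> PAGE_SHIFT) == 0x1234);  /* __swp_offset() */
          return 0;
  }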

* Re: [PATCH v2 7/8] powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  2022-03-30  6:07     ` Christophe Leroy
  (?)
@ 2022-03-30  6:58       ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-03-30  6:58 UTC (permalink / raw)
  To: Christophe Leroy, linux-kernel
  Cc: x86, Jan Kara, Catalin Marinas, Yang Shi, Dave Hansen, Peter Xu,
	Michal Hocko, linux-mm, Donald Dutile, Liang Zhang,
	Borislav Petkov, Alexander Gordeev, Will Deacon,
	Christoph Hellwig, Paul Mackerras, Andrea Arcangeli, linux-s390,
	Vasily Gorbik, Rik van Riel, Hugh Dickins, Matthew Wilcox,
	Mike Rapoport, Ingo Molnar, linux-arm-kernel, Jason Gunthorpe,
	David Rientjes, Gerald Schaefer, Pedro Gomes, Jann Horn,
	John Hubbard, Heiko Carstens, Shakeel Butt, Thomas Gleixner,
	Vlastimil Babka, Oded Gabbay, linuxppc-dev, Oleg Nesterov,
	Nadav Amit, Andrew Morton, Linus Torvalds, Roman Gushchin,
	Kirill A . Shutemov, Mike Kravetz

On 30.03.22 08:07, Christophe Leroy wrote:
> 
> 
> On 29/03/2022 at 18:43, David Hildenbrand wrote:
>> The swap type is simply stored in bits 0x1f of the swap pte. Let's
>> simplify by just getting rid of _PAGE_BIT_SWAP_TYPE. It's not like that
>> we can simply change it: _PAGE_SWP_SOFT_DIRTY would suddenly fall into
>> _RPAGE_RSV1, which isn't possible and would make the
>> BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY) angry.
>>
>> While at it, make it clearer which bit we're actually using for
>> _PAGE_SWP_SOFT_DIRTY by just using the proper define and introduce and
>> use SWP_TYPE_MASK.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   arch/powerpc/include/asm/book3s/64/pgtable.h | 12 +++++-------
> 
> Why only BOOK3S? Why not BOOK3E as well?

Hi Christophe,

I'm focusing on the most relevant enterprise architectures for now. I
don't have the capacity to convert each and every architecture at this
point (especially, I don't want to waste my time in case this doesn't
get merged, and book3e didn't look straightforward to me).

Once this series hits upstream, I can look into other architectures --
and I'll be happy if other people who have more familiarity with the
architecture-specific swp pte layouts jump in.

Thanks

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 6/8] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43   ` David Hildenbrand
  (?)
@ 2022-03-30 16:48     ` Gerald Schaefer
  -1 siblings, 0 replies; 57+ messages in thread
From: Gerald Schaefer @ 2022-03-30 16:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Matthew Wilcox, Vlastimil Babka, Jann Horn, Michal Hocko,
	Nadav Amit, Rik van Riel, Roman Gushchin, Andrea Arcangeli,
	Peter Xu, Donald Dutile, Christoph Hellwig, Oleg Nesterov,
	Jan Kara, Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	linux-mm, x86, linux-arm-kernel, linuxppc-dev, linux-s390

On Tue, 29 Mar 2022 18:43:27 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's use bit 52, which is unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/include/asm/pgtable.h | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 3982575bb586..a397b072a580 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
>  #define _PAGE_SOFT_DIRTY 0x000
>  #endif
>  
> +#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE	/* SW pte exclusive swap bit */
> +
>  /* Set of bits not changed in pte_modify */
>  #define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_DIRTY | \
>  				 _PAGE_YOUNG | _PAGE_SOFT_DIRTY)
> @@ -826,6 +828,22 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +static inline int pte_swp_exclusive(pte_t pte)
> +{
> +	return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
> +}
> +
> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
> +{
> +	return set_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
> +}
> +
> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> +{
> +	return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
> +}
> +
>  static inline int pte_soft_dirty(pte_t pte)
>  {
>  	return pte_val(pte) & _PAGE_SOFT_DIRTY;
> @@ -1715,14 +1733,15 @@ static inline int has_transparent_hugepage(void)
>   * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
>   * as invalid.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * |			  offset			|X11XX|type |S0|
> + * |			  offset			|E11XX|type |S0|
>   * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>   *
>   * Bits 0-51 store the offset.
> + * Bit 52 (E) is used to remember PG_anon_exclusive.
>   * Bits 57-61 store the type.
>   * Bit 62 (S) is used for softdirty tracking.
> - * Bits 52, 55 and 56 (X) are unused.
> + * Bits 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK	((1UL << 52) - 1)

Thanks David!

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread
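
One detail that is easy to trip over in the quoted layout comment: s390 numbers
pte bits from the most significant bit, so "bit n" in the comment corresponds to
the mask 1UL << (63 - n) in the pte value. A small illustrative model of that
convention (the named bits are the ones from the quoted diff; the numeric
results are derived from the convention, not copied from the kernel):

  #include <assert.h>
  #include <stdint.h>

  /* MSB-first bit numbering, as used in the s390 swap pte comment */
  #define S390_PTE_BIT(n)         (1UL << (63 - (n)))

  int main(void)
  {
          /* bit 52 (E), the new _PAGE_SWP_EXCLUSIVE alias of _PAGE_LARGE */
          assert(S390_PTE_BIT(52) == 0x800);
          /* bit 62 (S), used for softdirty tracking */
          assert(S390_PTE_BIT(62) == 0x2);
          /* bits 57-61 hold the 5-bit type, i.e. value bits 2..6 */
          assert(S390_PTE_BIT(57) == 0x40 && S390_PTE_BIT(61) == 0x4);
          return 0;
  }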

* Re: [PATCH v2 5/8] s390/pgtable: cleanup description of swp pte layout
  2022-03-29 16:43   ` David Hildenbrand
  (?)
@ 2022-03-30 16:48     ` Gerald Schaefer
  -1 siblings, 0 replies; 57+ messages in thread
From: Gerald Schaefer @ 2022-03-30 16:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Linus Torvalds,
	David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe,
	Mike Kravetz, Mike Rapoport, Yang Shi, Kirill A . Shutemov,
	Matthew Wilcox, Vlastimil Babka, Jann Horn, Michal Hocko,
	Nadav Amit, Rik van Riel, Roman Gushchin, Andrea Arcangeli,
	Peter Xu, Donald Dutile, Christoph Hellwig, Oleg Nesterov,
	Jan Kara, Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	linux-mm, x86, linux-arm-kernel, linuxppc-dev, linux-s390

On Tue, 29 Mar 2022 18:43:26 +0200
David Hildenbrand <david@redhat.com> wrote:

> Bit 52 and bit 55 don't have to be zero: they only trigger a
> translation-specification exception if the PTE is marked as valid, which
> is not the case for swap ptes.
> 
> Document which bits are used for what, and which ones are unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/include/asm/pgtable.h | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 9df679152620..3982575bb586 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1712,18 +1712,17 @@ static inline int has_transparent_hugepage(void)
>  /*
>   * 64 bit swap entry format:
>   * A page-table entry has some bits we have to treat in a special way.
> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> - * exception will occur instead of a page translation exception. The
> - * specification exception has the bad habit not to store necessary
> - * information in the lowcore.
> - * Bits 54 and 63 are used to indicate the page type.
> + * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
> + * as invalid.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> - * for the offset.
> - * |			  offset			|01100|type |00|
> + * |			  offset			|X11XX|type |S0|
>   * |0000000000111111111122222222223333333333444444444455|55555|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + *
> + * Bits 0-51 store the offset.
> + * Bits 57-61 store the type.
> + * Bit 62 (S) is used for softdirty tracking.
> + * Bits 52, 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK	((1UL << 52) - 1)

Thanks David!

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-03-29 16:43   ` David Hildenbrand
  (?)
  (?)
@ 2022-04-13  8:58   ` Miaohe Lin
  2022-04-13  9:30     ` David Hildenbrand
  -1 siblings, 1 reply; 57+ messages in thread
From: Miaohe Lin @ 2022-04-13  8:58 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, Linux-MM

On 2022/3/30 0:43, David Hildenbrand wrote:
> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
...
> diff --git a/mm/memory.c b/mm/memory.c
> index 14618f446139..9060cc7f2123 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  						&src_mm->mmlist);
>  			spin_unlock(&mmlist_lock);
>  		}
> +		/* Mark the swap entry as shared. */
> +		if (pte_swp_exclusive(*src_pte)) {
> +			pte = pte_swp_clear_exclusive(*src_pte);
> +			set_pte_at(src_mm, addr, src_pte, pte);
> +		}
>  		rss[MM_SWAPENTS]++;
>  	} else if (is_migration_entry(entry)) {
>  		page = pfn_swap_entry_to_page(entry);
> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	struct page *page = NULL, *swapcache;
>  	struct swap_info_struct *si = NULL;
>  	rmap_t rmap_flags = RMAP_NONE;
> +	bool exclusive = false;
>  	swp_entry_t entry;
>  	pte_t pte;
>  	int locked;
> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>  
> +	/*
> +	 * Check under PT lock (to protect against concurrent fork() sharing
> +	 * the swap entry concurrently) for certainly exclusive pages.
> +	 */
> +	if (!PageKsm(page)) {
> +		/*
> +		 * Note that pte_swp_exclusive() == false for architectures
> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
> +		 */
> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
> +		if (page != swapcache) {
> +			/*
> +			 * We have a fresh page that is not exposed to the
> +			 * swapcache -> certainly exclusive.
> +			 */
> +			exclusive = true;
> +		} else if (exclusive && PageWriteback(page) &&
> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {

Really sorry for the late response, and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
it means concurrent page modifications while under writeback are not supported. For these
problematic swap backends, the exclusive marker is dropped. So the above if statement is meant
to filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
Or am I missing something?

It would be very kind of you to answer this question. Many thanks!

> +			/*
> +			 * This is tricky: not all swap backends support
> +			 * concurrent page modifications while under writeback.
> +			 *
> +			 * So if we stumble over such a page in the swapcache
> +			 * we must not set the page exclusive, otherwise we can
> +			 * map it writable without further checks and modify it
> +			 * while still under writeback.
> +			 *
> +			 * For these problematic swap backends, simply drop the
> +			 * exclusive marker: this is perfectly fine as we start
> +			 * writeback only if we fully unmapped the page and
> +			 * there are no unexpected references on the page after
> +			 * unmapping succeeded. After fully unmapped, no
> +			 * further GUP references (FOLL_GET and FOLL_PIN) can
> +			 * appear, so dropping the exclusive marker and mapping
> +			 * it only R/O is fine.
> +			 */
> +			exclusive = false;
> +		}
> +	}
> +
>  	/*
>  	 * Remove the swap entry and conditionally try to free up the swapcache.
>  	 * We're already holding a reference on the page but haven't mapped it
> @@ -3738,11 +3784,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> -	 * Same logic as in do_wp_page(); however, optimize for fresh pages
> -	 * that are certainly not shared because we just allocated them without
> -	 * exposing them to the swapcache.
> +	 * Same logic as in do_wp_page(); however, optimize for pages that are
> +	 * certainly not shared either because we just allocated them without
> +	 * exposing them to the swapcache or because the swap entry indicates
> +	 * exclusivity.
>  	 */
> -	if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) {
> +	if (!PageKsm(page) && (exclusive || page_count(page) == 1)) {
>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  			vmf->flags &= ~FAULT_FLAG_WRITE;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4de07234cbcf..c8c257d94962 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1656,14 +1656,15 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				break;
>  			}
>  			/*
> -			 * Note: We *don't* remember yet if the page was mapped
> -			 * exclusively in the swap entry, so swapin code has
> -			 * to re-determine that manually and might detect the
> -			 * page as possibly shared, for example, if there are
> -			 * other references on the page or if the page is under
> -			 * writeback. We made sure that there are no GUP pins
> -			 * on the page that would rely on it, so for GUP pins
> -			 * this is fine.
> +			 * Note: We *don't* remember if the page was mapped
> +			 * exclusively in the swap pte if the architecture
> +			 * doesn't support __HAVE_ARCH_PTE_SWP_EXCLUSIVE. In
> +			 * that case, swapin code has to re-determine that
> +			 * manually and might detect the page as possibly
> +			 * shared, for example, if there are other references on
> +			 * the page or if the page is under writeback. We made
> +			 * sure that there are no GUP pins on the page that
> +			 * would rely on it, so for GUP pins this is fine.
>  			 */
>  			if (list_empty(&mm->mmlist)) {
>  				spin_lock(&mmlist_lock);
> @@ -1674,6 +1675,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			dec_mm_counter(mm, MM_ANONPAGES);
>  			inc_mm_counter(mm, MM_SWAPENTS);
>  			swp_pte = swp_entry_to_pte(entry);
> +			if (anon_exclusive)
> +				swp_pte = pte_swp_mkexclusive(swp_pte);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			if (pte_uffd_wp(pteval))
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a7847324d476..7279b2d2d71d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1804,7 +1804,18 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>  	get_page(page);
>  	if (page == swapcache) {
> -		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
> +		rmap_t rmap_flags = RMAP_NONE;
> +
> +		/*
> +		 * See do_swap_page(): PageWriteback() would be problematic.
> +		 * However, we do a wait_on_page_writeback() just before this
> +		 * call and have the page locked.
> +		 */
> +		VM_BUG_ON_PAGE(PageWriteback(page), page);
> +		if (pte_swp_exclusive(*pte))
> +			rmap_flags |= RMAP_EXCLUSIVE;
> +
> +		page_add_anon_rmap(page, vma, addr, rmap_flags);
>  	} else { /* ksm created a completely new copy */
>  		page_add_new_anon_rmap(page, vma, addr);
>  		lru_cache_add_inactive_or_unevictable(page, vma);
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-13  8:58   ` Miaohe Lin
@ 2022-04-13  9:30     ` David Hildenbrand
  2022-04-13  9:38       ` Miaohe Lin
  0 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2022-04-13  9:30 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: linux-kernel, Linux-MM

On 13.04.22 10:58, Miaohe Lin wrote:
> On 2022/3/30 0:43, David Hildenbrand wrote:
>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
> ...
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 14618f446139..9060cc7f2123 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>  						&src_mm->mmlist);
>>  			spin_unlock(&mmlist_lock);
>>  		}
>> +		/* Mark the swap entry as shared. */
>> +		if (pte_swp_exclusive(*src_pte)) {
>> +			pte = pte_swp_clear_exclusive(*src_pte);
>> +			set_pte_at(src_mm, addr, src_pte, pte);
>> +		}
>>  		rss[MM_SWAPENTS]++;
>>  	} else if (is_migration_entry(entry)) {
>>  		page = pfn_swap_entry_to_page(entry);
>> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	struct page *page = NULL, *swapcache;
>>  	struct swap_info_struct *si = NULL;
>>  	rmap_t rmap_flags = RMAP_NONE;
>> +	bool exclusive = false;
>>  	swp_entry_t entry;
>>  	pte_t pte;
>>  	int locked;
>> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>  
>> +	/*
>> +	 * Check under PT lock (to protect against concurrent fork() sharing
>> +	 * the swap entry concurrently) for certainly exclusive pages.
>> +	 */
>> +	if (!PageKsm(page)) {
>> +		/*
>> +		 * Note that pte_swp_exclusive() == false for architectures
>> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
>> +		 */
>> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
>> +		if (page != swapcache) {
>> +			/*
>> +			 * We have a fresh page that is not exposed to the
>> +			 * swapcache -> certainly exclusive.
>> +			 */
>> +			exclusive = true;
>> +		} else if (exclusive && PageWriteback(page) &&
>> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
> 
> Really sorry for late respond and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
> it means concurrent page modifications while under writeback is not supported. For these
> problematic swap backends, exclusive marker is dropped. So the above if statement is to
> filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
> above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
> Or am I miss something?

Oh, thanks for your careful eyes!

Indeed, SWP_STABLE_WRITES indicates that the backend *requires* stable
writes, meaning, we must not modify the page while writeback is active.

So if and only if that is set, we must drop the exclusive marker.

This essentially corresponds to previous reuse_swap_page() logic:

bool reuse_swap_page(struct page *page)
{
...
	if (!PageWriteback(page)) {
		...
	} else {
		...
		if (p->flags & SWP_STABLE_WRITES) {
			spin_unlock(&p->lock);
			return false;
		}
...
}

Fortunately, this only affects such backends. For backends without
SWP_STABLE_WRITES, the current code is simply sub-optimal.


So yes, this has to be

} else if (exclusive && PageWriteback(page) &&
	   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {


Let me try finding a way to test this; the tests I was running so far
were apparently not using a backend with SWP_STABLE_WRITES.
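
(In case anyone wonders which backends those are: IIRC, SWP_STABLE_WRITES is
derived at swapon time from the block device's stable-writes queue flag.
Paraphrasing from memory rather than quoting mm/swapfile.c, so take this as a
rough sketch only:

	/* swapon(): rough sketch from memory, not a verbatim quote */
	if (p->bdev && blk_queue_stable_writes(p->bdev->bd_disk->queue))
		p->flags |= SWP_STABLE_WRITES;

IOW, only swap devices whose request queue advertises stable writes are
affected; everything else keeps the optimization.)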

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-13  9:30     ` David Hildenbrand
@ 2022-04-13  9:38       ` Miaohe Lin
  2022-04-13 10:46         ` David Hildenbrand
  2022-04-13 12:31         ` David Hildenbrand
  0 siblings, 2 replies; 57+ messages in thread
From: Miaohe Lin @ 2022-04-13  9:38 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, Linux-MM

On 2022/4/13 17:30, David Hildenbrand wrote:
> On 13.04.22 10:58, Miaohe Lin wrote:
>> On 2022/3/30 0:43, David Hildenbrand wrote:
>>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>> ...
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 14618f446139..9060cc7f2123 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>  						&src_mm->mmlist);
>>>  			spin_unlock(&mmlist_lock);
>>>  		}
>>> +		/* Mark the swap entry as shared. */
>>> +		if (pte_swp_exclusive(*src_pte)) {
>>> +			pte = pte_swp_clear_exclusive(*src_pte);
>>> +			set_pte_at(src_mm, addr, src_pte, pte);
>>> +		}
>>>  		rss[MM_SWAPENTS]++;
>>>  	} else if (is_migration_entry(entry)) {
>>>  		page = pfn_swap_entry_to_page(entry);
>>> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  	struct page *page = NULL, *swapcache;
>>>  	struct swap_info_struct *si = NULL;
>>>  	rmap_t rmap_flags = RMAP_NONE;
>>> +	bool exclusive = false;
>>>  	swp_entry_t entry;
>>>  	pte_t pte;
>>>  	int locked;
>>> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>>  
>>> +	/*
>>> +	 * Check under PT lock (to protect against concurrent fork() sharing
>>> +	 * the swap entry concurrently) for certainly exclusive pages.
>>> +	 */
>>> +	if (!PageKsm(page)) {
>>> +		/*
>>> +		 * Note that pte_swp_exclusive() == false for architectures
>>> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
>>> +		 */
>>> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
>>> +		if (page != swapcache) {
>>> +			/*
>>> +			 * We have a fresh page that is not exposed to the
>>> +			 * swapcache -> certainly exclusive.
>>> +			 */
>>> +			exclusive = true;
>>> +		} else if (exclusive && PageWriteback(page) &&
>>> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>
>> Really sorry for late respond and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
>> it means concurrent page modifications while under writeback is not supported. For these
>> problematic swap backends, exclusive marker is dropped. So the above if statement is to
>> filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
>> above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
>> Or am I miss something?
> 
> Oh, thanks for your careful eyes!
> 
> Indeed, SWP_STABLE_WRITES indicates that the backend *requires* stable
> writes, meaning, we must not modify the page while writeback is active.
> 
> So if and only if that is set, we must drop the exclusive marker.
> 
> This essentially corresponds to previous reuse_swap_page() logic:
> 
> bool reuse_swap_page(struct page *page)
> {
> ...
> 	if (!PageWriteback(page)) {
> 		...
> 	} else {
> 		...
> 		if (p->flags & SWP_STABLE_WRITES) {
> 			spin_unlock(&p->lock);
> 			return false;
> 		}
> ...
> }
> 
> Fortunately, this only affects such backends. For backends without
> SWP_STABLE_WRITES, the current code is simply sub-optimal.
> 
> 
> So yes, this has to be
> 
> } else if (exclusive && PageWriteback(page) &&
> 	   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
> 

I am glad that my question helps. :)

> 
> Let me try finding a way to test this, the tests I was running so far
> were apparently not using a backend with SWP_STABLE_WRITES.
> 

That will be really helpful. Many thanks for your hard work!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-13  9:38       ` Miaohe Lin
@ 2022-04-13 10:46         ` David Hildenbrand
  2022-04-13 12:31         ` David Hildenbrand
  1 sibling, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-04-13 10:46 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: linux-kernel, Linux-MM

On 13.04.22 11:38, Miaohe Lin wrote:
> On 2022/4/13 17:30, David Hildenbrand wrote:
>> On 13.04.22 10:58, Miaohe Lin wrote:
>>> On 2022/3/30 0:43, David Hildenbrand wrote:
>>>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>>> ...
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 14618f446139..9060cc7f2123 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>  						&src_mm->mmlist);
>>>>  			spin_unlock(&mmlist_lock);
>>>>  		}
>>>> +		/* Mark the swap entry as shared. */
>>>> +		if (pte_swp_exclusive(*src_pte)) {
>>>> +			pte = pte_swp_clear_exclusive(*src_pte);
>>>> +			set_pte_at(src_mm, addr, src_pte, pte);
>>>> +		}
>>>>  		rss[MM_SWAPENTS]++;
>>>>  	} else if (is_migration_entry(entry)) {
>>>>  		page = pfn_swap_entry_to_page(entry);
>>>> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	struct page *page = NULL, *swapcache;
>>>>  	struct swap_info_struct *si = NULL;
>>>>  	rmap_t rmap_flags = RMAP_NONE;
>>>> +	bool exclusive = false;
>>>>  	swp_entry_t entry;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>>>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>>>  
>>>> +	/*
>>>> +	 * Check under PT lock (to protect against concurrent fork() sharing
>>>> +	 * the swap entry concurrently) for certainly exclusive pages.
>>>> +	 */
>>>> +	if (!PageKsm(page)) {
>>>> +		/*
>>>> +		 * Note that pte_swp_exclusive() == false for architectures
>>>> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
>>>> +		 */
>>>> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
>>>> +		if (page != swapcache) {
>>>> +			/*
>>>> +			 * We have a fresh page that is not exposed to the
>>>> +			 * swapcache -> certainly exclusive.
>>>> +			 */
>>>> +			exclusive = true;
>>>> +		} else if (exclusive && PageWriteback(page) &&
>>>> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>>
>>> Really sorry for late respond and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
>>> it means concurrent page modifications while under writeback is not supported. For these
>>> problematic swap backends, exclusive marker is dropped. So the above if statement is to
>>> filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
>>> above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
>>> Or am I miss something?
>>
>> Oh, thanks for your careful eyes!
>>
>> Indeed, SWP_STABLE_WRITES indicates that the backend *requires* stable
>> writes, meaning, we must not modify the page while writeback is active.
>>
>> So if and only if that is set, we must drop the exclusive marker.
>>
>> This essentially corresponds to previous reuse_swap_page() logic:
>>
>> bool reuse_swap_page(struct page *page)
>> {
>> ...
>> 	if (!PageWriteback(page)) {
>> 		...
>> 	} else {
>> 		...
>> 		if (p->flags & SWP_STABLE_WRITES) {
>> 			spin_unlock(&p->lock);
>> 			return false;
>> 		}
>> ...
>> }
>>
>> Fortunately, this only affects such backends. For backends without
>> SWP_STABLE_WRITES, the current code is simply sub-optimal.
>>
>>
>> So yes, this has to be
>>
>> } else if (exclusive && PageWriteback(page) &&
>> 	   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>
> 
> I am glad that my question helps. :)
> 

This is the kind of review I was hoping for :)


@Andrew, the following change is necessary:

diff --git a/mm/memory.c b/mm/memory.c
index 3ad39bd66203..8b3cb73f5e44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3747,7 +3747,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                         */
                        exclusive = true;
                } else if (exclusive && PageWriteback(page) &&
-                          !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
+                          (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
                        /*
                         * This is tricky: not all swap backends support
                         * concurrent page modifications while under writeback.


Do you:

a) Want to squash it
b) Want me to resend a new version of this patch only
c) Want me to resend a new version of the patch set

In the meantime, I'll try testing with a suitable backend. IIRC, zram should do the trick.

-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-13  9:38       ` Miaohe Lin
  2022-04-13 10:46         ` David Hildenbrand
@ 2022-04-13 12:31         ` David Hildenbrand
  2022-04-14  2:40           ` Miaohe Lin
  1 sibling, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2022-04-13 12:31 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: linux-kernel, Linux-MM, Minchan Kim

On 13.04.22 11:38, Miaohe Lin wrote:
> On 2022/4/13 17:30, David Hildenbrand wrote:
>> On 13.04.22 10:58, Miaohe Lin wrote:
>>> On 2022/3/30 0:43, David Hildenbrand wrote:
>>>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>>> ...
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 14618f446139..9060cc7f2123 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>  						&src_mm->mmlist);
>>>>  			spin_unlock(&mmlist_lock);
>>>>  		}
>>>> +		/* Mark the swap entry as shared. */
>>>> +		if (pte_swp_exclusive(*src_pte)) {
>>>> +			pte = pte_swp_clear_exclusive(*src_pte);
>>>> +			set_pte_at(src_mm, addr, src_pte, pte);
>>>> +		}
>>>>  		rss[MM_SWAPENTS]++;
>>>>  	} else if (is_migration_entry(entry)) {
>>>>  		page = pfn_swap_entry_to_page(entry);
>>>> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	struct page *page = NULL, *swapcache;
>>>>  	struct swap_info_struct *si = NULL;
>>>>  	rmap_t rmap_flags = RMAP_NONE;
>>>> +	bool exclusive = false;
>>>>  	swp_entry_t entry;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>>>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>>>  
>>>> +	/*
>>>> +	 * Check under PT lock (to protect against concurrent fork() sharing
>>>> +	 * the swap entry concurrently) for certainly exclusive pages.
>>>> +	 */
>>>> +	if (!PageKsm(page)) {
>>>> +		/*
>>>> +		 * Note that pte_swp_exclusive() == false for architectures
>>>> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
>>>> +		 */
>>>> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
>>>> +		if (page != swapcache) {
>>>> +			/*
>>>> +			 * We have a fresh page that is not exposed to the
>>>> +			 * swapcache -> certainly exclusive.
>>>> +			 */
>>>> +			exclusive = true;
>>>> +		} else if (exclusive && PageWriteback(page) &&
>>>> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>>
>>> Really sorry for late respond and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
>>> it means concurrent page modifications while under writeback is not supported. For these
>>> problematic swap backends, exclusive marker is dropped. So the above if statement is to
>>> filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
>>> above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
>>> Or am I miss something?
>>
>> Oh, thanks for your careful eyes!
>>
>> Indeed, SWP_STABLE_WRITES indicates that the backend *requires* stable
>> writes, meaning, we must not modify the page while writeback is active.
>>
>> So if and only if that is set, we must drop the exclusive marker.
>>
>> This essentially corresponds to previous reuse_swap_page() logic:
>>
>> bool reuse_swap_page(struct page *page)
>> {
>> ...
>> 	if (!PageWriteback(page)) {
>> 		...
>> 	} else {
>> 		...
>> 		if (p->flags & SWP_STABLE_WRITES) {
>> 			spin_unlock(&p->lock);
>> 			return false;
>> 		}
>> ...
>> }
>>
>> Fortunately, this only affects such backends. For backends without
>> SWP_STABLE_WRITES, the current code is simply sub-optimal.
>>
>>
>> So yes, this has to be
>>
>> } else if (exclusive && PageWriteback(page) &&
>> 	   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>
> 
> I am glad that my question helps. :)
> 
>>
>> Let me try finding a way to test this, the tests I was running so far
>> were apparently not using a backend with SWP_STABLE_WRITES.
>>
> 
> That will be really helpful. Many thanks for your hard work!
> 

FWIW, I tried with zram, which sets SWP_STABLE_WRITES ... but it seems
to always do synchronous writeback, so it cannot really trigger this
code path.
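
(Paraphrased from memory rather than quoted: zram advertises stable writes
on its request queue when the device is created, which swapon() then turns
into SWP_STABLE_WRITES -- something along the lines of:

	/* drivers/block/zram/zram_drv.c, device creation -- rough sketch */
	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);

So zram is the right kind of backend; it just doesn't seem to give us the
asynchronous writeback window we need here.)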

commit f05714293a591038304ddae7cb0dd747bb3786cc
Author: Minchan Kim <minchan@kernel.org>
Date:   Tue Jan 10 16:58:15 2017 -0800

    mm: support anonymous stable page


mentions "During developemnt for zram-swap asynchronous writeback,";
maybe that can be activated somehow? Putting Minchan on CC.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-13 12:31         ` David Hildenbrand
@ 2022-04-14  2:40           ` Miaohe Lin
  0 siblings, 0 replies; 57+ messages in thread
From: Miaohe Lin @ 2022-04-14  2:40 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: linux-kernel, Linux-MM, Minchan Kim

On 2022/4/13 20:31, David Hildenbrand wrote:
> On 13.04.22 11:38, Miaohe Lin wrote:
>> On 2022/4/13 17:30, David Hildenbrand wrote:
>>> On 13.04.22 10:58, Miaohe Lin wrote:
>>>> On 2022/3/30 0:43, David Hildenbrand wrote:
>>>>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>>>> ...
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 14618f446139..9060cc7f2123 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -792,6 +792,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>>  						&src_mm->mmlist);
>>>>>  			spin_unlock(&mmlist_lock);
>>>>>  		}
>>>>> +		/* Mark the swap entry as shared. */
>>>>> +		if (pte_swp_exclusive(*src_pte)) {
>>>>> +			pte = pte_swp_clear_exclusive(*src_pte);
>>>>> +			set_pte_at(src_mm, addr, src_pte, pte);
>>>>> +		}
>>>>>  		rss[MM_SWAPENTS]++;
>>>>>  	} else if (is_migration_entry(entry)) {
>>>>>  		page = pfn_swap_entry_to_page(entry);
>>>>> @@ -3559,6 +3564,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>  	struct page *page = NULL, *swapcache;
>>>>>  	struct swap_info_struct *si = NULL;
>>>>>  	rmap_t rmap_flags = RMAP_NONE;
>>>>> +	bool exclusive = false;
>>>>>  	swp_entry_t entry;
>>>>>  	pte_t pte;
>>>>>  	int locked;
>>>>> @@ -3724,6 +3730,46 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>  	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
>>>>>  	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
>>>>>  
>>>>> +	/*
>>>>> +	 * Check under PT lock (to protect against concurrent fork() sharing
>>>>> +	 * the swap entry concurrently) for certainly exclusive pages.
>>>>> +	 */
>>>>> +	if (!PageKsm(page)) {
>>>>> +		/*
>>>>> +		 * Note that pte_swp_exclusive() == false for architectures
>>>>> +		 * without __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
>>>>> +		 */
>>>>> +		exclusive = pte_swp_exclusive(vmf->orig_pte);
>>>>> +		if (page != swapcache) {
>>>>> +			/*
>>>>> +			 * We have a fresh page that is not exposed to the
>>>>> +			 * swapcache -> certainly exclusive.
>>>>> +			 */
>>>>> +			exclusive = true;
>>>>> +		} else if (exclusive && PageWriteback(page) &&
>>>>> +			   !(swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>>>
>>>> Really sorry for late respond and a newbie question. IIUC, if SWP_STABLE_WRITES is set,
>>>> it means concurrent page modifications while under writeback is not supported. For these
>>>> problematic swap backends, exclusive marker is dropped. So the above if statement is to
>>>> filter out these problematic swap backends which have SWP_STABLE_WRITES set. If so, the
>>>> above check should be && (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)), i.e. no "!".
>>>> Or am I miss something?
>>>
>>> Oh, thanks for your careful eyes!
>>>
>>> Indeed, SWP_STABLE_WRITES indicates that the backend *requires* stable
>>> writes, meaning, we must not modify the page while writeback is active.
>>>
>>> So if and only if that is set, we must drop the exclusive marker.
>>>
>>> This essentially corresponds to previous reuse_swap_page() logic:
>>>
>>> bool reuse_swap_page(struct page *page)
>>> {
>>> ...
>>> 	if (!PageWriteback(page)) {
>>> 		...
>>> 	} else {
>>> 		...
>>> 		if (p->flags & SWP_STABLE_WRITES) {
>>> 			spin_unlock(&p->lock);
>>> 			return false;
>>> 		}
>>> ...
>>> }
>>>
>>> Fortunately, this only affects such backends. For backends without
>>> SWP_STABLE_WRITES, the current code is simply sub-optimal.
>>>
>>>
>>> So yes, this has to be
>>>
>>> } else if (exclusive && PageWriteback(page) &&
>>> 	   (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
>>>
>>
>> I am glad that my question helps. :)
>>
>>>
>>> Let me try finding a way to test this, the tests I was running so far
>>> were apparently not using a backend with SWP_STABLE_WRITES.
>>>
>>
>> That will be really helpful. Many thanks for your hard work!
>>
> 
> FWIW, I tried with zram, which sets SWP_STABLE_WRITES ... but, it seems
> to always do a synchronous writeback, so it cannot really trigger this
> code path.

That's a pity. We really need asynchronous writeback to trigger this code path.

> 
> commit f05714293a591038304ddae7cb0dd747bb3786cc
> Author: Minchan Kim <minchan@kernel.org>
> Date:   Tue Jan 10 16:58:15 2017 -0800
> 
>     mm: support anonymous stable page
> 
> 
> mentions "During developemnt for zram-swap asynchronous writeback,";
> maybe that can be activated somehow? Putting Minchan on CC.
> 

ZRAM_WRITEBACK might need to be configured to enable asynchronous IO:

+
+config ZRAM_WRITEBACK
+       bool "Write back incompressible page to backing device"
+       depends on ZRAM
+       default n
+       help
+        With incompressible page, there is no memory saving to keep it
+        in memory. Instead, write it out to backing device.
+        For this feature, admin should set up backing device via
+        /sys/block/zramX/backing_dev.
+
+        See zram.txt for more infomration.

It seems there is only asynchronous IO for swapin ops. I browsed the source code
and could only find read_from_bdev_async. But I'm not familiar with the zram code.
Minchan might kindly help us solve this question.

Thanks!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/8] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43   ` David Hildenbrand
@ 2022-04-19 12:46     ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-04-19 12:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit,
	Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu,
	Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara,
	Liang Zhang, Pedro Gomes, Oded Gabbay, Catalin Marinas,
	Will Deacon, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390

On 29.03.22 18:43, David Hildenbrand wrote:
> Let's use bit 3 to remember PG_anon_exclusive in swap ptes.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Looks like I missed that 32-bit uses a different (undocumented) swap layout,
where bit 3 falls into the swp type. We'll restrict this to x86-64 for now,
just like for the other architectures.

The following seems to fly. @Andrew, let me know if you prefer a v3.


From bafb5ba914e89ad20c46f4e841a36909e610b81e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 9 Mar 2022 09:47:29 +0100
Subject: [PATCH] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on x86_64

Let's use bit 3 to remember PG_anon_exclusive in swap ptes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/include/asm/pgtable.h          | 17 +++++++++++++++++
 arch/x86/include/asm/pgtable_64.h       |  4 +++-
 arch/x86/include/asm/pgtable_64_types.h |  5 +++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..a1c555abed26 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1291,6 +1291,23 @@ static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
 		unsigned long addr, pud_t *pud)
 {
 }
+#ifdef _PAGE_SWP_EXCLUSIVE
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+#endif /* _PAGE_SWP_EXCLUSIVE */
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 56d0399a0cd1..e479491da8d5 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -186,7 +186,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| E|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -203,6 +203,8 @@ static inline void native_pgd_clear(pgd_t *pgd)
  * F (2) in swp entry is used to record when a pagetable is
  * writeprotected by userfaultfd WP support.
  *
+ * E (3) in swp entry is used to remember PG_anon_exclusive.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 91ac10654570..70e360a2e5fb 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -163,4 +163,9 @@ extern unsigned int ptrs_per_p4d;
 
 #define PGD_KERNEL_START	((PAGE_SIZE / 2) / sizeof(pgd_t))
 
+/*
+ * We borrow bit 3 to remember PG_anon_exclusive.
+ */
+#define _PAGE_SWP_EXCLUSIVE	_PAGE_PWT
+
 #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
-- 
2.35.1





-- 
Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-03-29 16:43   ` David Hildenbrand
@ 2022-04-20 17:10     ` Vlastimil Babka
  -1 siblings, 0 replies; 57+ messages in thread
From: Vlastimil Babka @ 2022-04-20 17:10 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390

On 3/29/22 18:43, David Hildenbrand wrote:
> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
> it. We do this, to keep fork() logic on swap entries easy and efficient:
> for example, if we wouldn't clear it when unmapping, we'd have to lookup
> the page in the swapcache for each and every swap entry during fork() and
> clear PG_anon_exclusive if set.
> 
> Instead, we want to store that information directly in the swap pte,
> protected by the page table lock, similarly to how we handle
> SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
> swap entries, we don't want to mess with the swap type (e.g., still one
> bit) because it overcomplicates swap code.
> 
> In try_to_unmap(), we already reject to unmap in case the page might be
> pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
> Checking if there are other unexpected references reliably *before*
> completely unmapping a page is unfortunately not really possible: THP
> heavily overcomplicate the situation. Once fully unmapped it's easier --
> we, for example, make sure that there are no unexpected references
> *after* unmapping a page before starting writeback on that page.
> 
> So, we currently might end up unmapping a page and clearing
> PG_anon_exclusive if that page has additional references, for example,
> due to a FOLL_GET.
> 
> do_swap_page() has to re-determine if a page is exclusive, which will
> easily fail if there are other references on a page, most prominently
> GUP references via FOLL_GET. This can currently result in memory
> corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
> when fork() is never involved: try_to_unmap() will succeed, and when
> refaulting the page, it cannot be marked exclusive and will get replaced
> by a copy in the page tables on the next write access, resulting in writes
> via the GUP reference to the page being lost.
> 
> In an ideal world, everybody that uses GUP and wants to modify page
> content, such as O_DIRECT, would properly use FOLL_PIN. However, that
> conversion will take a while. It's easier to fix what used to work in the
> past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
> by remembering PG_anon_exclusive we can further reduce unnecessary COW
> in some cases, so it's the natural thing to do.
> 
> So let's transfer the PG_anon_exclusive information to the swap pte and
> store it via an architecture-dependant pte bit; use that information when
> restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
> simply have to clear the pte bit and are done.
> 
> Of course, there is one corner case to handle: swap backends that don't
> support concurrent page modifications while the page is under writeback.
> Special case these, and drop the exclusive marker. Add a comment why that
> is just fine (also, reuse_swap_page() would have done the same in the
> past).
> 
> In the future, we'll hopefully have all architectures support
> __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
> stubs and the define completely. Then, we can also convert
> SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
> support: either simply use a yet unused pte bit that can be used for swap
> entries, steal one from the arch type bits if they exceed 5, or steal one
> from the offset bits.
> 
> Note: R/O FOLL_GET references were never really reliable, especially
> when taking one on a shared page and then writing to the page (e.g., GUP
> after fork()). FOLL_GET, including R/W references, were never really
> reliable once fork was involved (e.g., GUP before fork(),
> GUP during fork()). KSM steps back in case it stumbles over unexpected
> references and is, therefore, fine.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

With the fixup as reported by Miaohe Lin

Acked-by: Vlastimil Babka <vbabka@suse.cz>

(sent a separate mm-commits mail to inquire about the fix going missing from
mmotm)

https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4baec94@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
  2022-04-20 17:10     ` Vlastimil Babka
@ 2022-04-20 17:13       ` David Hildenbrand
  -1 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2022-04-20 17:13 UTC (permalink / raw)
  To: Vlastimil Babka, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390

On 20.04.22 19:10, Vlastimil Babka wrote:
> On 3/29/22 18:43, David Hildenbrand wrote:
>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>> it. We do this, to keep fork() logic on swap entries easy and efficient:
>> for example, if we wouldn't clear it when unmapping, we'd have to lookup
>> the page in the swapcache for each and every swap entry during fork() and
>> clear PG_anon_exclusive if set.
>>
>> Instead, we want to store that information directly in the swap pte,
>> protected by the page table lock, similarly to how we handle
>> SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
>> swap entries, we don't want to mess with the swap type (e.g., still one
>> bit) because it overcomplicates swap code.
>>
>> In try_to_unmap(), we already reject to unmap in case the page might be
>> pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
>> Checking if there are other unexpected references reliably *before*
>> completely unmapping a page is unfortunately not really possible: THP
>> heavily overcomplicate the situation. Once fully unmapped it's easier --
>> we, for example, make sure that there are no unexpected references
>> *after* unmapping a page before starting writeback on that page.
>>
>> So, we currently might end up unmapping a page and clearing
>> PG_anon_exclusive if that page has additional references, for example,
>> due to a FOLL_GET.
>>
>> do_swap_page() has to re-determine if a page is exclusive, which will
>> easily fail if there are other references on a page, most prominently
>> GUP references via FOLL_GET. This can currently result in memory
>> corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
>> when fork() is never involved: try_to_unmap() will succeed, and when
>> refaulting the page, it cannot be marked exclusive and will get replaced
>> by a copy in the page tables on the next write access, resulting in writes
>> via the GUP reference to the page being lost.
>>
>> In an ideal world, everybody that uses GUP and wants to modify page
>> content, such as O_DIRECT, would properly use FOLL_PIN. However, that
>> conversion will take a while. It's easier to fix what used to work in the
>> past (FOLL_GET | FOLL_WRITE) by remembering PG_anon_exclusive. In addition,
>> by remembering PG_anon_exclusive we can further reduce unnecessary COW
>> in some cases, so it's the natural thing to do.
>>
>> So let's transfer the PG_anon_exclusive information to the swap pte and
>> store it via an architecture-dependent pte bit; use that information when
>> restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
>> simply have to clear the pte bit and are done.
>>
>> Of course, there is one corner case to handle: swap backends that don't
>> support concurrent page modifications while the page is under writeback.
>> Special case these, and drop the exclusive marker. Add a comment why that
>> is just fine (also, reuse_swap_page() would have done the same in the
>> past).
>>
>> In the future, we'll hopefully have all architectures support
>> __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
>> stubs and the define completely. Then, we can also convert
>> SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
>> support: either simply use a yet unused pte bit that can be used for swap
>> entries, steal one from the arch type bits if they exceed 5, or steal one
>> from the offset bits.
>>
>> Note: R/O FOLL_GET references were never really reliable, especially
>> when taking one on a shared page and then writing to the page (e.g., GUP
>> after fork()). FOLL_GET, including R/W references, were never really
>> reliable once fork was involved (e.g., GUP before fork(),
>> GUP during fork()). KSM steps back in case it stumbles over unexpected
>> references and is, therefore, fine.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> With the fixup as reported by Miaohe Lin
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> (sent a separate mm-commits mail to inquire about the fix going missing from
> mmotm)
> 
> https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4baec94@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb

Yes I saw that, thanks for catching that!


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread
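
As a concrete illustration of the "steal a spare pte bit" approach described
in the commit message above, the per-architecture side could look roughly like
the sketch below. The _PAGE_SWP_EXCLUSIVE name, the choice of software bit,
and the pte_flags()/pte_set_flags()/pte_clear_flags() accessors are
assumptions modelled on x86-style page-table code, not the exact hunks of this
series:

/*
 * Illustrative sketch for an arch/<arch>/include/asm/pgtable.h: pick a pte
 * bit that is known to be unused in that architecture's swap pte encoding
 * and provide the three helpers the generic code relies on.
 */
#define _PAGE_SWP_EXCLUSIVE	_PAGE_SOFTW3	/* hypothetical: any free software bit */
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE

static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
	/* Remember that the page this swap pte refers to was PG_anon_exclusive. */
	return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
}

static inline int pte_swp_exclusive(pte_t pte)
{
	return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
}

static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
	/* Cleared during fork(): the child must never see the page as exclusive. */
	return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
}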

* Re: [PATCH v2 2/8] mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  2022-03-29 16:43   ` David Hildenbrand
  (?)
@ 2022-04-20 17:14     ` Vlastimil Babka
  -1 siblings, 0 replies; 57+ messages in thread
From: Vlastimil Babka @ 2022-04-20 17:14 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes,
	Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz,
	Mike Rapoport, Yang Shi, Kirill A . Shutemov, Matthew Wilcox,
	Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel,
	Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
	Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang,
	Pedro Gomes, Oded Gabbay, Catalin Marinas, Will Deacon,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Gerald Schaefer, linux-mm, x86, linux-arm-kernel, linuxppc-dev,
	linux-s390

On 3/29/22 18:43, David Hildenbrand wrote:
> Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/debug_vm_pgtable.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index db2abd9e415b..55f1a8dc716f 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -837,6 +837,19 @@ static void __init pmd_soft_dirty_tests(struct pgtable_debug_args *args) { }
>  static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) { }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)
> +{
> +#ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
> +
> +	pr_debug("Validating PTE swap exclusive\n");
> +	pte = pte_swp_mkexclusive(pte);
> +	WARN_ON(!pte_swp_exclusive(pte));

I guess only this WARN_ON must be guarded by the #ifdef, but it doesn't matter
that much - it won't gain significantly more test coverage (see the sketch
after this message).

> +	pte = pte_swp_clear_exclusive(pte);
> +	WARN_ON(pte_swp_exclusive(pte));
> +#endif /* __HAVE_ARCH_PTE_SWP_EXCLUSIVE */
> +}
> +
>  static void __init pte_swap_tests(struct pgtable_debug_args *args)
>  {
>  	swp_entry_t swp;
> @@ -1288,6 +1301,8 @@ static int __init debug_vm_pgtable(void)
>  	pte_swap_soft_dirty_tests(&args);
>  	pmd_swap_soft_dirty_tests(&args);
>  
> +	pte_swap_exclusive_tests(&args);
> +
>  	pte_swap_tests(&args);
>  	pmd_swap_tests(&args);
>  


^ permalink raw reply	[flat|nested] 57+ messages in thread
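
Following up on the remark above about guarding only the first WARN_ON with
the #ifdef: a possible restructuring is sketched below. It assumes the generic
fallback stubs keep pte_swp_exclusive() returning false and pass the pte
through unchanged on architectures without __HAVE_ARCH_PTE_SWP_EXCLUSIVE, so
only the "bit is observable after setting" check still needs the guard:

static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)
{
	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);

	pr_debug("Validating PTE swap exclusive\n");
	pte = pte_swp_mkexclusive(pte);
#ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
	/* Only architectures with real support can observe the bit being set. */
	WARN_ON(!pte_swp_exclusive(pte));
#endif /* __HAVE_ARCH_PTE_SWP_EXCLUSIVE */
	pte = pte_swp_clear_exclusive(pte);
	/* After clearing, the bit must read back as unset everywhere. */
	WARN_ON(pte_swp_exclusive(pte));
}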

end of thread, other threads:[~2022-04-20 17:22 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-29 16:43 [PATCH v2 0/8] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages David Hildenbrand
2022-03-29 16:43 ` David Hildenbrand
2022-03-29 16:43 ` David Hildenbrand
2022-03-29 16:43 ` [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-04-13  8:58   ` Miaohe Lin
2022-04-13  9:30     ` David Hildenbrand
2022-04-13  9:38       ` Miaohe Lin
2022-04-13 10:46         ` David Hildenbrand
2022-04-13 12:31         ` David Hildenbrand
2022-04-14  2:40           ` Miaohe Lin
2022-04-20 17:10   ` Vlastimil Babka
2022-04-20 17:10     ` Vlastimil Babka
2022-04-20 17:10     ` Vlastimil Babka
2022-04-20 17:13     ` David Hildenbrand
2022-04-20 17:13       ` David Hildenbrand
2022-04-20 17:13       ` David Hildenbrand
2022-03-29 16:43 ` [PATCH v2 2/8] mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-04-20 17:14   ` Vlastimil Babka
2022-04-20 17:14     ` Vlastimil Babka
2022-04-20 17:14     ` Vlastimil Babka
2022-03-29 16:43 ` [PATCH v2 3/8] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-04-19 12:46   ` David Hildenbrand
2022-04-19 12:46     ` David Hildenbrand
2022-04-19 12:46     ` David Hildenbrand
2022-03-29 16:43 ` [PATCH v2 4/8] arm64/pgtable: " David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43 ` [PATCH v2 5/8] s390/pgtable: cleanup description of swp pte layout David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-30 16:48   ` Gerald Schaefer
2022-03-30 16:48     ` Gerald Schaefer
2022-03-30 16:48     ` Gerald Schaefer
2022-03-29 16:43 ` [PATCH v2 6/8] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-30 16:48   ` Gerald Schaefer
2022-03-30 16:48     ` Gerald Schaefer
2022-03-30 16:48     ` Gerald Schaefer
2022-03-29 16:43 ` [PATCH v2 7/8] powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-30  6:07   ` Christophe Leroy
2022-03-30  6:07     ` Christophe Leroy
2022-03-30  6:07     ` Christophe Leroy
2022-03-30  6:58     ` David Hildenbrand
2022-03-30  6:58       ` David Hildenbrand
2022-03-30  6:58       ` David Hildenbrand
2022-03-29 16:43 ` [PATCH v2 8/8] powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE " David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
2022-03-29 16:43   ` David Hildenbrand
