* [PATCH 00/12] mm: free retracted page table by RCU
From: Hugh Dickins @ 2023-05-29  6:11 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Here is the third series of patches to mm (and a few architectures), based
on v6.4-rc3 with the preceding two series applied: in which khugepaged
takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

This follows on from the "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com/
series of 23 posted on 2023-05-09,
and the "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com/
series of 31 posted on 2023-05-21.

Those two series were "independent": neither depends on the other for build
or correctness, but both are needed before this third one can safely make
its effective changes.  I'll send v2 of those two series in a couple of
days, incorporating the Acks and Reviewed-bys and the minor fixes.

What is it all about?  Some mmap_lock avoidance, i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is currently limited by the
mmap_write_lock it requires.
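
For reference, MADV_COLLAPSE is driven from userspace roughly like this
(a minimal sketch, assuming a libc whose <sys/mman.h> already defines
MADV_COLLAPSE; the helper name, mapping and length are illustrative):

        #include <sys/mman.h>
        #include <stdio.h>

        /* Ask the kernel to collapse an existing shmem/file range to THPs. */
        static int collapse_range(void *addr, size_t len)
        {
                if (madvise(addr, len, MADV_COLLAPSE)) {
                        perror("madvise(MADV_COLLAPSE)");
                        return -1;
                }
                return 0;
        }

After this series, the shmem/file side of that call should no longer need
mmap_write_lock() (see 10/12, collapse_pte_mapped_thp() with mmap_read_lock()).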

These changes are likely to be relied upon later in other contexts, e.g.
freeing of empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on the preceding two series over v6.4-rc3, but good over
v6.4-rc[1-4], current mm-everything or current linux-next.

01/12 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
02/12 mm/pgtable: add PAE safety to __pte_offset_map()
03/12 arm: adjust_pte() use pte_offset_map_nolock()
04/12 powerpc: assert_pte_locked() use pte_offset_map_nolock()
05/12 powerpc: add pte_free_defer() for pgtables sharing page
06/12 sparc: add pte_free_defer() for pgtables sharing page
07/12 s390: add pte_free_defer(), with use of mmdrop_async()
08/12 mm/pgtable: add pte_free_defer() for pgtable as page
09/12 mm/khugepaged: retract_page_tables() without mmap or vma lock
10/12 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
11/12 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
12/12 mm: delete mmap_write_trylock() and vma_try_start_write()

 arch/arm/mm/fault-armv.c            |   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c      |  18 ++
 arch/powerpc/mm/pgtable.c           |  16 +-
 arch/s390/include/asm/pgalloc.h     |   4 +
 arch/s390/mm/pgalloc.c              |  34 +++
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c             |  16 ++
 include/linux/mm.h                  |  17 --
 include/linux/mm_types.h            |   2 +-
 include/linux/mmap_lock.h           |  10 -
 include/linux/pgtable.h             |   6 +-
 include/linux/sched/mm.h            |   1 +
 kernel/fork.c                       |   2 +-
 mm/khugepaged.c                     | 425 ++++++++----------------------
 mm/pgtable-generic.c                |  44 +++-
 16 files changed, 253 insertions(+), 353 deletions(-)

Hugh

* [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
From: Hugh Dickins @ 2023-05-29  6:14 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
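
For illustration, the pairing that every caller must now keep balanced
looks like this (a schematic caller with a made-up name, not code from
this series):

        static int scan_one_pmd(pmd_t *pmd, unsigned long addr)
        {
                pte_t *pte;

                pte = pte_offset_map(pmd, addr);  /* takes rcu_read_lock() */
                if (!pte)
                        return 0;   /* pmd changed: no page table to walk */
                /* ... lockless examination of the pte entries ... */
                pte_unmap(pte);     /* pairs with the map: rcu_read_unlock() */
                return 1;
        }

An unbalanced pte_offset_map() would now leave an RCU read-side critical
section open, which is why the known imbalances had to be fixed first.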

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a1326e61d7ee..8b0fc7fdc46f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte)	do {	\
 	kunmap_local((pte));	\
-	/* rcu_read_unlock() to be added later */	\
+	rcu_read_unlock();	\
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 }
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c7ab18a5fb77..674671835631 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
 	pmd_t pmdval;
 
-	/* rcu_read_lock() to be added later */
+	rcu_read_lock();
 	pmdval = pmdp_get_lockless(pmd);
 	if (pmdvalp)
 		*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	}
 	return __pte_map(&pmdval, addr);
 nomap:
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 	return NULL;
 }
 
-- 
2.35.3


* [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()
From: Hugh Dickins @ 2023-05-29  6:16 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
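
To make that concrete, the lockless read of a split 64-bit entry follows
this low/high retry idiom (a simplified sketch with a made-up struct and
helper, not the kernel's exact pmdp_get_lockless()):

        struct split_pmd { u32 pmd_low, pmd_high; };    /* illustrative only */

        static u64 pmd_read_split(struct split_pmd *pmdp)
        {
                u32 low, high;

                do {
                        low  = READ_ONCE(pmdp->pmd_low);
                        smp_rmb();
                        high = READ_ONCE(pmdp->pmd_high);
                        smp_rmb();
                } while (low != READ_ONCE(pmdp->pmd_low));

                return ((u64)high << 32) | low;
        }

The recheck of the low word cannot guarantee that high still belongs with
low across present-to-present updates of the pmd; blocking the TLB flush
between such updates, by disabling interrupts as below, does guarantee that
a successful read returns matched halves.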

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock would just send it back to __pte_offset_map() again.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t is 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but that would need
a little more work: retrying if pmd_low is good for a page table but pmd_high
is non-zero from a THP (and that might be making x86-specific assumptions).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/pgtable-generic.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..d28b63386cef 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,32 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+#define config_might_irq_save(flags)	local_irq_save(flags)
+#define config_might_irq_restore(flags)	local_irq_restore(flags)
+#else
+#define config_might_irq_save(flags)
+#define config_might_irq_restore(flags)
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long __maybe_unused flags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	config_might_irq_save(flags);
 	pmdval = pmdp_get_lockless(pmd);
+	config_might_irq_restore(flags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3


* [PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock()
From: Hugh Dickins @ 2023-05-29  6:17 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.
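
In general the new helper is used like this (a schematic pattern with a
made-up function name; adjust_pte() itself open-codes the locking, since
it may need the nested lock class):

        static int walk_one_pmd(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
        {
                spinlock_t *ptl;
                pte_t *pte;

                pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
                if (!pte)
                        return 0;       /* pmd changed: no page table here */
                spin_lock(ptl);         /* ptl matches precisely this pte */
                /* ... examine or modify *pte ... */
                spin_unlock(ptl);
                pte_unmap(pte);
                return 1;
        }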

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
 	if (!pte)
 		return 0;
 
-	ptl = pte_lockptr(vma->vm_mm, pmd);
 	do_pte_lock(ptl);
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3


* [PATCH 04/12] powerpc: assert_pte_locked() use pte_offset_map_nolock()
From: Hugh Dickins @ 2023-05-29  6:18 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions); but wouldn't we want
to know if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes, which would either expose my ignorance,
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/mm/pgtable.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (mm == &init_mm)
 		return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
-	/*
-	 * khugepaged to collapse normal pages to hugepage, first set
-	 * pmd to none to force page fault/gup to take mmap_lock. After
-	 * pmd is set to none, we do a pte_clear which does this assertion
-	 * so if we find pmd none, return.
-	 */
-	if (pmd_none(*pmd))
-		return;
-	BUG_ON(!pmd_present(*pmd));
-	assert_spin_locked(pte_lockptr(mm, pmd));
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	BUG_ON(!pte);
+	assert_spin_locked(ptl);
+	pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
2.35.3


* [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
From: Hugh Dickins @ 2023-05-29  6:20 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.
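
For context, the intended caller pattern is roughly this (a sketch of how
retract_page_tables() will use it in 09/12, not the exact code from there):

        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);  /* pmd cleared, TLB flushed */
        /* ... drop the pmd lock, end the mmu notifier range ... */
        pte_free_defer(mm, pmd_pgtable(pgt_pmd));  /* pte_free() after an RCU grace period */

The deferral matters because a lockless walker may still be reading the old
page table under rcu_read_lock(): call_rcu() ensures its page is not reused
until such readers have finished.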

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..3a3dac77faf2 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,21 @@ void pte_fragment_free(unsigned long *table, int kernel)
 		__free_page(page);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_fragment_free((unsigned long *)page_to_virt(page), 0);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3


* [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page
From: Hugh Dickins @ 2023-05-29  6:21 UTC
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/sparc/include/asm/pgalloc_64.h |  4 ++++
 arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..b7c6aa085ef6 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_to_virt(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
@ 2023-05-29  6:22   ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than the others, because
page_table_free() needs to know which fragment is being freed, and
which mm to link it to.

page_table_free()'s fragment handling is clever, but I could too easily
break it: what's done here in pte_free_defer() and pte_free_now() might
be better integrated with page_table_free()'s cleverness, but not by me!

By the time page_table_free() gets called via RCU, the mm might already
have been freed: so take an mmgrab() reference in pte_free_defer(), and
drop it again in pte_free_now().  But an RCU callback is not a good
context for mmdrop(), so make mmdrop_async() public and use that
instead.
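
[ Editorial sketch, not part of the patch: the low-bit encoding used by
  pte_free_defer() and pte_free_now() in the diff below, shown as a
  standalone userspace program with illustrative names and values.  A
  struct mm_struct pointer is at least 8-byte aligned, so the 2K
  fragment index (0 or 1, though alignment would allow up to 7) fits in
  its low bits. ]

#include <assert.h>
#include <stdlib.h>

int main(void)
{
	void *mm = aligned_alloc(8, 64);	/* stand-in for an mm pointer */
	unsigned long frag = 1;			/* second 2K fragment of the 4K page */
	unsigned long mm_bit = (unsigned long)mm + frag;	/* encode */

	assert(((unsigned long)mm & 7) == 0);	/* low 3 bits are free */
	assert((void *)(mm_bit & ~7UL) == mm);	/* decode the mm pointer */
	assert((mm_bit & 7) == frag);		/* decode the fragment index */
	free(mm);
	return 0;
}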

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/include/asm/pgalloc.h |  4 ++++
 arch/s390/mm/pgalloc.c          | 34 +++++++++++++++++++++++++++++++++
 include/linux/mm_types.h        |  2 +-
 include/linux/sched/mm.h        |  1 +
 kernel/fork.c                   |  2 +-
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..0129de9addfd 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -346,6 +346,40 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+	unsigned long mm_bit;
+	struct mm_struct *mm;
+	unsigned long *table;
+
+	page = container_of(head, struct page, rcu_head);
+	table = (unsigned long *)page_to_virt(page);
+	mm_bit = (unsigned long)page->pt_mm;
+	/* 4K page has only two 2K fragments, but alignment allows eight */
+	mm = (struct mm_struct *)(mm_bit & ~7);
+	table += PTRS_PER_PTE * (mm_bit & 7);
+	page_table_free(mm, table);
+	mmdrop_async(mm);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+	unsigned long mm_bit;
+
+	mmgrab(mm);
+	page = virt_to_page(pgtable);
+	/* Which 2K page table fragment of a 4K page? */
+	mm_bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+			(PTRS_PER_PTE * sizeof(pte_t));
+	mm_bit += (unsigned long)mm;
+	page->pt_mm = (struct mm_struct *)mm_bit;
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 			 unsigned long vmaddr)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
-				struct mm_struct *pt_mm; /* x86 pgds only */
+				struct mm_struct *pt_mm; /* x86 pgd, s390 */
 				atomic_t pt_frag_refcount; /* powerpc */
 			};
 #if ALLOC_SPLIT_PTLOCKS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8d89c8c4fac1..a9043d1a0d55 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -41,6 +41,7 @@ static inline void smp_mb__after_mmgrab(void)
 	smp_mb__after_atomic();
 }
 
+extern void mmdrop_async(struct mm_struct *mm);
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fa4486b65c56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -942,7 +942,7 @@ static void mmdrop_async_fn(struct work_struct *work)
 	__mmdrop(mm);
 }
 
-static void mmdrop_async(struct mm_struct *mm)
+void mmdrop_async(struct mm_struct *mm)
 {
 	if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
 		INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page
@ 2023-05-29  6:23   ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg passed to them).
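
[ Editorial sketch, not part of the patch: the intended calling pattern,
  simplified from khugepaged (see 09/12 for the real caller, which also
  takes pmd_lock and the nested pte lock before clearing the pmd).  The
  helper name is illustrative. ]

static void clear_pmd_and_defer_free(struct vm_area_struct *vma,
				     unsigned long addr, pmd_t *pmd)
{
	struct mm_struct *mm = vma->vm_mm;
	pmd_t pgt_pmd;

	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* pmd now none */
	mm_dec_nr_ptes(mm);
	/* lockless walkers may still be reading the old page table */
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));	/* not pte_free() */
}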

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pgtable.h |  2 ++
 mm/pgtable-generic.c    | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b0fc7fdc46f..62a8732d92f0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d28b63386cef..471697dcb244 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = pgtable;
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
@ 2023-05-29  6:25   ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.
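
[ Editorial sketch, not part of the patch: the kind of user those
  spinlocks exclude.  A page_vma_mapped_walk()-style walker serializes
  on the pte lock (or the pmd lock for THPs) before trusting the page
  table, so briefly holding both across the pmd transition is enough;
  the function name below is illustrative. ]

static bool walker_sees_present_pte(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;
	bool ret;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)		/* pmd no longer points to a page table */
		return false;
	ret = pte_present(ptep_get(pte));
	pte_unmap_unlock(pte, ptl);
	return ret;
}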

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that does raise some
questions, and requires a more complicated pte_free_defer() for powerpc
(when its arch_needs_pgtable_deposit() for shmem and file THPs).  Leave
that enhancement to a later release.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 169 +++++++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..4fd408154692 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		break;
 	case SCAN_PMD_NONE:
 		/*
-		 * In MADV_COLLAPSE path, possible race with khugepaged where
-		 * all pte entries have been removed and pmd cleared.  If so,
-		 * skip all the pte checks and just update the pmd mapping.
+		 * All pte entries have been removed and pmd cleared.
+		 * Skip all the pte checks and just update the pmd mapping.
 		 */
 		goto maybe_install_pmd;
 	default:
@@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 	mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
+		 * got written to. These VMAs are likely not worth removing
+		 * page tables from, as PMD-mapping is likely to be split later.
 		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably require locking
-		 * the anon_vma.
+		 * Note that vma->anon_vma check is racy: it can be set after
+		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
+		 * prevented establishing new ptes of the page. So we are safe
+		 * to remove page table below, without even checking it's empty.
 		 */
-		if (READ_ONCE(vma->anon_vma)) {
-			result = SCAN_PAGE_ANON;
-			goto next;
-		}
+		if (READ_ONCE(vma->anon_vma))
+			continue;
+
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
-		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
-			result = SCAN_VMA_CHECK;
-			goto next;
-		}
-		mm = vma->vm_mm;
-		is_target = mm == target_mm && addr == target_addr;
-		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-		if (result != SCAN_SUCCEED)
-			goto next;
-		/*
-		 * We need exclusive mmap_lock to retract page table.
-		 *
-		 * We use trylock due to lock inversion: we need to acquire
-		 * mmap_lock while holding page lock. Fault path does it in
-		 * reverse order. Trylock is a way to avoid deadlock.
-		 *
-		 * Also, it's not MADV_COLLAPSE's job to collapse other
-		 * mappings - let khugepaged take care of them later.
-		 */
-		result = SCAN_PTE_MAPPED_HUGEPAGE;
-		if ((cc->is_khugepaged || is_target) &&
-		    mmap_write_trylock(mm)) {
-			/* trylock for the same lock inversion as above */
-			if (!vma_try_start_write(vma))
-				goto unlock_next;
-
-			/*
-			 * Re-check whether we have an ->anon_vma, because
-			 * collapse_and_free_pmd() requires that either no
-			 * ->anon_vma exists or the anon_vma is locked.
-			 * We already checked ->anon_vma above, but that check
-			 * is racy because ->anon_vma can be populated under the
-			 * mmap lock in read mode.
-			 */
-			if (vma->anon_vma) {
-				result = SCAN_PAGE_ANON;
-				goto unlock_next;
-			}
-			/*
-			 * When a vma is registered with uffd-wp, we can't
-			 * recycle the pmd pgtable because there can be pte
-			 * markers installed.  Skip it only, so the rest mm/vma
-			 * can still have the same file mapped hugely, however
-			 * it'll always mapped in small page size for uffd-wp
-			 * registered ranges.
-			 */
-			if (hpage_collapse_test_exit(mm)) {
-				result = SCAN_ANY_PROCESS;
-				goto unlock_next;
-			}
-			if (userfaultfd_wp(vma)) {
-				result = SCAN_PTE_UFFD_WP;
-				goto unlock_next;
-			}
-			collapse_and_free_pmd(mm, vma, addr, pmd);
-			if (!cc->is_khugepaged && is_target)
-				result = set_huge_pmd(vma, addr, pmd, hpage);
-			else
-				result = SCAN_SUCCEED;
-
-unlock_next:
-			mmap_write_unlock(mm);
-			goto next;
-		}
-		/*
-		 * Calling context will handle target mm/addr. Otherwise, let
-		 * khugepaged try again later.
-		 */
-		if (!is_target) {
-			khugepaged_add_pte_mapped_thp(mm, addr);
+		    vma->vm_end < addr + HPAGE_PMD_SIZE)
 			continue;
-		}
-next:
-		if (is_target)
-			target_result = result;
+
+		mm = vma->vm_mm;
+		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+			continue;
+
+		if (hpage_collapse_test_exit(mm))
+			continue;
+		/*
+		 * When a vma is registered with uffd-wp, we cannot recycle
+		 * the page table because there may be pte markers installed.
+		 * Other vmas can still have the same file mapped hugely, but
+		 * skip this one: it will always be mapped in small page size
+		 * for uffd-wp registered ranges.
+		 *
+		 * What if VM_UFFD_WP is set a moment after this check?  No
+		 * problem, huge page lock is still held, stopping new mappings
+		 * of page which might then get replaced by pte markers: only
+		 * existing markers need to be protected here.  (We could check
+		 * after getting ptl below, but this comment would be distracting there!)
+		 */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* Huge page lock is still held, so page table must be empty */
+		pml = pmd_lock(mm, pmd);
+		ptl = pte_lockptr(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+		if (ptl != pml)
+			spin_unlock(ptl);
+		spin_unlock(pml);
+
+		mm_dec_nr_ptes(mm);
+		page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+		pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 	}
-	i_mmap_unlock_write(mapping);
-	return target_result;
+	i_mmap_unlock_read(mapping);
 }
 
 /**
@@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
+	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
 	 */
-	result = retract_page_tables(mapping, start, mm, addr, hpage,
-				     cc);
+	retract_page_tables(mapping, start);
+	if (cc && !cc->is_khugepaged)
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
 	unlock_page(hpage);
 
 	/*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
@ 2023-05-29  6:25   ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Peter Xu, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Russell King, Matthew Wilcox, Steven Price,
	Christoph Hellwig, Jason Gunthorpe, Aneesh Kumar K.V,
	Axel Rasmussen, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Anshuman Khandual,
	Heiko Ca rstens, Qi Zheng, Suren Baghdasaryan, linux-arm-kernel,
	SeongJae Park, Jann Horn, linux-mm, linuxppc-dev,
	Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, Minchan Kim,
	Mike Rapoport, Mel Gorman, David S. Miller, Zack Rusin,
	Mike Kravetz

Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that does raise some
questions, and requires a more complicated pte_free_defer() for powerpc
(when its arch_needs_pgtable_deposit() for shmem and file THPs).  Leave
that enhancement to a later release.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 169 +++++++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..4fd408154692 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		break;
 	case SCAN_PMD_NONE:
 		/*
-		 * In MADV_COLLAPSE path, possible race with khugepaged where
-		 * all pte entries have been removed and pmd cleared.  If so,
-		 * skip all the pte checks and just update the pmd mapping.
+		 * All pte entries have been removed and pmd cleared.
+		 * Skip all the pte checks and just update the pmd mapping.
 		 */
 		goto maybe_install_pmd;
 	default:
@@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 	mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
+		 * got written to. These VMAs are likely not worth removing
+		 * page tables from, as PMD-mapping is likely to be split later.
 		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably require locking
-		 * the anon_vma.
+		 * Note that vma->anon_vma check is racy: it can be set after
+		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
+		 * prevented establishing new ptes of the page. So we are safe
+		 * to remove page table below, without even checking it's empty.
 		 */
-		if (READ_ONCE(vma->anon_vma)) {
-			result = SCAN_PAGE_ANON;
-			goto next;
-		}
+		if (READ_ONCE(vma->anon_vma))
+			continue;
+
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
-		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
-			result = SCAN_VMA_CHECK;
-			goto next;
-		}
-		mm = vma->vm_mm;
-		is_target = mm == target_mm && addr == target_addr;
-		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-		if (result != SCAN_SUCCEED)
-			goto next;
-		/*
-		 * We need exclusive mmap_lock to retract page table.
-		 *
-		 * We use trylock due to lock inversion: we need to acquire
-		 * mmap_lock while holding page lock. Fault path does it in
-		 * reverse order. Trylock is a way to avoid deadlock.
-		 *
-		 * Also, it's not MADV_COLLAPSE's job to collapse other
-		 * mappings - let khugepaged take care of them later.
-		 */
-		result = SCAN_PTE_MAPPED_HUGEPAGE;
-		if ((cc->is_khugepaged || is_target) &&
-		    mmap_write_trylock(mm)) {
-			/* trylock for the same lock inversion as above */
-			if (!vma_try_start_write(vma))
-				goto unlock_next;
-
-			/*
-			 * Re-check whether we have an ->anon_vma, because
-			 * collapse_and_free_pmd() requires that either no
-			 * ->anon_vma exists or the anon_vma is locked.
-			 * We already checked ->anon_vma above, but that check
-			 * is racy because ->anon_vma can be populated under the
-			 * mmap lock in read mode.
-			 */
-			if (vma->anon_vma) {
-				result = SCAN_PAGE_ANON;
-				goto unlock_next;
-			}
-			/*
-			 * When a vma is registered with uffd-wp, we can't
-			 * recycle the pmd pgtable because there can be pte
-			 * markers installed.  Skip it only, so the rest mm/vma
-			 * can still have the same file mapped hugely, however
-			 * it'll always mapped in small page size for uffd-wp
-			 * registered ranges.
-			 */
-			if (hpage_collapse_test_exit(mm)) {
-				result = SCAN_ANY_PROCESS;
-				goto unlock_next;
-			}
-			if (userfaultfd_wp(vma)) {
-				result = SCAN_PTE_UFFD_WP;
-				goto unlock_next;
-			}
-			collapse_and_free_pmd(mm, vma, addr, pmd);
-			if (!cc->is_khugepaged && is_target)
-				result = set_huge_pmd(vma, addr, pmd, hpage);
-			else
-				result = SCAN_SUCCEED;
-
-unlock_next:
-			mmap_write_unlock(mm);
-			goto next;
-		}
-		/*
-		 * Calling context will handle target mm/addr. Otherwise, let
-		 * khugepaged try again later.
-		 */
-		if (!is_target) {
-			khugepaged_add_pte_mapped_thp(mm, addr);
+		    vma->vm_end < addr + HPAGE_PMD_SIZE)
 			continue;
-		}
-next:
-		if (is_target)
-			target_result = result;
+
+		mm = vma->vm_mm;
+		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+			continue;
+
+		if (hpage_collapse_test_exit(mm))
+			continue;
+		/*
+		 * When a vma is registered with uffd-wp, we cannot recycle
+		 * the page table because there may be pte markers installed.
+		 * Other vmas can still have the same file mapped hugely, but
+		 * skip this one: it will always be mapped in small page size
+		 * for uffd-wp registered ranges.
+		 *
+		 * What if VM_UFFD_WP is set a moment after this check?  No
+		 * problem, huge page lock is still held, stopping new mappings
+		 * of page which might then get replaced by pte markers: only
+		 * existing markers need to be protected here.  (We could check
+		 * after getting ptl below, but this comment distracting there!)
+		 */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* Huge page lock is still held, so page table must be empty */
+		pml = pmd_lock(mm, pmd);
+		ptl = pte_lockptr(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+		if (ptl != pml)
+			spin_unlock(ptl);
+		spin_unlock(pml);
+
+		mm_dec_nr_ptes(mm);
+		page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+		pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 	}
-	i_mmap_unlock_write(mapping);
-	return target_result;
+	i_mmap_unlock_read(mapping);
 }
 
 /**
@@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
+	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
 	 */
-	result = retract_page_tables(mapping, start, mm, addr, hpage,
-				     cc);
+	retract_page_tables(mapping, start);
+	if (cc && !cc->is_khugepaged)
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
 	unlock_page(hpage);
 
 	/*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-05-29  6:11 ` Hugh Dickins
  (?)
@ 2023-05-29  6:26   ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes the need for tlb_remove_table_sync_one() here.

Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Some cleanup while rearranging: rename "count" to "nr_ptes";
and "step 2" does not need to duplicate the checks in "step 1".

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 131 +++++++++++++++---------------------------------
 1 file changed, 41 insertions(+), 90 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4fd408154692..2999500abdd5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1485,7 +1485,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
 {
@@ -1497,7 +1497,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	mmap_assert_locked(vma->vm_mm);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1506,48 +1506,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to another
- *    page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	struct mmu_notifier_range range;
-
-	mmap_assert_write_locked(mm);
-	if (vma->vm_file)
-		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-	/*
-	 * All anon_vmas attached to the VMA have the same root and are
-	 * therefore locked by the same lock.
-	 */
-	if (vma->anon_vma)
-		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + HPAGE_PMD_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	tlb_remove_table_sync_one();
-	mmu_notifier_invalidate_range_end(&range);
-	mm_dec_nr_ptes(mm);
-	page_table_check_pte_clear_range(mm, addr, pmd);
-	pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1563,16 +1521,17 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	struct mmu_notifier_range range;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd;
-	spinlock_t *ptl;
-	int count = 0, result = SCAN_FAIL;
+	pmd_t *pmd, pgt_pmd;
+	spinlock_t *pml, *ptl;
+	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
-	mmap_assert_write_locked(mm);
+	mmap_assert_locked(mm);
 
 	/* Fast check before locking page if already PMD-mapped */
 	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
@@ -1612,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
@@ -1625,27 +1585,14 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
-	/* Lock the vma before taking i_mmap and page table locks */
-	vma_start_write(vma);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
 
-	/*
-	 * We need to lock the mapping so that from here on, only GUP-fast and
-	 * hardware page walks can access the parts of the page tables that
-	 * we're operating on.
-	 * See collapse_and_free_pmd().
-	 */
-	i_mmap_lock_write(vma->vm_file->f_mapping);
-
-	/*
-	 * This spinlock should be unnecessary: Nobody else should be accessing
-	 * the page tables under spinlock protection here, only
-	 * lockless_pages_from_mm() and the hardware page walker can access page
-	 * tables while all the high-level locks are held in write mode.
-	 */
 	result = SCAN_FAIL;
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)
-		goto drop_immap;
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto abort;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1671,40 +1618,44 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		 */
 		if (hpage + i != page)
 			goto abort;
-		count++;
+		nr_ptes++;
 	}
 
-	/* step 2: adjust rmap */
+	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
-		struct page *page;
-
 		if (pte_none(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
-			goto abort;
-		page_remove_rmap(page, vma, false);
+
+		/* Must clear entry, or a racing truncate may re-remove it */
+		pte_clear(mm, addr, pte);
+		page_remove_rmap(hpage + i, vma, false);
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
-	if (count) {
-		page_ref_sub(hpage, count);
-		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	if (nr_ptes) {
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove pte entries */
-	/* we make no change to anon, but protect concurrent anon page lookup */
-	if (vma->anon_vma)
-		anon_vma_lock_write(vma->anon_vma);
+	/* step 4: remove page table */
 
-	collapse_and_free_pmd(mm, vma, haddr, pmd);
+	/* Huge page lock is still held, so page table must remain empty */
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
 
-	if (vma->anon_vma)
-		anon_vma_unlock_write(vma->anon_vma);
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	mmu_notifier_invalidate_range_end(&range);
+
+	mm_dec_nr_ptes(mm);
+	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 
 maybe_install_pmd:
 	/* step 5: install pmd entry */
@@ -1718,9 +1669,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	return result;
 
 abort:
-	pte_unmap_unlock(start_pte, ptl);
-drop_immap:
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	mmu_notifier_invalidate_range_end(&range);
 	goto drop_hpage;
 }
 
@@ -2842,9 +2793,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		case SCAN_PTE_MAPPED_HUGEPAGE:
 			BUG_ON(mmap_locked);
 			BUG_ON(*prev);
-			mmap_write_lock(mm);
+			mmap_read_lock(mm);
 			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_write_unlock(mm);
+			mmap_locked = true;
 			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
  2023-05-29  6:11 ` Hugh Dickins
  (?)
@ 2023-05-29  6:28   ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).
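
As a sketch of that direct call, simplified from the hunk below (addr
stands in for khugepaged_scan.address, and the comments are added here
for explanation):

	*result = hpage_collapse_scan_file(mm, addr, file, pgoff, cc);
	if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
		mmap_read_lock(mm);		/* read lock now suffices */
		mmap_locked = true;
		if (hpage_collapse_test_exit(mm)) {
			fput(file);		/* mm is exiting: give up */
			goto breakouterloop;
		}
		*result = collapse_pte_mapped_thp(mm, addr, false);
		if (*result == SCAN_PMD_MAPPED)	/* already huge-mapped */
			*result = SCAN_SUCCEED;
	}
	fput(file);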

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 125 +++++++-----------------------------------------
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2999500abdd5..301c0e54a2ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
 	struct mm_slot slot;
-
-	/* pte-mapped THP in this mm */
-	int nr_pte_mapped_thp;
-	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1441,50 +1433,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- *     emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- *     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- *     (at virtual address X) and adds an entry (for X) into mm_struct A's
- *     ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- *     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- *     (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	struct khugepaged_mm_slot *mm_slot;
-	struct mm_slot *slot;
-	bool ret = false;
-
-	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-	spin_lock(&khugepaged_mm_lock);
-	slot = mm_slot_lookup(mm_slots_hash, mm);
-	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
-		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-		ret = true;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-	return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
@@ -1675,29 +1623,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	goto drop_hpage;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -2326,16 +2251,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 {
 	BUILD_BUG();
 }
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	return false;
-}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2365,7 +2280,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
-	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = slot->mm;
 	/*
@@ -2418,36 +2332,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				*result = hpage_collapse_scan_file(mm,
-								   khugepaged_scan.address,
-								   file, pgoff, cc);
 				mmap_locked = false;
+				*result = hpage_collapse_scan_file(mm,
+					khugepaged_scan.address, file, pgoff, cc);
+				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+					mmap_read_lock(mm);
+					mmap_locked = true;
+					if (hpage_collapse_test_exit(mm)) {
+						fput(file);
+						goto breakouterloop;
+					}
+					*result = collapse_pte_mapped_thp(mm,
+						khugepaged_scan.address, false);
+					if (*result == SCAN_PMD_MAPPED)
+						*result = SCAN_SUCCEED;
+				}
 				fput(file);
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-								  khugepaged_scan.address,
-								  &mmap_locked,
-								  cc);
+					khugepaged_scan.address, &mmap_locked, cc);
 			}
-			switch (*result) {
-			case SCAN_PTE_MAPPED_HUGEPAGE: {
-				pmd_t *pmd;
 
-				*result = find_pmd_or_thp_or_none(mm,
-								  khugepaged_scan.address,
-								  &pmd);
-				if (*result != SCAN_SUCCEED)
-					break;
-				if (!khugepaged_add_pte_mapped_thp(mm,
-								   khugepaged_scan.address))
-					break;
-			} fallthrough;
-			case SCAN_SUCCEED:
+			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
-				break;
-			default:
-				break;
-			}
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
@ 2023-05-29  6:28   ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 125 +++++++-----------------------------------------
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2999500abdd5..301c0e54a2ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
 	struct mm_slot slot;
-
-	/* pte-mapped THP in this mm */
-	int nr_pte_mapped_thp;
-	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1441,50 +1433,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- *     emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- *     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- *     (at virtual address X) and adds an entry (for X) into mm_struct A's
- *     ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- *     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- *     (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	struct khugepaged_mm_slot *mm_slot;
-	struct mm_slot *slot;
-	bool ret = false;
-
-	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-	spin_lock(&khugepaged_mm_lock);
-	slot = mm_slot_lookup(mm_slots_hash, mm);
-	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
-		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-		ret = true;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-	return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
@@ -1675,29 +1623,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	goto drop_hpage;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -2326,16 +2251,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 {
 	BUILD_BUG();
 }
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	return false;
-}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2365,7 +2280,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
-	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = slot->mm;
 	/*
@@ -2418,36 +2332,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				*result = hpage_collapse_scan_file(mm,
-								   khugepaged_scan.address,
-								   file, pgoff, cc);
 				mmap_locked = false;
+				*result = hpage_collapse_scan_file(mm,
+					khugepaged_scan.address, file, pgoff, cc);
+				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+					mmap_read_lock(mm);
+					mmap_locked = true;
+					if (hpage_collapse_test_exit(mm)) {
+						fput(file);
+						goto breakouterloop;
+					}
+					*result = collapse_pte_mapped_thp(mm,
+						khugepaged_scan.address, false);
+					if (*result == SCAN_PMD_MAPPED)
+						*result = SCAN_SUCCEED;
+				}
 				fput(file);
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-								  khugepaged_scan.address,
-								  &mmap_locked,
-								  cc);
+					khugepaged_scan.address, &mmap_locked, cc);
 			}
-			switch (*result) {
-			case SCAN_PTE_MAPPED_HUGEPAGE: {
-				pmd_t *pmd;
 
-				*result = find_pmd_or_thp_or_none(mm,
-								  khugepaged_scan.address,
-								  &pmd);
-				if (*result != SCAN_SUCCEED)
-					break;
-				if (!khugepaged_add_pte_mapped_thp(mm,
-								   khugepaged_scan.address))
-					break;
-			} fallthrough;
-			case SCAN_SUCCEED:
+			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
-				break;
-			default:
-				break;
-			}
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 158+ messages in thread

* [PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write()
  2023-05-29  6:11 ` Hugh Dickins
  (?)
@ 2023-05-29  6:30   ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29  6:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c2e56980853..9b24f8fbf899 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,21 +690,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -730,8 +715,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()
  2023-05-29  6:16   ` Hugh Dickins
  (?)
@ 2023-05-29 13:56     ` Matthew Wilcox
  -1 siblings, 0 replies; 158+ messages in thread
From: Matthew Wilcox @ 2023-05-29 13:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> +#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
> +	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
> +/*
> + * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
> + * the barriers in pmdp_get_lockless() cannot guarantee that the value in
> + * pmd_high actually belongs with the value in pmd_low; but holding interrupts
> + * off blocks the TLB flush between present updates, which guarantees that a
> + * successful __pte_offset_map() points to a page from matched halves.
> + */
> +#define config_might_irq_save(flags)	local_irq_save(flags)
> +#define config_might_irq_restore(flags)	local_irq_restore(flags)
> +#else
> +#define config_might_irq_save(flags)
> +#define config_might_irq_restore(flags)

I don't like the name.  It should indicate that it's PMD-related, so
pmd_read_start(flags) / pmd_read_end(flags)?
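
For concreteness, a minimal sketch of what that rename might look like
(illustrative only, reusing the config gating and empty-stub style of the
quoted hunk):

#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
#define pmd_read_start(flags)	local_irq_save(flags)
#define pmd_read_end(flags)	local_irq_restore(flags)
#else
#define pmd_read_start(flags)
#define pmd_read_end(flags)
#endif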


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-05-29  6:20   ` Hugh Dickins
  (?)
@ 2023-05-29 14:02     ` Matthew Wilcox
  -1 siblings, 0 replies; 158+ messages in thread
From: Matthew Wilcox @ 2023-05-29 14:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}

This can't be safe (on ppc).  IIRC you might have up to 16x4k page
tables sharing one 64kB page.  So if you have two page tables from the
same page being defer-freed simultaneously, you'll reuse the rcu_head
and I cannot imagine things go well from that point.
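
To spell out the hazard (an illustrative sketch only: pgtable_a and pgtable_b
stand for two page tables allocated from the same 64kB page):

	pte_free_defer(mm, pgtable_a);	/* queues page->rcu_head */
	pte_free_defer(mm, pgtable_b);	/* queues the same page->rcu_head
					 * again, while it is still pending */

struct page has just the one rcu_head, and re-queuing an rcu_head that is
already on RCU's callback list corrupts the list.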

I have no idea how to solve this problem.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-05-29 14:02     ` Matthew Wilcox
  (?)
@ 2023-05-29 14:36       ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-29 14:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, David Hildenbrand, Suren Baghdasaryan,
	Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra,
	Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Mon, 29 May 2023, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	call_rcu(&page->rcu_head, pte_free_now);
> > +}
> 
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.

Oh yes, of course, thanks for catching that so quickly.
So my s390 and sparc implementations will be equally broken.

> 
> I have no idea how to solve this problem.

I do: I'll have to go back to the more complicated implementation we
actually ran with on powerpc - I was thinking those complications just
related to deposit/withdraw matters, forgetting the one-rcu_head issue.

It uses large (0x10000) increments of the page refcount, avoiding
call_rcu() when already active.
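
For illustration only (this is not the actual powerpc implementation, and
PTE_FREE_DEFERRED is an assumed constant, chosen well above any plausible
page refcount), the defer side of that scheme looks roughly like:

#define PTE_FREE_DEFERRED	0x10000

static void pte_free_now(struct rcu_head *head);	/* RCU callback */

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
	struct page *page = virt_to_page(pgtable);

	/*
	 * Each deferred fragment adds a large bias to the page refcount;
	 * only the first deferral on this page queues the RCU callback,
	 * so the single rcu_head is never queued twice.
	 */
	if (atomic_add_return(PTE_FREE_DEFERRED, &page->_refcount) <
	    2 * PTE_FREE_DEFERRED)
		call_rcu(&page->rcu_head, pte_free_now);
}

The complication is then in pte_free_now(), which has to drop all of the
accumulated biases after the grace period, and free whichever fragments
(or the whole page) they were standing in for.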

It's not a complication I had wanted to explain or test for now,
but we shall have to.  It should apply equally well to sparc, but s390
is more of a problem, since s390 already has its own refcount cleverness.

Thanks, I must dash, out much of the day.

Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-05-29  6:25   ` Hugh Dickins
  (?)
@ 2023-05-29 23:26     ` Peter Xu
  -1 siblings, 0 replies; 158+ messages in thread
From: Peter Xu @ 2023-05-29 23:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote:
> Simplify shmem and file THP collapse's retract_page_tables(), and relax
> its locking: to improve its success rate and to lessen impact on others.
> 
> Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
> target_mm, leave that part of the work to madvise_collapse() calling
> collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
> result code to arrange for that.  That spares retract_page_tables() four
> arguments; and since it will be successful in retracting all of the page
> tables expected of it, no need to track and return a result code itself.
> 
> It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
> but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
> allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
> THPs.  retract_page_tables() just needs to use those same spinlocks to
> exclude it briefly, while transitioning pmd from page table to none: so
> restore its use of pmd_lock() inside of which pte lock is nested.
> 
> Users of pte_offset_map_lock() etc all now allow for them to fail:
> so retract_page_tables() now has no use for mmap_write_trylock() or
> vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
> it does not even need the mmap_read_lock().
> 
> But those users do expect the page table to remain a good page table,
> until they unlock and rcu_read_unlock(): so the page table cannot be
> freed immediately, but rather by the recently added pte_free_defer().
> 
> retract_page_tables() can be enhanced to replace_page_tables(), which
> inserts the final huge pmd without mmap lock: going through an invalid
> state instead of pmd_none() followed by fault.  But that does raise some
> questions, and requires a more complicated pte_free_defer() for powerpc
> (when its arch_needs_pgtable_deposit() for shmem and file THPs).  Leave
> that enhancement to a later release.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/khugepaged.c | 169 +++++++++++++++++-------------------------------
>  1 file changed, 60 insertions(+), 109 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1083f0e38a07..4fd408154692 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		break;
>  	case SCAN_PMD_NONE:
>  		/*
> -		 * In MADV_COLLAPSE path, possible race with khugepaged where
> -		 * all pte entries have been removed and pmd cleared.  If so,
> -		 * skip all the pte checks and just update the pmd mapping.
> +		 * All pte entries have been removed and pmd cleared.
> +		 * Skip all the pte checks and just update the pmd mapping.
>  		 */
>  		goto maybe_install_pmd;
>  	default:
> @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
>  	mmap_write_unlock(mm);
>  }
>  
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -			       struct mm_struct *target_mm,
> -			       unsigned long target_addr, struct page *hpage,
> -			       struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>  	struct vm_area_struct *vma;
> -	int target_result = SCAN_FAIL;
>  
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -		int result = SCAN_FAIL;
> -		struct mm_struct *mm = NULL;
> -		unsigned long addr = 0;
> -		pmd_t *pmd;
> -		bool is_target = false;
> +		struct mm_struct *mm;
> +		unsigned long addr;
> +		pmd_t *pmd, pgt_pmd;
> +		spinlock_t *pml;
> +		spinlock_t *ptl;
>  
>  		/*
>  		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -		 * got written to. These VMAs are likely not worth investing
> -		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -		 * later.
> +		 * got written to. These VMAs are likely not worth removing
> +		 * page tables from, as PMD-mapping is likely to be split later.
>  		 *
> -		 * Note that vma->anon_vma check is racy: it can be set up after
> -		 * the check but before we took mmap_lock by the fault path.
> -		 * But page lock would prevent establishing any new ptes of the
> -		 * page, so we are safe.
> -		 *
> -		 * An alternative would be drop the check, but check that page
> -		 * table is clear before calling pmdp_collapse_flush() under
> -		 * ptl. It has higher chance to recover THP for the VMA, but
> -		 * has higher cost too. It would also probably require locking
> -		 * the anon_vma.
> +		 * Note that vma->anon_vma check is racy: it can be set after
> +		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +		 * prevented establishing new ptes of the page. So we are safe
> +		 * to remove page table below, without even checking it's empty.
>  		 */
> -		if (READ_ONCE(vma->anon_vma)) {
> -			result = SCAN_PAGE_ANON;
> -			goto next;
> -		}
> +		if (READ_ONCE(vma->anon_vma))
> +			continue;

Not directly related to the current patch, but I just realized there seems
to be a similar issue to the one ab0c3f1251b4 wanted to fix.

IIUC any shmem vma that used to have a uprobe/bp installed will have anon_vma
set here; does that mean any vma that was once being debugged will never be
able to merge into a thp (with either madvise or khugepaged)?

I think it only makes a difference when the page cache was not yet huge when
the bp was uninstalled, but then somehow becomes a thp candidate later.  Even
if so, I think the anon_vma should still be there.

Did I miss something, or maybe that's not even a problem?

> +
>  		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  		if (addr & ~HPAGE_PMD_MASK ||
> -		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -			result = SCAN_VMA_CHECK;
> -			goto next;
> -		}
> -		mm = vma->vm_mm;
> -		is_target = mm == target_mm && addr == target_addr;
> -		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -		if (result != SCAN_SUCCEED)
> -			goto next;
> -		/*
> -		 * We need exclusive mmap_lock to retract page table.
> -		 *
> -		 * We use trylock due to lock inversion: we need to acquire
> -		 * mmap_lock while holding page lock. Fault path does it in
> -		 * reverse order. Trylock is a way to avoid deadlock.
> -		 *
> -		 * Also, it's not MADV_COLLAPSE's job to collapse other
> -		 * mappings - let khugepaged take care of them later.
> -		 */
> -		result = SCAN_PTE_MAPPED_HUGEPAGE;
> -		if ((cc->is_khugepaged || is_target) &&
> -		    mmap_write_trylock(mm)) {
> -			/* trylock for the same lock inversion as above */
> -			if (!vma_try_start_write(vma))
> -				goto unlock_next;
> -
> -			/*
> -			 * Re-check whether we have an ->anon_vma, because
> -			 * collapse_and_free_pmd() requires that either no
> -			 * ->anon_vma exists or the anon_vma is locked.
> -			 * We already checked ->anon_vma above, but that check
> -			 * is racy because ->anon_vma can be populated under the
> -			 * mmap lock in read mode.
> -			 */
> -			if (vma->anon_vma) {
> -				result = SCAN_PAGE_ANON;
> -				goto unlock_next;
> -			}
> -			/*
> -			 * When a vma is registered with uffd-wp, we can't
> -			 * recycle the pmd pgtable because there can be pte
> -			 * markers installed.  Skip it only, so the rest mm/vma
> -			 * can still have the same file mapped hugely, however
> -			 * it'll always mapped in small page size for uffd-wp
> -			 * registered ranges.
> -			 */
> -			if (hpage_collapse_test_exit(mm)) {
> -				result = SCAN_ANY_PROCESS;
> -				goto unlock_next;
> -			}
> -			if (userfaultfd_wp(vma)) {
> -				result = SCAN_PTE_UFFD_WP;
> -				goto unlock_next;
> -			}
> -			collapse_and_free_pmd(mm, vma, addr, pmd);
> -			if (!cc->is_khugepaged && is_target)
> -				result = set_huge_pmd(vma, addr, pmd, hpage);
> -			else
> -				result = SCAN_SUCCEED;
> -
> -unlock_next:
> -			mmap_write_unlock(mm);
> -			goto next;
> -		}
> -		/*
> -		 * Calling context will handle target mm/addr. Otherwise, let
> -		 * khugepaged try again later.
> -		 */
> -		if (!is_target) {
> -			khugepaged_add_pte_mapped_thp(mm, addr);
> +		    vma->vm_end < addr + HPAGE_PMD_SIZE)
>  			continue;
> -		}
> -next:
> -		if (is_target)
> -			target_result = result;
> +
> +		mm = vma->vm_mm;
> +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> +			continue;
> +
> +		if (hpage_collapse_test_exit(mm))
> +			continue;
> +		/*
> +		 * When a vma is registered with uffd-wp, we cannot recycle
> +		 * the page table because there may be pte markers installed.
> +		 * Other vmas can still have the same file mapped hugely, but
> +		 * skip this one: it will always be mapped in small page size
> +		 * for uffd-wp registered ranges.
> +		 *
> +		 * What if VM_UFFD_WP is set a moment after this check?  No
> +		 * problem, huge page lock is still held, stopping new mappings
> +		 * of page which might then get replaced by pte markers: only
> +		 * existing markers need to be protected here.  (We could check
> +		 * after getting ptl below, but this comment distracting there!)
> +		 */
> +		if (userfaultfd_wp(vma))
> +			continue;

IIUC with the new code here we only hold (1) the hpage lock, and (2) the
i_mmap_lock for read.  Could it then be possible that, right after this check
finds !UFFD_WP, someone quickly (1) registers uffd-wp on this vma, and then
(2) uses UFFDIO_WRITEPROTECT to install some pte markers, before the pgtable
locks below are taken?

The thing is, installing pte markers may not need either of those locks,
IIUC...

Would taking the mmap read lock help in this case?

Thanks,

> +
> +		/* Huge page lock is still held, so page table must be empty */
> +		pml = pmd_lock(mm, pmd);
> +		ptl = pte_lockptr(mm, pmd);
> +		if (ptl != pml)
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> +		if (ptl != pml)
> +			spin_unlock(ptl);
> +		spin_unlock(pml);
> +
> +		mm_dec_nr_ptes(mm);
> +		page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> +		pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>  	}
> -	i_mmap_unlock_write(mapping);
> -	return target_result;
> +	i_mmap_unlock_read(mapping);
>  }
>  
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  
>  	/*
>  	 * Remove pte page tables, so we can re-fault the page as huge.
> +	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>  	 */
> -	result = retract_page_tables(mapping, start, mm, addr, hpage,
> -				     cc);
> +	retract_page_tables(mapping, start);
> +	if (cc && !cc->is_khugepaged)
> +		result = SCAN_PTE_MAPPED_HUGEPAGE;
>  	unlock_page(hpage);
>  
>  	/*
> -- 
> 2.35.3
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-05-29 23:26     ` Peter Xu
  (?)
@ 2023-05-31  0:38       ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-05-31  0:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Thanks for looking, Peter: I was well aware of you dropping several hints
that you wanted to see what's intended before passing judgment on the earlier
series, and I preferred to get on with showing this series rather than go
into detail in responses to you there - thanks for your patience :)

On Mon, 29 May 2023, Peter Xu wrote:
> On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote:
...
> > @@ -1748,123 +1747,73 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> >  	mmap_write_unlock(mm);
> >  }
> >  
> > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > -			       struct mm_struct *target_mm,
> > -			       unsigned long target_addr, struct page *hpage,
> > -			       struct collapse_control *cc)
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >  {
> >  	struct vm_area_struct *vma;
> > -	int target_result = SCAN_FAIL;
> >  
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_read(mapping);
> >  	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > -		int result = SCAN_FAIL;
> > -		struct mm_struct *mm = NULL;
> > -		unsigned long addr = 0;
> > -		pmd_t *pmd;
> > -		bool is_target = false;
> > +		struct mm_struct *mm;
> > +		unsigned long addr;
> > +		pmd_t *pmd, pgt_pmd;
> > +		spinlock_t *pml;
> > +		spinlock_t *ptl;
> >  
> >  		/*
> >  		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > -		 * got written to. These VMAs are likely not worth investing
> > -		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
> > -		 * later.
> > +		 * got written to. These VMAs are likely not worth removing
> > +		 * page tables from, as PMD-mapping is likely to be split later.
> >  		 *
> > -		 * Note that vma->anon_vma check is racy: it can be set up after
> > -		 * the check but before we took mmap_lock by the fault path.
> > -		 * But page lock would prevent establishing any new ptes of the
> > -		 * page, so we are safe.
> > -		 *
> > -		 * An alternative would be drop the check, but check that page
> > -		 * table is clear before calling pmdp_collapse_flush() under
> > -		 * ptl. It has higher chance to recover THP for the VMA, but
> > -		 * has higher cost too. It would also probably require locking
> > -		 * the anon_vma.
> > +		 * Note that vma->anon_vma check is racy: it can be set after
> > +		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +		 * prevented establishing new ptes of the page. So we are safe
> > +		 * to remove page table below, without even checking it's empty.
> >  		 */
> > -		if (READ_ONCE(vma->anon_vma)) {
> > -			result = SCAN_PAGE_ANON;
> > -			goto next;
> > -		}
> > +		if (READ_ONCE(vma->anon_vma))
> > +			continue;
> 
> Not directly related to current patch, but I just realized there seems to
> have similar issue as what ab0c3f1251b4 wanted to fix.
> 
> IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma
> set here, then does it mean that any vma used to get debugged will never be
> able to merge into a thp (with either madvise or khugepaged)?
> 
> I think it'll only make a difference when the page cache is not huge yet
> when bp was uninstalled, but then it becomes a thp candidate somehow.  Even
> if so, I think the anon_vma should still be there.
> 
> Did I miss something, or maybe that's not even a problem?

Finding vma->anon_vma set would discourage retract_page_tables() from
doing its business with that previously uprobed area; but it does not stop
collapse_pte_mapped_thp() (which uprobes unregister calls directly) from
dealing with it, and MADV_COLLAPSE works on anon_vma'ed areas too.  It's
just a heuristic in retract_page_tables(), when it chooses to skip the
anon_vma'ed areas as often not worth bothering with.

As to vma merges: I haven't actually checked since the maple tree and other
rewrites of vma merging, but previously one vma with anon_vma set could be
merged with an adjacent vma (before or after) that has no anon_vma set - the
anon_vma comparison is not just equality of anon_vma, but allows NULL on
either side too - so the anon_vma will still be there, but extends to cover
the wider extent.  Right, I find is_mergeable_anon_vma() still following
that rule.
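
For reference, roughly what that rule looks like there (quoting from
memory, so the current tree may differ in detail):

static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
		struct anon_vma *anon_vma2, struct vm_area_struct *vma)
{
	/*
	 * A NULL anon_vma on either side does not block the merge;
	 * the list_is_singular() test avoids merging vmas whose
	 * anon_vma was cloned from a parent.
	 */
	if ((!anon_vma1 || !anon_vma2) &&
	    (!vma || list_is_singular(&vma->anon_vma_chain)))
		return 1;
	return anon_vma1 == anon_vma2;
}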

(And once vmas are merged, so that the whole of the huge page falls within
a single vma, khugepaged can consider it, and do collapse_pte_mapped_thp()
on it - before or after 11/12 I think.)

As to whether it would even be a problem: generally no, the vma is supposed
just to be an internal representation, and so long as the code resists
proliferating them unnecessarily, occasional failures to merge should not
matter.  The one place that forever sticks in my mind as mattering (perhaps
there are others I'm unaware of, but I'd call them bugs) is mremap(): which
is sufficiently awkward and bug-prone already, that nobody ever had the
courage to make it independent of vma boundaries; but ideally, it's
mremap() that we should fix.

But I may have written three answers, yet still missed your point.

...
> > +
> > +		mm = vma->vm_mm;
> > +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> > +			continue;
> > +
> > +		if (hpage_collapse_test_exit(mm))
> > +			continue;
> > +		/*
> > +		 * When a vma is registered with uffd-wp, we cannot recycle
> > +		 * the page table because there may be pte markers installed.
> > +		 * Other vmas can still have the same file mapped hugely, but
> > +		 * skip this one: it will always be mapped in small page size
> > +		 * for uffd-wp registered ranges.
> > +		 *
> > +		 * What if VM_UFFD_WP is set a moment after this check?  No
> > +		 * problem, huge page lock is still held, stopping new mappings
> > +		 * of page which might then get replaced by pte markers: only
> > +		 * existing markers need to be protected here.  (We could check
> > +		 * after getting ptl below, but this comment distracting there!)
> > +		 */
> > +		if (userfaultfd_wp(vma))
> > +			continue;
> 
> IIUC here with the new code we only hold (1) the hpage lock, and (2)
> i_mmap_lock in read mode.  Then could it be possible that right after this
> check finds !UFFD_WP, someone quickly (1) registers uffd-wp on this vma,
> then (2) uses UFFDIO_WRITEPROTECT to install some pte markers, before the
> pgtable locks below are taken?
> 
> The thing is, installation of pte markers may not need either of those
> locks, IIUC..
> 
> Would taking the mmap read lock help in this case?

Isn't my comment above it a good enough answer?  If I misunderstand the
uffd-wp pte marker ("If"? certainly I don't understand it well enough,
but I may or may not be too wrong about it here), and actually it can
spring up in places where the page has not even been mapped yet, then
I'd *much* rather just move that check down into the pte_locked area,
than involve mmap read lock (which, though easier to acquire than its
write lock, would I think take us back to square 1 in terms of needing
trylock); but I did prefer not to have a big uffd-wp comment distracting
from the code flow there.

I expect now that if I follow UFFDIO_WRITEPROTECT through, I shall indeed
find it inserting pte markers where the page has not even been mapped
yet.  A "Yes" from you will save me looking, but probably I shall have
to move that check down (oh well, the comment will be smaller there).
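
If it does come to that, I imagine the move looking something like the
sketch below - untested, just to show the intent of re-checking uffd-wp
only once the page table locks are held:

		pml = pmd_lock(mm, pmd);
		ptl = pte_lockptr(mm, pmd);
		if (ptl != pml)
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
		if (userfaultfd_wp(vma)) {
			/* Markers may have been installed: leave this one */
			if (ptl != pml)
				spin_unlock(ptl);
			spin_unlock(pml);
			continue;
		}
		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);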

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-05-29  6:25   ` Hugh Dickins
  (?)
@ 2023-05-31 15:34     ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-05-31 15:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -                              struct mm_struct *target_mm,
> -                              unsigned long target_addr, struct page *hpage,
> -                              struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>         struct vm_area_struct *vma;
> -       int target_result = SCAN_FAIL;
>
> -       i_mmap_lock_write(mapping);
> +       i_mmap_lock_read(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -               int result = SCAN_FAIL;
> -               struct mm_struct *mm = NULL;
> -               unsigned long addr = 0;
> -               pmd_t *pmd;
> -               bool is_target = false;
> +               struct mm_struct *mm;
> +               unsigned long addr;
> +               pmd_t *pmd, pgt_pmd;
> +               spinlock_t *pml;
> +               spinlock_t *ptl;
>
>                 /*
>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -                * got written to. These VMAs are likely not worth investing
> -                * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -                * later.
> +                * got written to. These VMAs are likely not worth removing
> +                * page tables from, as PMD-mapping is likely to be split later.
>                  *
> -                * Note that vma->anon_vma check is racy: it can be set up after
> -                * the check but before we took mmap_lock by the fault path.
> -                * But page lock would prevent establishing any new ptes of the
> -                * page, so we are safe.
> -                *
> -                * An alternative would be drop the check, but check that page
> -                * table is clear before calling pmdp_collapse_flush() under
> -                * ptl. It has higher chance to recover THP for the VMA, but
> -                * has higher cost too. It would also probably require locking
> -                * the anon_vma.
> +                * Note that vma->anon_vma check is racy: it can be set after
> +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +                * prevented establishing new ptes of the page. So we are safe
> +                * to remove page table below, without even checking it's empty.

This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right? What about private shmem VMAs that
are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
looking at the mapping and its pages (except for checking that the
insertion point is before end-of-file), protected only by mmap_lock
(shared) and pte_offset_map_lock().
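
Concretely, the userspace side of the scenario I mean is something like
this fragment (inside main(), error handling omitted, 4K pages assumed;
the comment states my reading of the mcopy path, which may be off):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int memfd = memfd_create("shmem", 0);
ftruncate(memfd, 2 * 1024 * 1024);
char *map = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE, memfd, 0);

int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
struct uffdio_api api = { .api = UFFD_API };
ioctl(uffd, UFFDIO_API, &api);

struct uffdio_register reg = {
	.range = { .start = (uintptr_t)map, .len = 2 * 1024 * 1024 },
	.mode  = UFFDIO_REGISTER_MODE_MISSING,
};
ioctl(uffd, UFFDIO_REGISTER, &reg);

/*
 * Inserts a fresh anonymous page at map: for a private (!VM_SHARED)
 * shmem vma this takes the anonymous mcopy path, under mmap_lock
 * (shared) and the pte lock, without looking at the shmem page cache
 * or taking its page lock.
 */
char src_buf[4096] = { 1 };
struct uffdio_copy copy = {
	.dst = (uintptr_t)map,
	.src = (uintptr_t)src_buf,
	.len = 4096,
};
ioctl(uffd, UFFDIO_COPY, &copy);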


>                  */
> -               if (READ_ONCE(vma->anon_vma)) {
> -                       result = SCAN_PAGE_ANON;
> -                       goto next;
> -               }
> +               if (READ_ONCE(vma->anon_vma))
> +                       continue;
> +
>                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>                 if (addr & ~HPAGE_PMD_MASK ||
> -                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -                       result = SCAN_VMA_CHECK;
> -                       goto next;
> -               }
> -               mm = vma->vm_mm;
> -               is_target = mm == target_mm && addr == target_addr;
> -               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -               if (result != SCAN_SUCCEED)
> -                       goto next;
> -               /*
> -                * We need exclusive mmap_lock to retract page table.
> -                *
> -                * We use trylock due to lock inversion: we need to acquire
> -                * mmap_lock while holding page lock. Fault path does it in
> -                * reverse order. Trylock is a way to avoid deadlock.
> -                *
> -                * Also, it's not MADV_COLLAPSE's job to collapse other
> -                * mappings - let khugepaged take care of them later.
> -                */
> -               result = SCAN_PTE_MAPPED_HUGEPAGE;
> -               if ((cc->is_khugepaged || is_target) &&
> -                   mmap_write_trylock(mm)) {
> -                       /* trylock for the same lock inversion as above */
> -                       if (!vma_try_start_write(vma))
> -                               goto unlock_next;
> -
> -                       /*
> -                        * Re-check whether we have an ->anon_vma, because
> -                        * collapse_and_free_pmd() requires that either no
> -                        * ->anon_vma exists or the anon_vma is locked.
> -                        * We already checked ->anon_vma above, but that check
> -                        * is racy because ->anon_vma can be populated under the
> -                        * mmap lock in read mode.
> -                        */
> -                       if (vma->anon_vma) {
> -                               result = SCAN_PAGE_ANON;
> -                               goto unlock_next;
> -                       }
> -                       /*
> -                        * When a vma is registered with uffd-wp, we can't
> -                        * recycle the pmd pgtable because there can be pte
> -                        * markers installed.  Skip it only, so the rest mm/vma
> -                        * can still have the same file mapped hugely, however
> -                        * it'll always mapped in small page size for uffd-wp
> -                        * registered ranges.
> -                        */
> -                       if (hpage_collapse_test_exit(mm)) {
> -                               result = SCAN_ANY_PROCESS;
> -                               goto unlock_next;
> -                       }
> -                       if (userfaultfd_wp(vma)) {
> -                               result = SCAN_PTE_UFFD_WP;
> -                               goto unlock_next;
> -                       }
> -                       collapse_and_free_pmd(mm, vma, addr, pmd);

The old code called collapse_and_free_pmd(), which involves MMU
notifier invocation...

> -                       if (!cc->is_khugepaged && is_target)
> -                               result = set_huge_pmd(vma, addr, pmd, hpage);
> -                       else
> -                               result = SCAN_SUCCEED;
> -
> -unlock_next:
> -                       mmap_write_unlock(mm);
> -                       goto next;
> -               }
> -               /*
> -                * Calling context will handle target mm/addr. Otherwise, let
> -                * khugepaged try again later.
> -                */
> -               if (!is_target) {
> -                       khugepaged_add_pte_mapped_thp(mm, addr);
> +                   vma->vm_end < addr + HPAGE_PMD_SIZE)
>                         continue;
> -               }
> -next:
> -               if (is_target)
> -                       target_result = result;
> +
> +               mm = vma->vm_mm;
> +               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> +                       continue;
> +
> +               if (hpage_collapse_test_exit(mm))
> +                       continue;
> +               /*
> +                * When a vma is registered with uffd-wp, we cannot recycle
> +                * the page table because there may be pte markers installed.
> +                * Other vmas can still have the same file mapped hugely, but
> +                * skip this one: it will always be mapped in small page size
> +                * for uffd-wp registered ranges.
> +                *
> +                * What if VM_UFFD_WP is set a moment after this check?  No
> +                * problem, huge page lock is still held, stopping new mappings
> +                * of page which might then get replaced by pte markers: only
> +                * existing markers need to be protected here.  (We could check
> +                * after getting ptl below, but this comment distracting there!)
> +                */
> +               if (userfaultfd_wp(vma))
> +                       continue;
> +
> +               /* Huge page lock is still held, so page table must be empty */
> +               pml = pmd_lock(mm, pmd);
> +               ptl = pte_lockptr(mm, pmd);
> +               if (ptl != pml)
> +                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);

... while the new code only does pmdp_collapse_flush(), which clears
the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
notifiers. My understanding is that that's problematic - maybe (?) it
is sort of okay with regards to classic MMU notifier users like KVM,
but it's probably wrong for IOMMUv2 users, where an IOMMU directly
consumes the normal page tables?

(FWIW, last I looked, there also seemed to be some other issues with
MMU notifier usage wrt IOMMUv2, see the thread
<https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)
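
For comparison, bracketing the collapse the way collapse_and_free_pmd()
did would look roughly like the sketch below (I may have the current
mmu_notifier_range_init() arguments slightly wrong, and this leaves aside
the tlb_remove_table_sync_one() / GUP-fast question):

		struct mmu_notifier_range range;

		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
					addr, addr + HPAGE_PMD_SIZE);
		mmu_notifier_invalidate_range_start(&range);

		pml = pmd_lock(mm, pmd);
		ptl = pte_lockptr(mm, pmd);
		if (ptl != pml)
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
		if (ptl != pml)
			spin_unlock(ptl);
		spin_unlock(pml);

		mmu_notifier_invalidate_range_end(&range);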


> +               if (ptl != pml)
> +                       spin_unlock(ptl);
> +               spin_unlock(pml);
> +
> +               mm_dec_nr_ptes(mm);
> +               page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> +               pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>         }
> -       i_mmap_unlock_write(mapping);
> -       return target_result;
> +       i_mmap_unlock_read(mapping);
>  }
>
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>
>         /*
>          * Remove pte page tables, so we can re-fault the page as huge.
> +        * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>          */
> -       result = retract_page_tables(mapping, start, mm, addr, hpage,
> -                                    cc);
> +       retract_page_tables(mapping, start);
> +       if (cc && !cc->is_khugepaged)
> +               result = SCAN_PTE_MAPPED_HUGEPAGE;
>         unlock_page(hpage);
>
>         /*
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
@ 2023-05-31 15:34     ` Jann Horn
  0 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-05-31 15:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -                              struct mm_struct *target_mm,
> -                              unsigned long target_addr, struct page *hpage,
> -                              struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>         struct vm_area_struct *vma;
> -       int target_result = SCAN_FAIL;
>
> -       i_mmap_lock_write(mapping);
> +       i_mmap_lock_read(mapping);
>         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -               int result = SCAN_FAIL;
> -               struct mm_struct *mm = NULL;
> -               unsigned long addr = 0;
> -               pmd_t *pmd;
> -               bool is_target = false;
> +               struct mm_struct *mm;
> +               unsigned long addr;
> +               pmd_t *pmd, pgt_pmd;
> +               spinlock_t *pml;
> +               spinlock_t *ptl;
>
>                 /*
>                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -                * got written to. These VMAs are likely not worth investing
> -                * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -                * later.
> +                * got written to. These VMAs are likely not worth removing
> +                * page tables from, as PMD-mapping is likely to be split later.
>                  *
> -                * Note that vma->anon_vma check is racy: it can be set up after
> -                * the check but before we took mmap_lock by the fault path.
> -                * But page lock would prevent establishing any new ptes of the
> -                * page, so we are safe.
> -                *
> -                * An alternative would be drop the check, but check that page
> -                * table is clear before calling pmdp_collapse_flush() under
> -                * ptl. It has higher chance to recover THP for the VMA, but
> -                * has higher cost too. It would also probably require locking
> -                * the anon_vma.
> +                * Note that vma->anon_vma check is racy: it can be set after
> +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +                * prevented establishing new ptes of the page. So we are safe
> +                * to remove page table below, without even checking it's empty.

This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right? What about private shmem VMAs that
are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
looking at the mapping and its pages (except for checking that the
insertion point is before end-of-file), protected only by mmap_lock
(shared) and pte_offset_map_lock().


>                  */
> -               if (READ_ONCE(vma->anon_vma)) {
> -                       result = SCAN_PAGE_ANON;
> -                       goto next;
> -               }
> +               if (READ_ONCE(vma->anon_vma))
> +                       continue;
> +
>                 addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>                 if (addr & ~HPAGE_PMD_MASK ||
> -                   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -                       result = SCAN_VMA_CHECK;
> -                       goto next;
> -               }
> -               mm = vma->vm_mm;
> -               is_target = mm == target_mm && addr == target_addr;
> -               result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -               if (result != SCAN_SUCCEED)
> -                       goto next;
> -               /*
> -                * We need exclusive mmap_lock to retract page table.
> -                *
> -                * We use trylock due to lock inversion: we need to acquire
> -                * mmap_lock while holding page lock. Fault path does it in
> -                * reverse order. Trylock is a way to avoid deadlock.
> -                *
> -                * Also, it's not MADV_COLLAPSE's job to collapse other
> -                * mappings - let khugepaged take care of them later.
> -                */
> -               result = SCAN_PTE_MAPPED_HUGEPAGE;
> -               if ((cc->is_khugepaged || is_target) &&
> -                   mmap_write_trylock(mm)) {
> -                       /* trylock for the same lock inversion as above */
> -                       if (!vma_try_start_write(vma))
> -                               goto unlock_next;
> -
> -                       /*
> -                        * Re-check whether we have an ->anon_vma, because
> -                        * collapse_and_free_pmd() requires that either no
> -                        * ->anon_vma exists or the anon_vma is locked.
> -                        * We already checked ->anon_vma above, but that check
> -                        * is racy because ->anon_vma can be populated under the
> -                        * mmap lock in read mode.
> -                        */
> -                       if (vma->anon_vma) {
> -                               result = SCAN_PAGE_ANON;
> -                               goto unlock_next;
> -                       }
> -                       /*
> -                        * When a vma is registered with uffd-wp, we can't
> -                        * recycle the pmd pgtable because there can be pte
> -                        * markers installed.  Skip it only, so the rest mm/vma
> -                        * can still have the same file mapped hugely, however
> -                        * it'll always mapped in small page size for uffd-wp
> -                        * registered ranges.
> -                        */
> -                       if (hpage_collapse_test_exit(mm)) {
> -                               result = SCAN_ANY_PROCESS;
> -                               goto unlock_next;
> -                       }
> -                       if (userfaultfd_wp(vma)) {
> -                               result = SCAN_PTE_UFFD_WP;
> -                               goto unlock_next;
> -                       }
> -                       collapse_and_free_pmd(mm, vma, addr, pmd);

The old code called collapse_and_free_pmd(), which involves MMU
notifier invocation...

> -                       if (!cc->is_khugepaged && is_target)
> -                               result = set_huge_pmd(vma, addr, pmd, hpage);
> -                       else
> -                               result = SCAN_SUCCEED;
> -
> -unlock_next:
> -                       mmap_write_unlock(mm);
> -                       goto next;
> -               }
> -               /*
> -                * Calling context will handle target mm/addr. Otherwise, let
> -                * khugepaged try again later.
> -                */
> -               if (!is_target) {
> -                       khugepaged_add_pte_mapped_thp(mm, addr);
> +                   vma->vm_end < addr + HPAGE_PMD_SIZE)
>                         continue;
> -               }
> -next:
> -               if (is_target)
> -                       target_result = result;
> +
> +               mm = vma->vm_mm;
> +               if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> +                       continue;
> +
> +               if (hpage_collapse_test_exit(mm))
> +                       continue;
> +               /*
> +                * When a vma is registered with uffd-wp, we cannot recycle
> +                * the page table because there may be pte markers installed.
> +                * Other vmas can still have the same file mapped hugely, but
> +                * skip this one: it will always be mapped in small page size
> +                * for uffd-wp registered ranges.
> +                *
> +                * What if VM_UFFD_WP is set a moment after this check?  No
> +                * problem, huge page lock is still held, stopping new mappings
> +                * of page which might then get replaced by pte markers: only
> +                * existing markers need to be protected here.  (We could check
> +                * after getting ptl below, but this comment distracting there!)
> +                */
> +               if (userfaultfd_wp(vma))
> +                       continue;
> +
> +               /* Huge page lock is still held, so page table must be empty */
> +               pml = pmd_lock(mm, pmd);
> +               ptl = pte_lockptr(mm, pmd);
> +               if (ptl != pml)
> +                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);

... while the new code only does pmdp_collapse_flush(), which clears
the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
notifiers. My understanding is that that's problematic - maybe (?) it
is sort of okay with regards to classic MMU notifier users like KVM,
but it's probably wrong for IOMMUv2 users, where an IOMMU directly
consumes the normal page tables?

(FWIW, last I looked, there also seemed to be some other issues with
MMU notifier usage wrt IOMMUv2, see the thread
<https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)
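
For reference, the bracketing that the removed collapse_and_free_pmd()
provides around the flush looks roughly like this (reconstructed from a
v6.4-era tree; the exact mmu_notifier_range_init() arguments differ
between kernel versions, so treat this as a sketch, not a quote of the
patch):

        struct mmu_notifier_range range;
        pmd_t pgt_pmd;

        /* notify secondary MMUs (KVM, IOMMU SVA, ...) before clearing the pmd */
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
                                addr + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);

        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
        /* also serialize against lockless (GUP-fast) page table walkers */
        tlb_remove_table_sync_one();

        mmu_notifier_invalidate_range_end(&range);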


> +               if (ptl != pml)
> +                       spin_unlock(ptl);
> +               spin_unlock(pml);
> +
> +               mm_dec_nr_ptes(mm);
> +               page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> +               pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>         }
> -       i_mmap_unlock_write(mapping);
> -       return target_result;
> +       i_mmap_unlock_read(mapping);
>  }
>
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>
>         /*
>          * Remove pte page tables, so we can re-fault the page as huge.
> +        * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>          */
> -       result = retract_page_tables(mapping, start, mm, addr, hpage,
> -                                    cc);
> +       retract_page_tables(mapping, start);
> +       if (cc && !cc->is_khugepaged)
> +               result = SCAN_PTE_MAPPED_HUGEPAGE;
>         unlock_page(hpage);
>
>         /*
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  2023-05-29  6:14   ` Hugh Dickins
  (?)
@ 2023-05-31 17:06     ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-05-31 17:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote:
> Before putting them to use (several commits later), add rcu_read_lock()
> to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> separate commit, since it risks exposing imbalances: prior commits have
> fixed all the known imbalances, but we may find some have been missed.
[...]
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c7ab18a5fb77..674671835631 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>  {
>         pmd_t pmdval;
>
> -       /* rcu_read_lock() to be added later */
> +       rcu_read_lock();
>         pmdval = pmdp_get_lockless(pmd);
>         if (pmdvalp)
>                 *pmdvalp = pmdval;

It might be a good idea to document that this series assumes that the
first argument to __pte_offset_map() is a pointer into a second-level
page table (and not a local copy of the entry) unless the containing
VMA is known to not be THP-eligible or the page table is detached from
the page table hierarchy or something like that. Currently a bunch of
places pass references to local copies of the entry, and while I think
all of these are fine, it would probably be good to at least document
why these are allowed to do it while other places aren't.

$ vgrep 'pte_offset_map(&'
Index File                  Line Content
    0 arch/sparc/mm/tlb.c    151 pte = pte_offset_map(&pmd, vaddr);
    1 kernel/events/core.c  7501 ptep = pte_offset_map(&pmd, addr);
    2 mm/gup.c              2460 ptem = ptep = pte_offset_map(&pmd, addr);
    3 mm/huge_memory.c      2057 pte = pte_offset_map(&_pmd, haddr);
    4 mm/huge_memory.c      2214 pte = pte_offset_map(&_pmd, haddr);
    5 mm/page_table_check.c  240 pte_t *ptep = pte_offset_map(&pmd, addr);
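
To illustrate the distinction (a hedged sketch, not code from the
series): the argument is a pmd_t *, and what matters is whether it
points into a live page table or at a caller-local snapshot.

        pte_t *pte;
        pmd_t pmdval;

        /*
         * Case 1: pmd points into the mm's page table hierarchy, so the
         * lockless read inside __pte_offset_map() sees the current entry
         * and the RCU read section covers the table it points to.
         */
        pte = pte_offset_map(pmd, addr);
        ...
        pte_unmap(pte);

        /*
         * Case 2 (the callers listed above): a local snapshot is taken
         * first and a pointer to the copy is passed, so __pte_offset_map()
         * only ever sees the snapshot; safe only while the snapshotted
         * page table cannot be freed or replaced underneath us.
         */
        pmdval = pmdp_get_lockless(pmd);
        pte = pte_offset_map(&pmdval, addr);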

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-05-29  6:26   ` Hugh Dickins
@ 2023-05-31 17:25     ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-05-31 17:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:26 AM Hugh Dickins <hughd@google.com> wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

I think there's a weirdness in the existing code, and this change
probably turns that into a UAF bug.

collapse_pte_mapped_thp() can be called on an address that might not
be associated with a VMA anymore, and after this change, the page
tables for that address might be in the middle of page table teardown
in munmap(), right? The existing mmap_write_lock() guards against
concurrent munmap() (so in the old code we are guaranteed to either
see a normal VMA or not see the page tables anymore), but
mmap_read_lock() only guards against the part of munmap() up to the
mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
(including free_pgtables()) happens after that.
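
For reference, the ordering being described, as a heavily simplified
sketch of v6.4's do_vmi_align_munmap() from memory (not a quote of the
source):

        /*
         * 1. Under mmap_write_lock: split boundary VMAs and detach the
         *    VMAs covering [start, end) from the maple tree.
         * 2. mmap_write_downgrade(mm) -- the write lock becomes a read lock.
         * 3. unmap_region(): unmap_vmas() zaps the ptes, then
         *    free_pgtables() frees the page tables themselves, all with
         *    only mmap_read_lock held.
         * 4. mmap_read_unlock(mm).
         */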

So we can now enter collapse_pte_mapped_thp() and race with concurrent
free_pgtables() such that a PUD disappears under us while we're
walking it or something like that:


int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
          bool install_pmd)
{
  struct mmu_notifier_range range;
  unsigned long haddr = addr & HPAGE_PMD_MASK;
  struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
  struct page *hpage;
  pte_t *start_pte, *pte;
  pmd_t *pmd, pgt_pmd;
  spinlock_t *pml, *ptl;
  int nr_ptes = 0, result = SCAN_FAIL;
  int i;

  mmap_assert_locked(mm);

  /* Fast check before locking page if already PMD-mapped */
  result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
  if (result == SCAN_PMD_MAPPED)
    return result;

  if (!vma || !vma->vm_file || // <<< bailout happens too late
      !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
    return SCAN_VMA_CHECK;


I guess the right fix here is to make sure that at least the basic VMA
revalidation stuff (making sure there still is a VMA covering this
range) happens before find_pmd_or_thp_or_none()? Like:


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 301c0e54a2ef..5db365587556 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,

         mmap_assert_locked(mm);

+        if (!vma || !vma->vm_file ||
+            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+                return SCAN_VMA_CHECK;
+
         /* Fast check before locking page if already PMD-mapped */
         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
         if (result == SCAN_PMD_MAPPED)
                 return result;

-        if (!vma || !vma->vm_file ||
-            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-                return SCAN_VMA_CHECK;
-
         /*
          * If we are here, we've succeeded in replacing all the native pages
          * in the page cache with a single hugepage. If a mm were to fault-in

^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [PATCH 00/12] mm: free retracted page table by RCU
  2023-05-29  6:11 ` Hugh Dickins
                   ` (13 preceding siblings ...)
  (?)
@ 2023-05-31 17:59 ` Jann Horn
  2023-06-02  4:37     ` Hugh Dickins
  -1 siblings, 1 reply; 158+ messages in thread
From: Jann Horn @ 2023-05-31 17:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:11 AM Hugh Dickins <hughd@google.com> wrote:
> Here is the third series of patches to mm (and a few architectures), based
> on v6.4-rc3 with the preceding two series applied: in which khugepaged
> takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

To clarify: Part of the design here is that when you look up a user
page table with pte_offset_map_nolock() or pte_offset_map() without
holding mmap_lock in write mode, and you later lock the page table
yourself, you don't know whether you actually have the real page table
or a detached table that is currently in its RCU grace period, right?
And detached tables are supposed to consist of only zeroed entries,
and we assume that no relevant codepath will do anything bad if one of
these functions spuriously returns a pointer to a page table full of
zeroed entries?

So in particular, in handle_pte_fault() we can reach the "if
(unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
detached zeroed page table, but we're okay with that because in that
case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
which implies !pte_same(*vmf->pte, entry) , which means we'll bail
out?
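
The check in question reads roughly like this in v6.4's
handle_pte_fault() (quoted from memory, so a sketch): a detached, zeroed
page table yields pte_none(*vmf->pte) while vmf->orig_pte was not none,
so pte_same() fails and the fault bails out to be retried.

        vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
        if (unlikely(!pte_same(*vmf->pte, entry))) {
                update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
                goto unlock;
        }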

If that's the intent, it might be good to add some comments, because
at least to me that's not very obvious.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
       [not found]     ` <ZHe0A079X9B8jWlH@x1n>
  2023-05-31 22:18         ` Jann Horn
@ 2023-05-31 22:18         ` Jann Horn
  0 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-05-31 22:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, May 31, 2023 at 10:54 PM Peter Xu <peterx@redhat.com> wrote:
> On Wed, May 31, 2023 at 05:34:58PM +0200, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > -                              struct mm_struct *target_mm,
> > > -                              unsigned long target_addr, struct page *hpage,
> > > -                              struct collapse_control *cc)
> > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >  {
> > >         struct vm_area_struct *vma;
> > > -       int target_result = SCAN_FAIL;
> > >
> > > -       i_mmap_lock_write(mapping);
> > > +       i_mmap_lock_read(mapping);
> > >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > -               int result = SCAN_FAIL;
> > > -               struct mm_struct *mm = NULL;
> > > -               unsigned long addr = 0;
> > > -               pmd_t *pmd;
> > > -               bool is_target = false;
> > > +               struct mm_struct *mm;
> > > +               unsigned long addr;
> > > +               pmd_t *pmd, pgt_pmd;
> > > +               spinlock_t *pml;
> > > +               spinlock_t *ptl;
> > >
> > >                 /*
> > >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > > -                * got written to. These VMAs are likely not worth investing
> > > -                * mmap_write_lock(mm) as PMD-mapping is likely to be split
> > > -                * later.
> > > +                * got written to. These VMAs are likely not worth removing
> > > +                * page tables from, as PMD-mapping is likely to be split later.
> > >                  *
> > > -                * Note that vma->anon_vma check is racy: it can be set up after
> > > -                * the check but before we took mmap_lock by the fault path.
> > > -                * But page lock would prevent establishing any new ptes of the
> > > -                * page, so we are safe.
> > > -                *
> > > -                * An alternative would be drop the check, but check that page
> > > -                * table is clear before calling pmdp_collapse_flush() under
> > > -                * ptl. It has higher chance to recover THP for the VMA, but
> > > -                * has higher cost too. It would also probably require locking
> > > -                * the anon_vma.
> > > +                * Note that vma->anon_vma check is racy: it can be set after
> > > +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > > +                * prevented establishing new ptes of the page. So we are safe
> > > +                * to remove page table below, without even checking it's empty.
> >
> > This "we are safe to remove page table below, without even checking
> > it's empty" assumes that the only way to create new anonymous PTEs is
> > to use existing file PTEs, right? What about private shmem VMAs that
> > are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> > looking at the mapping and its pages (except for checking that the
> > insertion point is before end-of-file), protected only by mmap_lock
> > (shared) and pte_offset_map_lock().
>
> Hmm, yes.  We probably need to keep that though, and 5b51072e97 explained
> the reason (to still respect file permissions).
>
> Maybe the anon_vma check can also be moved into the pgtable lock section,
> with some comments explaining (but it's getting a bit ugly..)?

Or check that all entries are pte_none() or something like that inside
the pgtable-locked section?
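
Something like this, perhaps (a minimal sketch, reusing names from the
patch above but not taken from it, with a hypothetical "skip" label for
the bail-out path):

        pte_t *start_pte, *pte;
        int i;

        /* pml and ptl are already held at this point, as in the patch */
        start_pte = pte_offset_map(pmd, addr);
        if (!start_pte)
                goto skip;
        for (i = 0, pte = start_pte; i < HPAGE_PMD_NR; i++, pte++) {
                if (!pte_none(ptep_get(pte))) {
                        /* e.g. a racing UFFDIO_COPY installed an anon pte */
                        pte_unmap(start_pte);
                        goto skip;
                }
        }
        pte_unmap(start_pte);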

[...]
> > The old code called collapse_and_free_pmd(), which involves MMU
> > notifier invocation...
[...]
> > ... while the new code only does pmdp_collapse_flush(), which clears
> > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> > notifiers. My understanding is that that's problematic - maybe (?) it
> > is sort of okay with regards to classic MMU notifier users like KVM,
> > but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> > consumes the normal page tables?
>
> The iommuv2 wasn't "consuming" the pgtables?

My wording was confusing, I meant that as "the iommuv2 hardware
directly uses/walks the page tables".

> IIUC it relies on that to
> make sure no secondary (and illegal) tlb exists in the iommu tlbs.
>
> For this case if the pgtable _must_ be empty when reaching here (we'd
> better make sure of it..), maybe we're good?  Because we should have just
> invalidated once when unmap all the pages in the thp range, so no existing
> tlb should generate anyway for either cpu or iommu hardwares.

My headcanon is that there are approximately three reasons why we
normally have to do iommuv2 invalidations and I think one or two of
them might still apply here, though admittedly I haven't actually dug
up documentation on how this stuff actually works for IOMMUv2, so
maybe one of y'all can tell me that my concerns here are unfounded:

1. We have to flush normal TLB entries. This is probably not necessary
if the page table contains no entries.
2. We might have to flush "paging-structure caches" / "intermediate
table walk caches", if the IOMMU caches the physical addresses of page
tables to skip some levels of page table walk. IDK if IOMMUs do that,
but normal MMUs definitely do it, so I wouldn't be surprised if the
IOMMUs did it too (or reserved the right to do it in a future hardware
generation or whatever).
3. We have to *serialize* with page table walks performed by the
IOMMU. We're doing an RCU barrier to synchronize against page table
walks from the MMU, but without an appropriate mmu_notifier call, we
have nothing to ensure that we aren't yanking a page table out from
under an IOMMU page table walker while it's in the middle of its walk.
Sure, this isn't very likely in practice, the IOMMU page table walker
is probably pretty fast, but still we need some kind of explicit
synchronization to make this robust, I think.

> However OTOH, maybe it'll also be safer to just have the mmu notifiers like
> before (e.g., no idea whether anything can cache invalidate tlb
> translations from the empty pgtable)? As that doesn't seems to beat the
> purpose of the patchset as notifiers shouldn't fail.
>
> >
> > (FWIW, last I looked, there also seemed to be some other issues with
> > MMU notifier usage wrt IOMMUv2, see the thread
> > <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page
  2023-05-29  6:23   ` Hugh Dickins
  (?)
@ 2023-06-01 13:31     ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-06-01 13:31 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Mon, May 29, 2023 at 8:23 AM Hugh Dickins <hughd@google.com> wrote:
> Add the generic pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This version
> suits all those architectures which use an unfragmented page for one page
> table (none of whose pte_free()s use the mm arg which was passed to it).

Pages that have been scheduled for deferred freeing can still be
locked, right? So struct page's members "ptl" and "rcu_head" can now
be in use at the same time? If that's intended, it would probably be a
good idea to add comments in the "/* Page table pages */" part of
struct page to point out that the first two members can be used by the
rcu_head while the page is still used as a page table in some
contexts, including use of the ptl.
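
For illustration only, a toy model of the overlap being described (this is
not the kernel's struct page; types are simplified and the non-split-ptlock
case is assumed): rcu_head overlays just the first two words of the
page-table variant, while ptl lives beyond them, which is why a table
awaiting its RCU grace period can still have its ptl taken.

	struct toy_callback_head {
		struct toy_callback_head *next;
		void (*func)(struct toy_callback_head *head);
	};

	struct toy_page {
		unsigned long flags;
		union {
			struct {			/* "Page table pages" */
				unsigned long _pt_pad_1;	/* rcu_head.next lands here */
				void *pmd_huge_pte;		/* rcu_head.func lands here */
				unsigned long _pt_pad_2;
				void *pt_mm;
				unsigned int ptl;		/* stand-in for spinlock_t */
			};
			struct toy_callback_head rcu_head;	/* used by call_rcu() */
		};
	};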

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-05-29 14:36       ` Hugh Dickins
  (?)
@ 2023-06-01 13:57         ` Gerald Schaefer
  -1 siblings, 0 replies; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-01 13:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, David Hildenbrand, Suren Baghdasaryan,
	Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra,
	Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm, Vasily Gorbik

On Mon, 29 May 2023 07:36:40 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Mon, 29 May 2023, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:  
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > +}  
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.  
> 
> Oh yes, of course, thanks for catching that so quickly.
> So my s390 and sparc implementations will be equally broken.
> 
> > 
> > I have no idea how to solve this problem.  
> 
> I do: I'll have to go back to the more complicated implementation we
> actually ran with on powerpc - I was thinking those complications just
> related to deposit/withdraw matters, forgetting the one-rcu_head issue.
> 
> It uses large (0x10000) increments of the page refcount, avoiding
> call_rcu() when already active.
> 
> It's not a complication I had wanted to explain or test for now,
> but we shall have to.  Should apply equally well to sparc, but s390
> more of a problem, since s390 already has its own refcount cleverness.

Yes, we have 2 pagetables in one 4K page, which could result in the same
rcu_head being reused. It might be possible to use the cleverness from our
page_table_free() function, e.g. to only do the call_rcu() once, for
the case where both 2K pagetable fragments become unused, similar to
how we decide when to actually call __free_page().

However, it might be much worse: page->rcu_head from a pagetable
page cannot be used at all for s390, because we also use page->lru
to keep our list of free 2K pagetable fragments. I always get confused
by struct page unions, so I'm not completely sure, but it seems to me that
page->rcu_head would overlap with page->lru, right?
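
For what it's worth, a toy model of why that suspicion looks right (again not
the real struct page, just simplified stand-ins): lru and rcu_head sit in the
same union at the same offset, and each is two words, so they do overlap.

	struct toy_list_head {
		struct toy_list_head *next, *prev;	/* two words, like page->lru */
	};

	struct toy_callback_head {
		struct toy_callback_head *next;		/* two words, like page->rcu_head */
		void (*func)(struct toy_callback_head *head);
	};

	struct toy_page {
		unsigned long flags;
		union {				/* same offset for both members: */
			struct toy_list_head lru;		/* s390 free-fragment list */
			struct toy_callback_head rcu_head;	/* consumed by call_rcu() */
		};
	};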

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-05-31 22:18         ` Jann Horn
  (?)
@ 2023-06-01 14:06           ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-01 14:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Xu, Hugh Dickins, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Thu, Jun 01, 2023 at 12:18:43AM +0200, Jann Horn wrote:

> 3. We have to *serialize* with page table walks performed by the
> IOMMU. We're doing an RCU barrier to synchronize against page table
> walks from the MMU, but without an appropriate mmu_notifier call, we
> have nothing to ensure that we aren't yanking a page table out from
> under an IOMMU page table walker while it's in the middle of its walk.
> Sure, this isn't very likely in practice, the IOMMU page table walker
> is probably pretty fast, but still we need some kind of explicit
> synchronization to make this robust, I think.

There is another thread talking about this...

Broadly, we are saying that we need to call the mmu notifier ops'
invalidate_range at any time the normal CPU TLB would be invalidated.

invalidate_range will not return until the iommu HW is coherent with
the current state of the page table.

Jason
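
To make that rule concrete, a rough sketch of the pairing (not the series'
code; vma, mm, addr and pmd are assumed from the khugepaged context):

	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* clears pmd, flushes CPU TLB */
	/* bring IOMMU/secondary TLBs into line before the detached table is reused */
	mmu_notifier_invalidate_range(mm, addr, addr + HPAGE_PMD_SIZE);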

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  2023-05-31 17:06     ` Jann Horn
  (?)
@ 2023-06-02  2:50       ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  2:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote:
> > Before putting them to use (several commits later), add rcu_read_lock()
> > to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> > separate commit, since it risks exposing imbalances: prior commits have
> > fixed all the known imbalances, but we may find some have been missed.
> [...]
> > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > index c7ab18a5fb77..674671835631 100644
> > --- a/mm/pgtable-generic.c
> > +++ b/mm/pgtable-generic.c
> > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> >  {
> >         pmd_t pmdval;
> >
> > -       /* rcu_read_lock() to be added later */
> > +       rcu_read_lock();
> >         pmdval = pmdp_get_lockless(pmd);
> >         if (pmdvalp)
> >                 *pmdvalp = pmdval;
> 
> It might be a good idea to document that this series assumes that the
> first argument to __pte_offset_map() is a pointer into a second-level
> page table (and not a local copy of the entry) unless the containing
> VMA is known to not be THP-eligible or the page table is detached from
> the page table hierarchy or something like that. Currently a bunch of
> places pass references to local copies of the entry, and while I think
> all of these are fine, it would probably be good to at least document
> why these are allowed to do it while other places aren't.

Thanks Jann: but I have to guess that here you are showing awareness of
an important issue that I'm simply ignorant of.

I have been haunted by a dim recollection that there is one architecture
(arm-32?) which is fussy about the placement of the pmdval being examined
(deduces info missing from the arch-independent interface, by following
up the address?), but I couldn't track it down when I tried.

Please tell me more; or better, don't spend your time explaining to me,
but please just send a link to a good reference on the issue.  I'll be
unable to document what you ask there, without educating myself first.

Thanks,
Hugh

> 
> $ vgrep 'pte_offset_map(&'
> Index File                  Line Content
>     0 arch/sparc/mm/tlb.c    151 pte = pte_offset_map(&pmd, vaddr);
>     1 kernel/events/core.c  7501 ptep = pte_offset_map(&pmd, addr);
>     2 mm/gup.c              2460 ptem = ptep = pte_offset_map(&pmd, addr);
>     3 mm/huge_memory.c      2057 pte = pte_offset_map(&_pmd, haddr);
>     4 mm/huge_memory.c      2214 pte = pte_offset_map(&_pmd, haddr);
>     5 mm/page_table_check.c  240 pte_t *ptep = pte_offset_map(&pmd, addr);
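
To spell out the distinction being drawn in that listing, a hypothetical
fragment (pud and addr are assumed, and the pte_unmap() pairing is omitted):

	pte_t *pte;
	pmd_t *pmdp = pmd_offset(pud, addr);	/* pointer into the pmd page table */
	pmd_t pmdval = pmdp_get_lockless(pmdp);	/* an on-stack copy of the entry */

	pte = pte_offset_map(pmdp, addr);	/* the case the series' rcu rules target */
	pte = pte_offset_map(&pmdval, addr);	/* the "local copy" callers listed above */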

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 00/12] mm: free retracted page table by RCU
  2023-05-31 17:59 ` [PATCH 00/12] mm: free retracted page table by RCU Jann Horn
  2023-06-02  4:37     ` Hugh Dickins
@ 2023-06-02  4:37     ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  4:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:11 AM Hugh Dickins <hughd@google.com> wrote:
> > Here is the third series of patches to mm (and a few architectures), based
> > on v6.4-rc3 with the preceding two series applied: in which khugepaged
> > takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
> 
> To clarify: Part of the design here is that when you look up a user
> page table with pte_offset_map_nolock() or pte_offset_map() without
> holding mmap_lock in write mode, and you later lock the page table
> yourself, you don't know whether you actually have the real page table
> or a detached table that is currently in its RCU grace period, right?

Right.  (And I'd rather not assume anything of mmap_lock, but there are
one or two or three places that may still do so.)

> And detached tables are supposed to consist of only zeroed entries,
> and we assume that no relevant codepath will do anything bad if one of
> these functions spuriously returns a pointer to a page table full of
> zeroed entries?

(Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

If someone is using pte_offset_map() without lock, they must be prepared
to accept page-table-like changes.  As for the limits of pte_offset_map_nolock()
with a later spin_lock(ptl), I'm still exploring: there's certainly an
argument that one ought to do a pmd_same() check before proceeding,
but I don't think anywhere needs that at present.
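
Just to make that possible check concrete, a hypothetical sketch (nothing in
the series currently requires this; the snapshot-then-recheck shape, and the
failure return, are assumptions):

	spinlock_t *ptl;
	pmd_t pmdval = pmdp_get_lockless(pmd);	/* snapshot of the entry */
	pte_t *pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);

	if (!pte)
		return -EAGAIN;			/* placeholder failure path */
	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the page table may have been retracted under us: back out */
		spin_unlock(ptl);
		pte_unmap(pte);
		return -EAGAIN;
	}
	/* ... ptes can now be examined under ptl ... */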

Whether the page table has to be full of zeroed entries when detached:
I believe it is always like that at present (by the end of the series,
when the collapse_pte_offset_map() oddity is fixed), but whether it needs
to be so I'm not sure.  Quite likely it will need to be; but I'm open to
the possibility that all it needs is to be still a page table, with
perhaps new entries from a new usage in it.

The most obvious vital thing (in the split ptlock case) is that it
remains a struct page with a usable ptl spinlock embedded in it.

The question becomes more urgent when/if extending to replacing the
pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
wants to deposit the page table for later use even in the shmem/file
case (and all arches in the anon case): I did work out the details once
before, but I'm not sure whether I would still agree with myself; and was
glad to leave replacement out of this series, to revisit some time later.

> 
> So in particular, in handle_pte_fault() we can reach the "if
> (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> detached zeroed page table, but we're okay with that because in that
> case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
> which implies !pte_same(*vmf->pte, entry) , which means we'll bail
> out?

There is no current (even at end of series) circumstance in which we
could be pointing to a detached page table there; but yes, I want to
allow for that, and yes I agree with your analysis.  But with the
interesting unanswered question for the future, of what if the same
value could be found there: would that imply it's safe to proceed,
or would some further prevention be needed?

> 
> If that's the intent, it might be good to add some comments, because
> at least to me that's not very obvious.

That's a very fair request; but I shall have difficulty deciding where
to place such comments.  I shall have to try, then you redirect me.

And I think we approach this in opposite ways: my nature is to put some
infrastructure in place, and then look at it to see what we can get away
with; whereas your nature is to define upfront what the possibilities are.
We can expect some tussles!

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-05-31 17:25     ` Jann Horn
  (?)
@ 2023-06-02  5:11       ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  5:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:26 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> I think there's a weirdness in the existing code, and this change
> probably turns that into a UAF bug.
> 
> collapse_pte_mapped_thp() can be called on an address that might not
> be associated with a VMA anymore, and after this change, the page
> tables for that address might be in the middle of page table teardown
> in munmap(), right? The existing mmap_write_lock() guards against
> concurrent munmap() (so in the old code we are guaranteed to either
> see a normal VMA or not see the page tables anymore), but
> mmap_read_lock() only guards against the part of munmap() up to the
> mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
> (including free_pgtables()) happens after that.

Excellent point, thank you.  Don't let anyone overhear us, but I have
to confess to you that that mmap_write_downgrade() has never impinged
forcefully enough on my consciousness: it's still my habit to think of
mmap_lock as exclusive over free_pgtables(), and I've not encountered
this bug in my testing.

Right, I'll gladly incorporate your collapse_pte_mapped_thp()
rearrangement below.  And am reassured to realize that by removing
mmap_lock dependence elsewhere, I won't have got it wrong in other places.

Thanks,
Hugh

> 
> So we can now enter collapse_pte_mapped_thp() and race with concurrent
> free_pgtables() such that a PUD disappears under us while we're
> walking it or something like that:
> 
> 
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>           bool install_pmd)
> {
>   struct mmu_notifier_range range;
>   unsigned long haddr = addr & HPAGE_PMD_MASK;
>   struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
>   struct page *hpage;
>   pte_t *start_pte, *pte;
>   pmd_t *pmd, pgt_pmd;
>   spinlock_t *pml, *ptl;
>   int nr_ptes = 0, result = SCAN_FAIL;
>   int i;
> 
>   mmap_assert_locked(mm);
> 
>   /* Fast check before locking page if already PMD-mapped */
>   result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
>   if (result == SCAN_PMD_MAPPED)
>     return result;
> 
>   if (!vma || !vma->vm_file || // <<< bailout happens too late
>       !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>     return SCAN_VMA_CHECK;
> 
> 
> I guess the right fix here is to make sure that at least the basic VMA
> revalidation stuff (making sure there still is a VMA covering this
> range) happens before find_pmd_or_thp_or_none()? Like:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 301c0e54a2ef..5db365587556 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> 
>          mmap_assert_locked(mm);
> 
> +        if (!vma || !vma->vm_file ||
> +            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +                return SCAN_VMA_CHECK;
> +
>          /* Fast check before locking page if already PMD-mapped */
>          result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>          if (result == SCAN_PMD_MAPPED)
>                  return result;
> 
> -        if (!vma || !vma->vm_file ||
> -            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -                return SCAN_VMA_CHECK;
> -
>          /*
>           * If we are here, we've succeeded in replacing all the native pages
>           * in the page cache with a single hugepage. If a mm were to fault-in
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
@ 2023-06-02  5:11       ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  5:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3917 bytes --]

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:26 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> I think there's a weirdness in the existing code, and this change
> probably turns that into a UAF bug.
> 
> collapse_pte_mapped_thp() can be called on an address that might not
> be associated with a VMA anymore, and after this change, the page
> tables for that address might be in the middle of page table teardown
> in munmap(), right? The existing mmap_write_lock() guards against
> concurrent munmap() (so in the old code we are guaranteed to either
> see a normal VMA or not see the page tables anymore), but
> mmap_read_lock() only guards against the part of munmap() up to the
> mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
> (including free_pgtables()) happens after that.

Excellent point, thank you.  Don't let anyone overhear us, but I have
to confess to you that that mmap_write_downgrade() has never impinged
forcefully enough on my consciousness: it's still my habit to think of
mmap_lock as exclusive over free_pgtables(), and I've not encountered
this bug in my testing.

Right, I'll gladly incorporate your collapse_pte_mapped_thp()
rearrangement below.  And am reassured to realize that by removing
mmap_lock dependence elsewhere, I won't have got it wrong in other places.

Thanks,
Hugh

> 
> So we can now enter collapse_pte_mapped_thp() and race with concurrent
> free_pgtables() such that a PUD disappears under us while we're
> walking it or something like that:
> 
> 
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>           bool install_pmd)
> {
>   struct mmu_notifier_range range;
>   unsigned long haddr = addr & HPAGE_PMD_MASK;
>   struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
>   struct page *hpage;
>   pte_t *start_pte, *pte;
>   pmd_t *pmd, pgt_pmd;
>   spinlock_t *pml, *ptl;
>   int nr_ptes = 0, result = SCAN_FAIL;
>   int i;
> 
>   mmap_assert_locked(mm);
> 
>   /* Fast check before locking page if already PMD-mapped */
>   result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
>   if (result == SCAN_PMD_MAPPED)
>     return result;
> 
>   if (!vma || !vma->vm_file || // <<< bailout happens too late
>       !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>     return SCAN_VMA_CHECK;
> 
> 
> I guess the right fix here is to make sure that at least the basic VMA
> revalidation stuff (making sure there still is a VMA covering this
> range) happens before find_pmd_or_thp_or_none()? Like:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 301c0e54a2ef..5db365587556 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> 
>          mmap_assert_locked(mm);
> 
> +        if (!vma || !vma->vm_file ||
> +            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +                return SCAN_VMA_CHECK;
> +
>          /* Fast check before locking page if already PMD-mapped */
>          result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>          if (result == SCAN_PMD_MAPPED)
>                  return result;
> 
> -        if (!vma || !vma->vm_file ||
> -            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -                return SCAN_VMA_CHECK;
> -
>          /*
>           * If we are here, we've succeeded in replacing all the native pages
>           * in the page cache with a single hugepage. If a mm were to fault-in
> 

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
@ 2023-06-02  5:11       ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  5:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Peter Xu, linux-kernel, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Hugh Dickins, Russell King, Matthew Wilcox,
	Steven Price, Christoph Hellwig, Jason Gunthorpe,
	Aneesh Kumar K.V, Axel Rasmussen, Christian Borntraeger,
	Thomas Hellstrom, Ralph Campbell, Pasha Tatashin

[-- Attachment #1: Type: text/plain, Size: 3917 bytes --]

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:26 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> I think there's a weirdness in the existing code, and this change
> probably turns that into a UAF bug.
> 
> collapse_pte_mapped_thp() can be called on an address that might not
> be associated with a VMA anymore, and after this change, the page
> tables for that address might be in the middle of page table teardown
> in munmap(), right? The existing mmap_write_lock() guards against
> concurrent munmap() (so in the old code we are guaranteed to either
> see a normal VMA or not see the page tables anymore), but
> mmap_read_lock() only guards against the part of munmap() up to the
> mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
> (including free_pgtables()) happens after that.

Excellent point, thank you.  Don't let anyone overhear us, but I have
to confess to you that that mmap_write_downgrade() has never impinged
forcefully enough on my consciousness: it's still my habit to think of
mmap_lock as exclusive over free_pgtables(), and I've not encountered
this bug in my testing.

Right, I'll gladly incorporate your collapse_pte_mapped_thp()
rearrangement below.  And am reassured to realize that by removing
mmap_lock dependence elsewhere, I won't have got it wrong in other places.

Thanks,
Hugh
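
For anyone following along, a rough sketch of the interleaving described
above (illustration only, not code from the series):

/*
 *   munmap() side                         collapse_pte_mapped_thp() side
 *   -------------                         ------------------------------
 *   mmap_write_lock()
 *   detach the VMAs
 *   mmap_write_downgrade()
 *                                         mmap_read_lock() succeeds
 *                                         vma_lookup() returns NULL
 *   unmap_region()
 *     free_pgtables()  <-- page tables    find_pmd_or_thp_or_none()
 *                          freed here     walks those tables: use-after-free
 *
 * Hence the VMA revalidation must come before any page table walk.
 */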

> 
> So we can now enter collapse_pte_mapped_thp() and race with concurrent
> free_pgtables() such that a PUD disappears under us while we're
> walking it or something like that:
> 
> 
> int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>           bool install_pmd)
> {
>   struct mmu_notifier_range range;
>   unsigned long haddr = addr & HPAGE_PMD_MASK;
>   struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
>   struct page *hpage;
>   pte_t *start_pte, *pte;
>   pmd_t *pmd, pgt_pmd;
>   spinlock_t *pml, *ptl;
>   int nr_ptes = 0, result = SCAN_FAIL;
>   int i;
> 
>   mmap_assert_locked(mm);
> 
>   /* Fast check before locking page if already PMD-mapped */
>   result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
>   if (result == SCAN_PMD_MAPPED)
>     return result;
> 
>   if (!vma || !vma->vm_file || // <<< bailout happens too late
>       !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
>     return SCAN_VMA_CHECK;
> 
> 
> I guess the right fix here is to make sure that at least the basic VMA
> revalidation stuff (making sure there still is a VMA covering this
> range) happens before find_pmd_or_thp_or_none()? Like:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 301c0e54a2ef..5db365587556 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> 
>          mmap_assert_locked(mm);
> 
> +        if (!vma || !vma->vm_file ||
> +            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +                return SCAN_VMA_CHECK;
> +
>          /* Fast check before locking page if already PMD-mapped */
>          result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>          if (result == SCAN_PMD_MAPPED)
>                  return result;
> 
> -        if (!vma || !vma->vm_file ||
> -            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -                return SCAN_VMA_CHECK;
> -
>          /*
>           * If we are here, we've succeeded in replacing all the native pages
>           * in the page cache with a single hugepage. If a mm were to fault-in
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map()
       [not found]   ` <ZHeg3oRljRn6wlLX@ziepe.ca>
  2023-06-02  5:35       ` Hugh Dickins
@ 2023-06-02  5:35       ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  5:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Jann Horn, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jason Gunthorpe wrote:
> On Sun, May 28, 2023 at 11:16:16PM -0700, Hugh Dickins wrote:
> > There is a faint risk that __pte_offset_map(), on a 32-bit architecture
> > with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
> > a pmdval assembled from a pmd_low and a pmd_high which never belonged
> > together: their combination not pointing to a page table at all, perhaps
> > not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
> > 
> > Guard against that (on such configs) by local_irq_save() blocking TLB
> > flush between present updates, as linux/pgtable.h suggests.  It's only
> > needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
> > __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
> > lock, would just send it back to __pte_offset_map() again.
> 
> What about the other places calling pmdp_get_lockless ? It seems like
> this is quietly making it part of the API that the caller must hold
> the IPIs off.

No, I'm making no judgment of other places where pmdp_get_lockless() is
used: examination might show that some need more care, but I'll just
assume that each is taking as much care as it needs.

But here where I'm making changes, I do see that we need this extra care.
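
To make that concrete, the guard being discussed amounts to something like
the following minimal sketch (not the patch itself; the helper name is made
up, and the real code is conditional on the PAE-style configs that need it):

/*
 * On a config where pmd_t is wider than a machine word (e.g. x86-32 PAE),
 * holding IRQs off blocks the TLB flush which, as linux/pgtable.h suggests,
 * must happen between present updates: so the pmd_low and pmd_high read
 * here cannot come from two different present entries.
 */
static pmd_t pmdp_get_lockless_irqsafe(pmd_t *pmdp)
{
        unsigned long flags;
        pmd_t pmdval;

        local_irq_save(flags);
        pmdval = pmdp_get_lockless(pmdp);
        local_irq_restore(flags);
        return pmdval;
}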

> 
> And Jann had a note that this approach used by the lockless functions
> doesn't work anyhow:
> 
> https://lore.kernel.org/linux-mm/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/

Thanks a lot for the link: I don't know why, but I never saw that mail
thread at all before.  I have not fully digested it yet, to be honest:
MADV_DONTNEED, doesn't flush TLB yet, etc - I'll have to get into the
right frame of mind for that.

> 
> Though we never fixed it, AFAIK..

I'm certainly depending very much on pmdp_get_lockless(): and hoping to
find its case is easier to defend than at the ptep_get_lockless() level.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page
       [not found]   ` <ZHekpAKJ05cr/GLl@ziepe.ca>
  2023-06-02  6:03       ` Hugh Dickins
@ 2023-06-02  6:03       ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  6:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Jann Horn, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jason Gunthorpe wrote:
> On Sun, May 28, 2023 at 11:23:47PM -0700, Hugh Dickins wrote:
> > Add the generic pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This version
> > suits all those architectures which use an unfragmented page for one page
> > table (none of whose pte_free()s use the mm arg which was passed to it).
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > +	page = pgtable;
> > +	call_rcu(&page->rcu_head, pte_free_now);
> 
> People have told me that we can't use the rcu_head on the struct page
> backing page table blocks. I understood it was because PPC was using
> that memory for something else.

In the 05/12 thread, Matthew pointed out that powerpc (and a few others)
use the one struct page for multiple page tables, and the lack of
multiple rcu_heads means I've got that patch and 06/12 sparc and
07/12 s390 embarrassingly wrong (whereas this generic 08/12 is okay).

I believe I know the extra grossness needed for powerpc and sparc: I had
it already for powerpc, but fooled myself into thinking not yet needed.

But (I haven't quite got there yet) it looks like Gerald is pointing
out that s390 is using lru, which coincides with rcu_head: I already knew
s390 was the most difficult, but that will be another layer of difficulty.

I expect it was s390 which people warned you of.

> 
> I was hoping Matthew's folio conversion would help clarify this..

I doubt that: what we have for use today is pages, however they are
dressed up.

> 
> On the flip side, if we are able to use rcu_head here then we should
> use it everywhere and also use it mmu_gather.c instead of allocating
> memory and having the smp_call_function() fallback. This would fix it
> to be actual RCU.
> 
> There have been a few talks that it sure would be nice if the page
> tables were always freed via RCU and every arch just turns on
> CONFIG_MMU_GATHER_RCU_TABLE_FREE. It seems to me that patch 10 is kind
> of half doing that by making this one path always use RCU on all
> arches.
> 
> AFAIK the main reason it hasn't been done was the lack of a rcu_head..

I haven't paid attention to that part of the history, and won't be
competent to propagate this further, into MMU-Gather-World; but agree
that would be a satisfying conclusion.

Hugh
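
For reference, the generic 08/12 shape around the two quoted lines is
roughly the following; the callback body here is an assumption for
illustration, not the patch text:

static void pte_free_now(struct rcu_head *head)
{
        /* Assumed: recover the page and free it as a page table. */
        struct page *page = container_of(head, struct page, rcu_head);

        pgtable_pte_page_dtor(page);
        __free_page(page);
}

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
        struct page *page = pgtable;

        /* Safe only while one page backs exactly one page table,
         * so its single rcu_head is free for this use. */
        call_rcu(&page->rcu_head, pte_free_now);
}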

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-01 13:57         ` Gerald Schaefer
@ 2023-06-02  6:38           ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-02  6:38 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Matthew Wilcox, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, Vishal Moola,
	Vasily Gorbik, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Thu, 1 Jun 2023, Gerald Schaefer wrote:
> On Mon, 29 May 2023 07:36:40 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 29 May 2023, Matthew Wilcox wrote:
> > > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:  
> > > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > > +{
> > > > +	struct page *page;
> > > > +
> > > > +	page = virt_to_page(pgtable);
> > > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > > +}  
> > > 
> > > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > > tables sharing one 64kB page.  So if you have two page tables from the
> > > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > > and I cannot imagine things go well from that point.  
> > 
> > Oh yes, of course, thanks for catching that so quickly.
> > So my s390 and sparc implementations will be equally broken.
> > 
> > > 
> > > I have no idea how to solve this problem.  
> > 
> > I do: I'll have to go back to the more complicated implementation we
> > actually ran with on powerpc - I was thinking those complications just
> > related to deposit/withdraw matters, forgetting the one-rcu_head issue.
> > 
> > It uses large (0x10000) increments of the page refcount, avoiding
> > call_rcu() when already active.
> > 
> > It's not a complication I had wanted to explain or test for now,
> > but we shall have to.  Should apply equally well to sparc, but s390
> > more of a problem, since s390 already has its own refcount cleverness.
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().

Yes, I expect that it will be possible to mesh in with s390's cleverness
there; but I may not be clever enough to do so myself - it was easier to
get right by going my own way - except that the multiply-used rcu_head
soon showed that I'd not got it right at all :-(

> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

However, I believe you are right that it's worse.  I'm glad to hear
that you get confused by the struct page unions, me too, I preferred the
old pre-union days when we could see at a glance which fields overlaid.
(Perhaps I'm nostalgically exaggerating that "see at a glance" ease.)

But I think I remember the discussions when rcu_head, and compound_head
at lru.next, came in: with the agreement that rcu_head.next would at
least be 2-aligned to avoid PageTail - ah, it's even commented in the
fundamental include/linux/types.h.

Sigh.  I don't at this moment know what to do for s390:
it is frustrating to be held up by just the one architecture.
But big thanks to you, Gerald, for bringing this to light.

Hugh
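
For anyone else squinting at the unions, the overlap being described is
roughly this (fields heavily elided; see struct page in
include/linux/mm_types.h for the real layout):

/*
 *   struct page, first words (simplified):
 *
 *     word 0: flags
 *     word 1: lru.next  ==  rcu_head.next  ==  compound_head (tail pages)
 *     word 2: lru.prev  ==  rcu_head.func
 *
 * So a pagetable page that strings its free 2K fragments on page->lru
 * cannot also queue page->rcu_head without corrupting that list; and
 * rcu_head.next staying 2-aligned is what keeps it from being mistaken
 * for a compound_head, i.e. from making the page look like PageTail.
 */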

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page
  2023-06-02  6:03       ` Hugh Dickins
  (?)
@ 2023-06-02 12:15         ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-02 12:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Thu, Jun 01, 2023 at 11:03:11PM -0700, Hugh Dickins wrote:
> > I was hoping Matthew's folio conversion would help clarify this..
> 
> I doubt that: what we have for use today is pages, however they are
> dressed up.

I mean the part where Matthew is going and splitting the types and
making it much clearer and type-safe how the memory is laid out, e.g.
no more guessing whether the arch code is overlaying something else onto
the rcu_head.

Then the hope against hope is that after doing all this we can find
enough space for everything including the rcu heads..

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-05-29 14:02     ` Matthew Wilcox
  (?)
@ 2023-06-02 14:20       ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-02 14:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, David Hildenbrand, Suren Baghdasaryan,
	Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra,
	Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	call_rcu(&page->rcu_head, pte_free_now);
> > +}
> 
> This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> tables sharing one 64kB page.  So if you have two page tables from the
> same page being defer-freed simultaneously, you'll reuse the rcu_head
> and I cannot imagine things go well from that point.
> 
> I have no idea how to solve this problem.

Maybe power and s390 should allocate a side structure, sort of a
pre-memdesc thing to store enough extra data?

If we can get enough bytes then something like this would let a single
rcu head be shared to manage the free bits.

struct 64k_page {
    u8 free_pages;
    u8 pending_rcu_free_pages;
    struct rcu_head head;
}

free_sub_page(sub_id)
    if (!atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
         call_rcu(&64k_page->head)

rcu_func()
   64k_page->free_pages |= atomic_xchg(&64k_page->pending_rcu_free_pages, 0)

   if (64k_page->free_pages == all_ones)
      free_page(64k_page);

Jason
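
Fleshed out slightly, and with names that are pure invention rather than
existing kernel API, the side-structure idea might look like the sketch
below.  (Hugh indicates elsewhere in the thread that powerpc will instead
go back to large increments of the page refcount, so treat this only as an
illustration of the suggestion above.)

/* Hypothetical side structure, one per 64kB block of page tables. */
struct pt_block {
        unsigned long   freed;          /* fragments past their grace period */
        atomic_long_t   pending;        /* fragments waiting for call_rcu()  */
        struct rcu_head rcu;
        struct page     *page;
};

#define PT_ALL_FRAGMENTS ((1UL << 16) - 1)      /* 16 x 4k tables per 64kB */

static void pt_block_rcu_free(struct rcu_head *head)
{
        struct pt_block *b = container_of(head, struct pt_block, rcu);

        /* Collect whatever was queued before this grace period. */
        b->freed |= atomic_long_xchg(&b->pending, 0);
        if (b->freed == PT_ALL_FRAGMENTS) {
                __free_pages(b->page, get_order(SZ_64K));
                kfree(b);
        }
}

static void pt_block_free_fragment(struct pt_block *b, unsigned int sub_id)
{
        /* Only the transition from "nothing pending" queues the one rcu_head. */
        if (!atomic_long_fetch_or(1UL << sub_id, &b->pending))
                call_rcu(&b->rcu, pt_block_rcu_free);
}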

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  2023-06-02  2:50       ` Hugh Dickins
  (?)
@ 2023-06-02 14:21         ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-06-02 14:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Fri, Jun 2, 2023 at 4:50 AM Hugh Dickins <hughd@google.com> wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:15 AM Hugh Dickins <hughd@google.com> wrote:
> > > Before putting them to use (several commits later), add rcu_read_lock()
> > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> > > separate commit, since it risks exposing imbalances: prior commits have
> > > fixed all the known imbalances, but we may find some have been missed.
> > [...]
> > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > > index c7ab18a5fb77..674671835631 100644
> > > --- a/mm/pgtable-generic.c
> > > +++ b/mm/pgtable-generic.c
> > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> > >  {
> > >         pmd_t pmdval;
> > >
> > > -       /* rcu_read_lock() to be added later */
> > > +       rcu_read_lock();
> > >         pmdval = pmdp_get_lockless(pmd);
> > >         if (pmdvalp)
> > >                 *pmdvalp = pmdval;
> >
> > It might be a good idea to document that this series assumes that the
> > first argument to __pte_offset_map() is a pointer into a second-level
> > page table (and not a local copy of the entry) unless the containing
> > VMA is known to not be THP-eligible or the page table is detached from
> > the page table hierarchy or something like that. Currently a bunch of
> > places pass references to local copies of the entry, and while I think
> > all of these are fine, it would probably be good to at least document
> > why these are allowed to do it while other places aren't.
>
> Thanks Jann: but I have to guess that here you are showing awareness of
> an important issue that I'm simply ignorant of.
>
> I have been haunted by a dim recollection that there is one architecture
> (arm-32?) which is fussy about the placement of the pmdval being examined
> (deduces info missing from the arch-independent interface, by following
> up the address?), but I couldn't track it down when I tried.
>
> Please tell me more; or better, don't spend your time explaining to me,
> but please just send a link to a good reference on the issue.  I'll be
> unable to document what you ask there, without educating myself first.

Sorry, I think I was somewhat confused about what was going on when I
wrote that message.

After this series, __pte_offset_map() looks as follows, with added
comments describing my understanding of the semantics:

// `pmd` points to one of:
// case 1: a pmd_t stored outside a page table,
//         referencing a page table detached by the caller
// case 2: a pmd_t stored outside a page table, which the caller copied
//         from a page table in an RCU-critical section that extends
//         until at least the end of this function
// case 3: a pmd_t stored inside a page table
pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
        unsigned long __maybe_unused flags;
        pmd_t pmdval;

        // begin an RCU section; this is needed for case 3
        rcu_read_lock();
        config_might_irq_save(flags);
        // read the pmd_t.
        // if the pmd_t references a page table, this page table can not
        // go away because:
        //  - in case 1, the caller is the main owner of the page table
        //  - in case 2, because the caller
        //    started an RCU read-side critical section before the caller
        //    read the original pmd_t. (This pmdp_get_lockless() is just
        //    reading a copied pmd_t off the stack.)
        //  - in case 3, because we started an RCU section above before
        //    reading the pmd_t out of the page table here
        pmdval = pmdp_get_lockless(pmd);
        config_might_irq_restore(flags);

        if (pmdvalp)
                *pmdvalp = pmdval;
        if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
                goto nomap;
        if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
                goto nomap;
        if (unlikely(pmd_bad(pmdval))) {
                pmd_clear_bad(pmd);
                goto nomap;
        }
        return __pte_map(&pmdval, addr);
nomap:
        rcu_read_unlock();
        return NULL;
}

case 1 is what happens in __page_table_check_pte_clear_range(),
__split_huge_zero_page_pmd() and __split_huge_pmd_locked().
case 2 happens in lockless page table traversal (gup_pte_range() and
perf_get_pgtable_size()).
case 3 is normal page table traversal under mmap lock or mapping lock.

I think having a function like this that can run in three different
contexts in which it is protected in three different ways is somewhat
hard to understand without comments. Though maybe I'm thinking about
it the wrong way?

Basically my point is: __pte_offset_map() normally requires that the
pmd argument points into a page table so that the rcu_read_lock() can
provide protection starting from the time the pmd_t is read from a
page table. The exception are cases where the caller has taken its own
precautions to ensure that the referenced page table can not have been
freed.

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 00/12] mm: free retracted page table by RCU
  2023-06-02  4:37     ` Hugh Dickins
@ 2023-06-02 15:26       ` Jann Horn
  -1 siblings, 0 replies; 158+ messages in thread
From: Jann Horn @ 2023-06-02 15:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins <hughd@google.com> wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:11 AM Hugh Dickins <hughd@google.com> wrote:
> > > Here is the third series of patches to mm (and a few architectures), based
> > > on v6.4-rc3 with the preceding two series applied: in which khugepaged
> > > takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
> >
> > To clarify: Part of the design here is that when you look up a user
> > page table with pte_offset_map_nolock() or pte_offset_map() without
> > holding mmap_lock in write mode, and you later lock the page table
> > yourself, you don't know whether you actually have the real page table
> > or a detached table that is currently in its RCU grace period, right?
>
> Right.  (And I'd rather not assume anything of mmap_lock, but there are
> one or two or three places that may still do so.)
>
> > And detached tables are supposed to consist of only zeroed entries,
> > and we assume that no relevant codepath will do anything bad if one of
> > these functions spuriously returns a pointer to a page table full of
> > zeroed entries?
>
> (Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

I was not aware, thanks. I only knew that on Intel's Knights Landing
CPUs, the A/D bits are ignored by pte_none() due to some erratum.

> If someone is using pte_offset_map() without lock, they must be prepared
> to accept page-table-like changes.  The limits of pte_offset_map_nolock()
> with later spin_lock(ptl): I'm still exploring: there's certainly an
> argument that one ought to do a pmd_same() check before proceeding,
> but I don't think anywhere needs that at present.
>
> Whether the page table has to be full of zeroed entries when detached:
> I believe it is always like that at present (by the end of the series,
> when the collapse_pte_offset_map() oddity is fixed), but whether it needs
> to be so I'm not sure.  Quite likely it will need to be; but I'm open to
> the possibility that all it needs is to be still a page table, with
> perhaps new entries from a new usage in it.

My understanding is that at least handle_pte_fault(), the way it is
currently written, would do bad things in that case:

// assume we enter with mmap_lock in read mode,
// for a write fault on a shared writable VMA without a page_mkwrite handler
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
  pte_t entry;

  if (unlikely(pmd_none(*vmf->pmd))) {
    [ not executed ]
  } else {
    /*
     * A regular pmd is established and it can't morph into a huge
     * pmd by anon khugepaged, since that takes mmap_lock in write
     * mode; but shmem or file collapse to THP could still morph
     * it into a huge pmd: just retry later if so.
     */
    vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
             vmf->address, &vmf->ptl);
    if (unlikely(!vmf->pte))
      [not executed]
    [assume that at this point, a concurrent THP collapse operation
     removes the page table, and the page table has now been reused
     and contains a read-only PTE]
    // this reads page table contents protected solely by RCU
    vmf->orig_pte = ptep_get_lockless(vmf->pte);
    vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

    if (pte_none(vmf->orig_pte)) {
      pte_unmap(vmf->pte);
      vmf->pte = NULL;
    }
  }

  if (!vmf->pte)
    [not executed]

  if (!pte_present(vmf->orig_pte))
    [not executed]

  if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
    [not executed]

  spin_lock(vmf->ptl);
  entry = vmf->orig_pte;
  if (unlikely(!pte_same(*vmf->pte, entry))) {
    [not executed]
  }
  if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
    if (!pte_write(entry))
      // This will go into wp_page_shared(),
      // which will call wp_page_reuse(),
      // which will upgrade the page to writable
      return do_wp_page(vmf);
    [ not executed ]
  }
  [ not executed ]
}

That looks like we could end up racing such that a read-only PTE in
the reused page table gets upgraded to writable, which would probably
be a security bug.

But I guess if you added bailout checks to every page table lock
operation, it'd be a different story, maybe?

> The most obvious vital thing (in the split ptlock case) is that it
> remains a struct page with a usable ptl spinlock embedded in it.
>
> The question becomes more urgent when/if extending to replacing the
> pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> wants to deposit the page table for later use even in the shmem/file
> case (and all arches in the anon case): I did work out the details once
> before, but I'm not sure whether I would still agree with myself; and was
> glad to leave replacement out of this series, to revisit some time later.
>
> >
> > So in particular, in handle_pte_fault() we can reach the "if
> > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > detached zeroed page table, but we're okay with that because in that
> > case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
> > which implies !pte_same(*vmf->pte, entry) , which means we'll bail
> > out?
>
> There is no current (even at end of series) circumstance in which we
> could be pointing to a detached page table there; but yes, I want to
> allow for that, and yes I agree with your analysis.

Hmm, what am I missing here?

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
  pte_t entry;

  if (unlikely(pmd_none(*vmf->pmd))) {
    [not executed]
  } else {
    /*
     * A regular pmd is established and it can't morph into a huge
     * pmd by anon khugepaged, since that takes mmap_lock in write
     * mode; but shmem or file collapse to THP could still morph
     * it into a huge pmd: just retry later if so.
     */
    vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
             vmf->address, &vmf->ptl);
    if (unlikely(!vmf->pte))
      [not executed]
    // this reads a present readonly PTE
    vmf->orig_pte = ptep_get_lockless(vmf->pte);
    vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

    if (pte_none(vmf->orig_pte)) {
      [not executed]
    }
  }

  [at this point, a concurrent THP collapse operation detaches the page table]
  // vmf->pte now points into a detached page table

  if (!vmf->pte)
    [not executed]

  if (!pte_present(vmf->orig_pte))
    [not executed]

  if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
    [not executed]

  spin_lock(vmf->ptl);
  entry = vmf->orig_pte;
  // vmf->pte still points into a detached page table
  if (unlikely(!pte_same(*vmf->pte, entry))) {
    update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
    goto unlock;
  }
  [...]
}

> But with the
> interesting unanswered question for the future, of what if the same
> value could be found there: would that imply it's safe to proceed,
> or would some further prevention be needed?

That would then hand the pointer to the detached page table to
functions like do_wp_page(), which I think will do bad things (see
above) if they are called either on a page table that has been reused
in a different VMA with different protection flags (which could, for
example, lead to pages becoming writable that should not be writable,
or similar) or on a page table that is no longer in use (because they
would assume that PTEs referencing pages are refcounted when they
actually aren't).

> > If that's the intent, it might be good to add some comments, because
> > at least to me that's not very obvious.
>
> That's a very fair request; but I shall have difficulty deciding where
> to place such comments.  I shall have to try, then you redirect me.
>
> And I think we approach this in opposite ways: my nature is to put some
> infrastructure in place, and then look at it to see what we can get away
> with; whereas your nature is to define upfront what the possibilities are.
> We can expect some tussles!

Yeah. :P
One of my strongly-held beliefs is that it's important, when making
changes to code, to continuously ask oneself "If I had to explain the
rules by which this code operates - who has to take which locks, who
holds references to what, and so on - how complicated would those
rules be?", and if the answer turns into a series of exception cases,
that probably means there will be bugs, because someone will probably
lose track of one of those exceptions. So I would prefer it if we
could have some rule like "whenever you lock an L1 page table, you
must immediately recheck whether the page table is still referenced by
the L2 page table, unless you know that you have a stable page
reference for whatever reason", and then any code that operates on a
locked page table doesn't have to worry about whether the page table
might be detached.
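
A rough sketch of such a "lock, then recheck" helper (illustration
only, not code from this series: the helpers it calls are real, the
function itself is hypothetical):

/*
 * After taking the pte lock, recheck that the pmd entry still points
 * at this page table before trusting anything read through it.
 */
static bool example_lock_and_recheck(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long addr,
				     pte_t **ptep, spinlock_t **ptlp)
{
	pmd_t pmdval = pmdp_get_lockless(pmd);
	pte_t *pte = pte_offset_map_nolock(mm, pmd, addr, ptlp);

	if (!pte)
		return false;	/* no page table here: caller retries */
	spin_lock(*ptlp);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the page table was retracted or replaced: bail out */
		pte_unmap_unlock(pte, *ptlp);
		return false;
	}
	*ptep = pte;		/* pte points into the live page table */
	return true;
}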

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-02 14:20       ` Jason Gunthorpe
@ 2023-06-06  3:40         ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Hugh Dickins, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Jann Horn, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Fri, 2 Jun 2023, Jason Gunthorpe wrote:
> On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	call_rcu(&page->rcu_head, pte_free_now);
> > > +}
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
> > 
> > I have no idea how to solve this problem.
> 
> Maybe power and s390 should allocate a side structure, sort of a
> pre-memdesc thing to store enough extra data?
> 
> If we can get enough bytes then something like this would let a single
> rcu head be shared to manage the free bits.
> 
> struct 64k_page {
>     u8 free_pages;
>     u8 pending_rcu_free_pages;
>     struct rcu_head head;
> }
> 
> free_sub_page(sub_id)
>     if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
>          call_rcu(&64k_page->head)
> 
> rcu_func()
>    64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)
> 
>    if (64k_pages->free_pages == all_ones)
>       free_pgea(64k_page);

Or simply allocate as many rcu_heads as page tables.

I have not thought through your suggestion above, because I'm against
asking s390, or any other architecture, to degrade its page table
implementation by demanding more memory, just for the sake of my patch
series.  In a future memdesc world it might turn out to be reasonable,
but not for this (if I can possibly avoid it).

Below is what I believe to be the correct powerpc patch (built but not
retested).  sparc I thought was going to be an equal problem, but turns
out not: I'll comment on 06/12.  And let's move s390 discussion to 07/12.

[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c     | 51 ++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
 		__free_page(page);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+	int refcount;
+
+	page = container_of(head, struct page, rcu_head);
+	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+				     &page->pt_frag_refcount);
+	if (refcount < PTE_FREE_DEFERRED) {
+		pte_fragment_free((unsigned long *)page_address(page), 0);
+		return;
+	}
+	/*
+	 * One page may be shared between PTE_FRAG_NR pagetables.
+	 * At least one more call to pte_free_defer() came in while we
+	 * were already deferring, so the free must be deferred again;
+	 * but just for one grace period, however many calls came in.
+	 */
+	while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+		refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+					     &page->pt_frag_refcount);
+	}
+	/* Remove that refcount of 1 left for fragment freeing above */
+	atomic_dec(&page->pt_frag_refcount);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	/*
+	 * One page may be shared between PTE_FRAG_NR pagetables: only queue
+	 * it once for freeing, but note whenever the free must be deferred.
+	 *
+	 * (This would be much simpler if the struct page had an rcu_head for
+	 * each fragment, or if we could allocate a separate array for that.)
+	 *
+	 * Convert our refcount of 1 to a refcount of PTE_FREE_DEFERRED, and
+	 * proceed to call_rcu() only when the rcu_head is not already in use.
+	 */
+	if (atomic_add_return(PTE_FREE_DEFERRED - 1, &page->pt_frag_refcount) <
+			      PTE_FREE_DEFERRED + PTE_FREE_DEFERRED)
+		call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3
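
A worked example of that pt_frag_refcount accounting (illustration
only, not part of the patch; it assumes the two fragments being freed
are the only ones still allocated from the page, and writes D for
PTE_FREE_DEFERRED):

	pt_frag_refcount == 2            two fragments in use
	pte_free_defer(A):  2 -> D+1     first deferral, A queues the rcu_head
	pte_free_defer(B):  D+1 -> 2D    rcu_head busy, B does not queue again
	1st pte_free_now(): 2D -> D+1 -> D   deferrals remain: re-queue
	2nd pte_free_now(): D -> 1 -> 0      pte_fragment_free() frees the page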


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page
  2023-05-29  6:21   ` Hugh Dickins
@ 2023-06-06  3:46     ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  3:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between the pte_t * pgtable_t and a struct page *.
The patch should be fine as posted (except its title is misleading).

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/sparc/include/asm/pgalloc_64.h |  4 ++++
>  arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
> index 7b5561d17ab1..caa7632be4c2 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
>  void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
>  void pte_free(struct mm_struct *mm, pgtable_t ptepage);
>  
> +/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
>  #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
>  
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..b7c6aa085ef6 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	__pte_free((pgtable_t)page_to_virt(page));
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}
> +
>  void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>  			  pmd_t *pmd)
>  {
> -- 
> 2.35.3
> 
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page
@ 2023-06-06  3:46     ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  3:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Peter Xu, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Russell King, Matthew Wilcox, Steven Price,
	Christoph Hellwig, Jason Gunthorpe, Aneesh Kumar K.V,
	Axel Rasmussen, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Anshuman Khandual,
	Heiko Ca rstens, Qi Zheng, Suren Baghdasaryan, linux-arm-kernel,
	SeongJae Park, Jann Horn, linux-mm, linuxppc-dev,
	Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, Minchan Kim,
	Mike Rapoport, Andrew Morton, Mel Gorman, David S. Miller,
	Zack Rusin, Mike Kravetz

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.
The patch should be fine as posted (except its title is misleading).

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/sparc/include/asm/pgalloc_64.h |  4 ++++
>  arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
> index 7b5561d17ab1..caa7632be4c2 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
>  void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
>  void pte_free(struct mm_struct *mm, pgtable_t ptepage);
>  
> +/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
>  #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
>  
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..b7c6aa085ef6 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	__pte_free((pgtable_t)page_to_virt(page));
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}
> +
>  void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>  			  pmd_t *pmd)
>  {
> -- 
> 2.35.3
> 
> 

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-05-29  6:22   ` Hugh Dickins
  (?)
@ 2023-06-06  5:11     ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  5:11 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
On Thu, 1 Jun 2023 15:57:51 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.
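
(From memory, and abbreviated rather than quoted exactly, the deposit
side in arch/s390/mm/pgtable.c does roughly this - the list_head lives
in the first two longs of the 2K fragment:)

	void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
					pgtable_t pgtable)
	{
		struct list_head *lh = (struct list_head *) pgtable;

		assert_spin_locked(pmd_lockptr(mm, pmdp));

		/* FIFO */
		if (!pmd_huge_pte(mm, pmdp))
			INIT_LIST_HEAD(lh);
		else
			list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
		pmd_huge_pte(mm, pmdp) = pgtable;
	}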

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogether - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+	struct list_head *listed;
 	unsigned long *table;
 	struct page *page;
 	unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
 		table = NULL;
 		spin_lock_bh(&mm->context.lock);
 		if (!list_empty(&mm->context.pgtable_list)) {
-			page = list_first_entry(&mm->context.pgtable_list,
-						struct page, lru);
+			listed = mm->context.pgtable_list.next;
+			page = virt_to_page(listed);
 			mask = atomic_read(&page->_refcount) >> 24;
 			/*
 			 * The pending removal bits must also be checked.
@@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
 				bit = mask & 1;		/* =1 -> second 2K */
 				if (bit)
 					table += PTRS_PER_PTE;
+				BUG_ON(table != (unsigned long *)listed);
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del(listed);
+				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
+				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
 		/* Return the first 2K fragment of the page */
 		atomic_xor_bits(&page->_refcount, 0x01U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
+		listed = (struct list_head *)(table + PTRS_PER_PTE);
 		spin_lock_bh(&mm->context.lock);
-		list_add(&page->lru, &mm->context.pgtable_list);
+		list_add(listed, &mm->context.pgtable_list);
 		spin_unlock_bh(&mm->context.lock);
 	}
 	return table;
@@ -310,6 +315,7 @@ static void page_table_release_check(str
 
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
+	struct list_head *listed;
 	unsigned int mask, bit, half;
 	struct page *page;
 
@@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
-			list_add(&page->lru, &mm->context.pgtable_list);
-		else
-			list_del(&page->lru);
+		if (mask & 0x03U) {
+			listed = (struct list_head *)table;
+			list_add(listed, &mm->context.pgtable_list);
+		} else {
+			/*
+			 * Get address of the other page table sharing the page.
+			 * There are sure to be MUCH better ways to do all this!
+			 * But I'm rushing, while trying to keep to the obvious.
+			 */
+			listed = (struct list_head *)(table + PTRS_PER_PTE);
+			if (virt_to_page(listed) != page) {
+				/* sizeof(*listed) is twice sizeof(*table) */
+				listed -= PTRS_PER_PTE;
+			}
+			list_del(listed);
+			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
+			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
@@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 			 unsigned long vmaddr)
 {
+	struct list_head *listed;
 	struct mm_struct *mm;
 	struct page *page;
 	unsigned int bit, mask;
@@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
-		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	if (mask & 0x03U) {
+		listed = (struct list_head *)table;
+		list_add_tail(listed, &mm->context.pgtable_list);
+	} else {
+		/*
+		 * Get address of the other page table sharing the page.
+		 * There are sure to be MUCH better ways to do all this!
+		 * But I'm rushing, and trying to keep to the obvious.
+		 */
+		listed = (struct list_head *)(table + PTRS_PER_PTE);
+		if (virt_to_page(listed) != page) {
+			/* sizeof(*listed) is twice sizeof(*table) */
+			listed -= PTRS_PER_PTE;
+		}
+		list_del(listed);
+		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
+		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-05-31 15:34     ` Jann Horn
  (?)
@ 2023-06-06  6:18       ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  6:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@google.com> wrote:
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
...
> > +                * Note that vma->anon_vma check is racy: it can be set after
> > +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +                * prevented establishing new ptes of the page. So we are safe
> > +                * to remove page table below, without even checking it's empty.
> 
> This "we are safe to remove page table below, without even checking
> it's empty" assumes that the only way to create new anonymous PTEs is
> to use existing file PTEs, right? What about private shmem VMAs that
> are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> looking at the mapping and its pages (except for checking that the
> insertion point is before end-of-file), protected only by mmap_lock
> (shared) and pte_offset_map_lock().

Right, from your comments and Peter's, thank you both, I can see that
userfaultfd breaks the usual assumptions here: so I'm putting an
		if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
check in once we've got the ptlock; with a comment above it to point
the blame at uffd, though I gave up on describing all the detail.
And deleted this earlier "we are safe" paragraph.
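
To make the placement concrete, a sketch along these lines, inside the
vma loop, using the names from the posted patch (the unlock bookkeeping
here is illustrative, not the final v2 code):

		pml = pmd_lock(mm, pmd);
		ptl = pte_lockptr(mm, pmd);
		if (ptl != pml)
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
		/*
		 * uffd (e.g. UFFDIO_COPY into a VM_UFFD_MISSING private
		 * shmem vma) can install ptes without the locks relied on
		 * above, so do not retract this page table.
		 */
		if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
			if (ptl != pml)
				spin_unlock(ptl);
			spin_unlock(pml);
			continue;
		}
		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);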

You did suggest, in another mail, that perhaps there should be a scan
checking all pte_none() when we get the ptlock.  I wasn't keen on yet
another debug scan for bugs and didn't add that, thinking I was going
to add a patch on the end to do so in page_table_check_pte_clear_range().

But when I came to write that patch, found that I'd been misled by its
name: it's about checking or adjusting some accounting, not really a
suitable place to check for pte_none() at all; so just scrapped it.

...
> > -                       collapse_and_free_pmd(mm, vma, addr, pmd);
> 
> The old code called collapse_and_free_pmd(), which involves MMU
> notifier invocation...

...
> > +               pml = pmd_lock(mm, pmd);
> > +               ptl = pte_lockptr(mm, pmd);
> > +               if (ptl != pml)
> > +                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 
> ... while the new code only does pmdp_collapse_flush(), which clears
> the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> notifiers. My understanding is that that's problematic - maybe (?) it
> is sort of okay with regards to classic MMU notifier users like KVM,
> but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> consumes the normal page tables?

Right, I intentionally left out the MMU notifier invocation, knowing
that we have already done an MMU notifier invocation when unmapping
any PTEs which were mapped: it was necessary for collapse_and_free_pmd()
in the collapse_pte_mapped_thp() case, but there was no notifier in this
case for many years, and I was glad to be rid of it.

However, I now see that you were adding it intentionally even for this
case in your f268f6cf875f; and from later comments in this thread, it
looks like there is still uncertainty about whether it is needed here,
but safer to assume that it is needed: I'll add it back.
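
Roughly, the bracket that goes back in looks like this (argument order
of mmu_notifier_range_init() is from memory of the v6.4-era API, so
treat this as a sketch rather than the actual diff):

	struct mmu_notifier_range range;

	/*
	 * range_init()/..._start() before the pmd/pte spinlocks are taken,
	 * since notifier callbacks are allowed to sleep; ..._end() after
	 * they have been dropped.
	 */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				addr, addr + HPAGE_PMD_SIZE);
	mmu_notifier_invalidate_range_start(&range);
	/* ... pmd_lock(), pte lock, anon_vma/uffd checks as above ... */
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	/* ... unlock ... */
	mmu_notifier_invalidate_range_end(&range);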

> 
> (FWIW, last I looked, there also seemed to be some other issues with
> MMU notifier usage wrt IOMMUv2, see the thread
> <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 00/12] mm: free retracted page table by RCU
  2023-06-02 15:26       ` Jann Horn
  (?)
@ 2023-06-06  6:28         ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-06  6:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Fri, 2 Jun 2023, Jann Horn wrote:
> On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins <hughd@google.com> wrote:
> 
> > The most obvious vital thing (in the split ptlock case) is that it
> > remains a struct page with a usable ptl spinlock embedded in it.
> >
> > The question becomes more urgent when/if extending to replacing the
> > pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> > wants to deposit the page table for later use even in the shmem/file
> > case (and all arches in the anon case): I did work out the details once
> > before, but I'm not sure whether I would still agree with myself; and was
> > glad to leave replacement out of this series, to revisit some time later.
> >
> > >
> > > So in particular, in handle_pte_fault() we can reach the "if
> > > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > > detached zeroed page table, but we're okay with that because in that
> > > case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
> > > which implies !pte_same(*vmf->pte, entry) , which means we'll bail
> > > out?
> >
> > There is no current (even at end of series) circumstance in which we
> > could be pointing to a detached page table there; but yes, I want to
> > allow for that, and yes I agree with your analysis.
> 
> Hmm, what am I missing here?

I spent quite a while trying to reconstruct what I had been thinking,
what meaning of "detached" or "there" I had in mind when I asserted so
confidently "There is no current (even at end of series) circumstance
in which we could be pointing to a detached page table there".

But had to give up and get on with more useful work.
Of course you are right, and that is what this series is about.

Hugh

> 
> static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> {
>   pte_t entry;
> 
>   if (unlikely(pmd_none(*vmf->pmd))) {
>     [not executed]
>   } else {
>     /*
>      * A regular pmd is established and it can't morph into a huge
>      * pmd by anon khugepaged, since that takes mmap_lock in write
>      * mode; but shmem or file collapse to THP could still morph
>      * it into a huge pmd: just retry later if so.
>      */
>     vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>              vmf->address, &vmf->ptl);
>     if (unlikely(!vmf->pte))
>       [not executed]
>     // this reads a present readonly PTE
>     vmf->orig_pte = ptep_get_lockless(vmf->pte);
>     vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
> 
>     if (pte_none(vmf->orig_pte)) {
>       [not executed]
>     }
>   }
> 
>   [at this point, a concurrent THP collapse operation detaches the page table]
>   // vmf->pte now points into a detached page table
> 
>   if (!vmf->pte)
>     [not executed]
> 
>   if (!pte_present(vmf->orig_pte))
>     [not executed]
> 
>   if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>     [not executed]
> 
>   spin_lock(vmf->ptl);
>   entry = vmf->orig_pte;
>   // vmf->pte still points into a detached page table
>   if (unlikely(!pte_same(*vmf->pte, entry))) {
>     update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>     goto unlock;
>   }
>   [...]
> }
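
(Spelling out the interleaving that the walkthrough above describes, as
commentary rather than a patch:)

	fault path                          khugepaged / collapse
	----------                          ---------------------
	orig_pte = ptep_get_lockless()
	    (sees a present pte)
	                                    ptes cleared as pages are unmapped
	                                    pmdp_collapse_flush() detaches table
	                                    pte_free_defer(): freed after RCU grace period
	spin_lock(vmf->ptl)
	    (ptl still usable: the detached table is not yet freed)
	*vmf->pte rereads as pte_none()
	!pte_same(*vmf->pte, orig_pte)      -> goto unlock, nothing installed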

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-06  3:40         ` Hugh Dickins
@ 2023-06-06 18:23           ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-06 18:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, David Hildenbrand, Suren Baghdasaryan,
	Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra,
	Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:

> diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> index 20652daa1d7e..e4f58c5fc2ac 100644
> --- a/arch/powerpc/mm/pgtable-frag.c
> +++ b/arch/powerpc/mm/pgtable-frag.c
> @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
>  		__free_page(page);
>  	}
>  }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> +
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +	int refcount;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> +				     &page->pt_frag_refcount);
> +	if (refcount < PTE_FREE_DEFERRED) {
> +		pte_fragment_free((unsigned long *)page_address(page), 0);
> +		return;
> +	}

From what I can tell power doesn't recycle the sub fragment into any
kind of free list. It just waits for the last fragment to be unused
and then frees the whole page.

So why not simply go into pte_fragment_free() and do the call_rcu directly:

	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
		if (!kernel)
			pgtable_pte_page_dtor(page);
		call_rcu(&page->rcu_head, free_page_rcu)

?

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread
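
For readers skimming the thread, here is Jason's suggestion written out as a
self-contained sketch.  The free_page_rcu() callback is not spelled out in
the mail, so the version below is a hypothetical stand-in, and the placement
of pgtable_pte_page_dtor() is refined in the replies further down.

	/* Hypothetical sketch of the suggestion above, not an actual patch */
	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		__free_page(page);
	}

	void pte_fragment_free(unsigned long *table, int kernel)
	{
		struct page *page = virt_to_page(table);

		BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
		if (atomic_dec_and_test(&page->pt_frag_refcount)) {
			if (!kernel)
				pgtable_pte_page_dtor(page);
			/* the page itself is only freed after the grace period */
			call_rcu(&page->rcu_head, free_page_rcu);
		}
	}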

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-06  5:11     ` Hugh Dickins
@ 2023-06-06 18:39       ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-06 18:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Vasily Gorbik, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:

> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

It is not RCU anymore if it writes to the page table itself before the
grace period, so this change seems to break the RCU behavior of
page_table_free_rcu(). The RCU sync is inside tlb_remove_table()
called after the stores.

Maybe something like an xarray on the mm to hold the frags?

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread
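
The general RCU rule behind this objection, as a minimal generic sketch
(deliberately not s390 code; the frag structure and function names here are
made up): lockless readers may still dereference the old memory until the
grace period ends, so any reuse of it, such as threading a free list through
it, belongs in the callback.

	struct frag {
		struct rcu_head rcu;
		unsigned long slots[512];	/* stand-in for a pagetable fragment */
	};

	static void frag_reclaim(struct rcu_head *head)
	{
		struct frag *f = container_of(head, struct frag, rcu);

		/* Only here, after the grace period, is it safe to overwrite
		 * the slots (e.g. link the fragment into a free list) or to
		 * hand the memory back to the allocator. */
		kfree(f);
	}

	static void frag_retire(struct frag *f)
	{
		/* Writing list pointers into f->slots[] at this point would
		 * break the scheme: readers under rcu_read_lock() could still
		 * load them as if they were valid entries. */
		call_rcu(&f->rcu, frag_reclaim);
	}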

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-06 18:23           ` Jason Gunthorpe
@ 2023-06-06 19:03             ` Peter Xu
  -1 siblings, 0 replies; 158+ messages in thread
From: Peter Xu @ 2023-06-06 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, Matthew Wilcox, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Jann Horn, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> 
> > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > index 20652daa1d7e..e4f58c5fc2ac 100644
> > --- a/arch/powerpc/mm/pgtable-frag.c
> > +++ b/arch/powerpc/mm/pgtable-frag.c
> > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> >  		__free_page(page);
> >  	}
> >  }
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > +
> > +static void pte_free_now(struct rcu_head *head)
> > +{
> > +	struct page *page;
> > +	int refcount;
> > +
> > +	page = container_of(head, struct page, rcu_head);
> > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > +				     &page->pt_frag_refcount);
> > +	if (refcount < PTE_FREE_DEFERRED) {
> > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > +		return;
> > +	}
> 
> From what I can tell power doesn't recycle the sub fragment into any
> kind of free list. It just waits for the last fragment to be unused
> and then frees the whole page.
> 
> So why not simply go into pte_fragment_free() and do the call_rcu directly:
> 
> 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> 		if (!kernel)
> 			pgtable_pte_page_dtor(page);
> 		call_rcu(&page->rcu_head, free_page_rcu)

We need to be careful about the lock freed by pgtable_pte_page_dtor():
in Hugh's series, IIUC, the spinlock must stay valid for the RCU section
alongside the page itself.  So even if we go that way, we'll also need to
RCU-defer the pgtable_pte_page_dtor() call when needed.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-06 19:03             ` Peter Xu
@ 2023-06-06 19:08               ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-06 19:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Qi Zheng, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Hugh Dickins, Russell King, Matthew Wilcox,
	Steven Price, Christoph Hellwig, Aneesh Kumar K.V,
	Axel Rasmussen, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Anshuman Khandual,
	Heiko Carstens, Suren Baghdasaryan, linux-arm-kernel,
	SeongJae Park, Jann Horn, linux-mm, linuxppc-dev,
	Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, Minchan Kim,
	Mike Rapoport, Andrew Morton, Mel Gorman, David S. Miller,
	Zack Rusin, Mike Kravetz

On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> > 
> > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > >  		__free_page(page);
> > >  	}
> > >  }
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > > +
> > > +static void pte_free_now(struct rcu_head *head)
> > > +{
> > > +	struct page *page;
> > > +	int refcount;
> > > +
> > > +	page = container_of(head, struct page, rcu_head);
> > > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > +				     &page->pt_frag_refcount);
> > > +	if (refcount < PTE_FREE_DEFERRED) {
> > > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > > +		return;
> > > +	}
> > 
> > From what I can tell power doesn't recycle the sub fragment into any
> > kind of free list. It just waits for the last fragment to be unused
> > and then frees the whole page.
> > 
> > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> > 
> > 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > 		if (!kernel)
> > 			pgtable_pte_page_dtor(page);
> > 		call_rcu(&page->rcu_head, free_page_rcu)
> 
> We need to be careful about the lock freed by pgtable_pte_page_dtor():
> in Hugh's series, IIUC, the spinlock must stay valid for the RCU section
> alongside the page itself.  So even if we go that way, we'll also need to
> RCU-defer the pgtable_pte_page_dtor() call when needed.

Er yes, I botched that: the dtor and the free_page should be in the
rcu callback function.

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread
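
The ordering the exchange between Peter and Jason converges on, again only
as a sketch and assuming the user-pagetable case where the dtor applies:
both the dtor, which releases the split ptlock, and the page free move into
the RCU callback, so the ptl stays usable for the whole grace period.  The
pte_fragment_release() name below is made up for illustration.

	/* Sketch of the corrected ordering, not the final patch */
	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		pgtable_pte_page_dtor(page);	/* ptl valid until after the grace period */
		__free_page(page);
	}

	static void pte_fragment_release(struct page *page)
	{
		if (atomic_dec_and_test(&page->pt_frag_refcount))
			call_rcu(&page->rcu_head, free_page_rcu);
	}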

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-06  5:11     ` Hugh Dickins
@ 2023-06-06 19:40       ` Gerald Schaefer
  -1 siblings, 0 replies; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-06 19:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vasily Gorbik, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Sun, 28 May 2023, Hugh Dickins wrote:
> 
> > Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This version is more complicated than others: because page_table_free()
> > needs to know which fragment is being freed, and which mm to link it to.
> > 
> > page_table_free()'s fragment handling is clever, but I could too easily
> > break it: what's done here in pte_free_defer() and pte_free_now() might
> > be better integrated with page_table_free()'s cleverness, but not by me!
> > 
> > By the time that page_table_free() gets called via RCU, it's conceivable
> > that mm would already have been freed: so mmgrab() in pte_free_defer()
> > and mmdrop() in pte_free_now().  No, that is not a good context to call
> > mmdrop() from, so make mmdrop_async() public and use that.  
> 
> But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
> between multiple page tables is tricky: something I knew but had lost
> sight of.  So the powerpc and s390 patches were broken: powerpc fairly
> easily fixed, but s390 more painful.
> 
> In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
> On Thu, 1 Jun 2023 15:57:51 +0200
> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > 
> > Yes, we have 2 pagetables in one 4K page, which could result in same
> > rcu_head reuse. It might be possible to use the cleverness from our
> > page_table_free() function, e.g. to only do the call_rcu() once, for
> > the case where both 2K pagetable fragments become unused, similar to
> > how we decide when to actually call __free_page().
> > 
> > However, it might be much worse, and page->rcu_head from a pagetable
> > page cannot be used at all for s390, because we also use page->lru
> > to keep our list of free 2K pagetable fragments. I always get confused
> > by struct page unions, so not completely sure, but it seems to me that
> > page->rcu_head would overlay with page->lru, right?  
> 
> Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> I'm wrong) I think that s390 could use exactly the same technique for
> its list of free 2K pagetable fragments as it uses for its list of THP
> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

Nice idea, I think that could actually work, since we only need the empty
2K halves on the list. So it should be possible to store the list_head
inside those.

> 
> And while it could use third and fourth longs instead, I don't see any
> need for that: a deposited pagetable has been allocated, so would not
> be on the list of free fragments.

Correct, that should not interfere.

> 
> Below is one of the grossest patches I've ever posted: gross because
> it's a rushed attempt to see whether that is viable, while it would take
> me longer to understand all the s390 cleverness there (even though the
> PP AA commentary above page_table_alloc() is excellent).

Sounds fair, this is also some of the grossest code we have, which is
why Alexander added the comment. I guess we could use even more comments
inside the code, as it still confuses me more than it should.

Considering that, you did remarkably well. Your patch seems to work fine,
at least it survived some LTP mm tests. I will also add it to our CI runs,
to give it some more testing. Will report tomorrow if it breaks something.
See also below for some patch comments.

> 
> I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

cmma_init_nodat() should be disjoint, not only because it is __init,
but also because it explicitly skips pagetable pages, so it should
never touch page->lru of those.

Not very familiar with the gmap code, it does look disjoint, and we should
also use complete 4K pages for pagetables instead of 2K fragments there,
but Christian or Claudio should also have a look.

> 
> Gerald, s390 folk: would it be possible for you to give this
> a try, suggest corrections and improvements, and then I can make it
> a separate patch of the series; and work on avoiding concurrent use
> of the rcu_head by pagetable fragment buddies (ideally fit in with
> the scheme already there, maybe DD bits to go along with the PP AA).

It feels like it could be possible to not only avoid the double
rcu_head, but also avoid passing over the mm via page->pt_mm.
I.e. have pte_free_defer(), which has the mm, do all the checks and
list updates that page_table_free() does, for which we need the mm.
Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
and do call_rcu(pte_free_now) instead. The pte_free_now() could then
just do _dtor/__free_page similar to the generic version.

I must admit that I still have no good overview of the "big picture"
here, and especially if this approach would still fit in. Probably not,
as the to-be-freed pagetables would still be accessible, but not really
valid, if we added them back to the list, with list_heads inside them.
So maybe call_rcu() has to be done always, and not only for the case
where the whole 4K page becomes free, then we probably cannot do w/o
passing over the mm for proper list handling.

Ah, and they could also be re-used, once they are back on the list,
which will probably not go well. Is that what you meant with DD bits,
i.e. mark such fragments to prevent re-use? Smells a bit like the
"pending purge"

> 
> Why am I even asking you to move away from page->lru: why don't I
> thread s390's pte_free_defer() pagetables like THP's deposit does?
> I cannot, because the deferred pagetables have to remain accessible
> as valid pagetables, until the RCU grace period has elapsed - unless
> all the list pointers would appear as pte_none(), which I doubt.

Yes, only empty and invalid PTEs will appear as pte_none(), i.e. entries
that contain only 0x400.

Ok, I guess that also explains why the approach mentioned above,
to avoid passing over the mm and do the list handling already in
pte_free_defer(), will not be so easy or possible at all.

> 
> (That may limit our possibilities with the deposited pagetables in
> future: I can imagine them too wanting to remain accessible as valid
> pagetables.  But that's not needed by this series, and s390 only uses
> deposit/withdraw for anon THP; and some are hoping that we might be
> able to move away from deposit/withdraw altogther - though powerpc's
> special use will make that more difficult.)
> 
> Thanks!
> Hugh
> 
> --- 6.4-rc5/arch/s390/mm/pgalloc.c
> +++ linux/arch/s390/mm/pgalloc.c
> @@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
>   */
>  unsigned long *page_table_alloc(struct mm_struct *mm)
>  {
> +	struct list_head *listed;
>  	unsigned long *table;
>  	struct page *page;
>  	unsigned int mask, bit;
> @@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
>  		table = NULL;
>  		spin_lock_bh(&mm->context.lock);
>  		if (!list_empty(&mm->context.pgtable_list)) {
> -			page = list_first_entry(&mm->context.pgtable_list,
> -						struct page, lru);
> +			listed = mm->context.pgtable_list.next;
> +			page = virt_to_page(listed);
>  			mask = atomic_read(&page->_refcount) >> 24;
>  			/*
>  			 * The pending removal bits must also be checked.
> @@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
>  				bit = mask & 1;		/* =1 -> second 2K */
>  				if (bit)
>  					table += PTRS_PER_PTE;
> +				BUG_ON(table != (unsigned long *)listed);
>  				atomic_xor_bits(&page->_refcount,
>  							0x01U << (bit + 24));
> -				list_del(&page->lru);
> +				list_del(listed);
> +				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
> +				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
>  			}
>  		}
>  		spin_unlock_bh(&mm->context.lock);
> @@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
>  		/* Return the first 2K fragment of the page */
>  		atomic_xor_bits(&page->_refcount, 0x01U << 24);
>  		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
> +		listed = (struct list head *)(table + PTRS_PER_PTE);

Missing "_" in "struct list head"

>  		spin_lock_bh(&mm->context.lock);
> -		list_add(&page->lru, &mm->context.pgtable_list);
> +		list_add(listed, &mm->context.pgtable_list);
>  		spin_unlock_bh(&mm->context.lock);
>  	}
>  	return table;
> @@ -310,6 +315,7 @@ static void page_table_release_check(str
>  
>  void page_table_free(struct mm_struct *mm, unsigned long *table)
>  {
> +	struct list_head *listed;
>  	unsigned int mask, bit, half;
>  	struct page *page;

Not sure if "reverse X-mas" is still part of any style guidelines,
but I still am a big fan of that :-). Although the other code in that
file is also not consistently using it ...

>  
> @@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
>  		 */
>  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  		mask >>= 24;
> -		if (mask & 0x03U)
> -			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> -			list_del(&page->lru);
> +		if (mask & 0x03U) {
> +			listed = (struct list_head *)table;
> +			list_add(listed, &mm->context.pgtable_list);
> +		} else {
> +			/*
> +			 * Get address of the other page table sharing the page.
> +			 * There are sure to be MUCH better ways to do all this!
> +			 * But I'm rushing, while trying to keep to the obvious.
> +			 */
> +			listed = (struct list_head *)(table + PTRS_PER_PTE);
> +			if (virt_to_page(listed) != page) {
> +				/* sizeof(*listed) is twice sizeof(*table) */
> +				listed -= PTRS_PER_PTE;
> +			}

Bitwise XOR with 0x800 should do the trick here, i.e. give you the address
of the other 2K half, like this:

			listed = (struct list_head *)((unsigned long) table ^ 0x800UL);

> +			list_del(listed);
> +			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> +			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> +		}
>  		spin_unlock_bh(&mm->context.lock);
>  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>  		mask >>= 24;
> @@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
>  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  			 unsigned long vmaddr)
>  {
> +	struct list_head *listed;
>  	struct mm_struct *mm;
>  	struct page *page;
>  	unsigned int bit, mask;
> @@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> -		list_add_tail(&page->lru, &mm->context.pgtable_list);
> -	else
> -		list_del(&page->lru);
> +	if (mask & 0x03U) {
> +		listed = (struct list_head *)table;
> +		list_add_tail(listed, &mm->context.pgtable_list);
> +	} else {
> +		/*
> +		 * Get address of the other page table sharing the page.
> +		 * There are sure to be MUCH better ways to do all this!
> +		 * But I'm rushing, and trying to keep to the obvious.
> +		 */
> +		listed = (struct list_head *)(table + PTRS_PER_PTE);
> +		if (virt_to_page(listed) != page) {
> +			/* sizeof(*listed) is twice sizeof(*table) */
> +			listed -= PTRS_PER_PTE;
> +		}

Same as above.

> +		list_del(listed);
> +		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> +		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> +	}
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
>  	tlb_remove_table(tlb, table);

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

^ permalink raw reply	[flat|nested] 158+ messages in thread
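
To make Gerald's XOR suggestion above concrete: the two 2K fragments share
one 4K page and their addresses differ only in bit 11, so a small
hypothetical helper (not part of the patch) could replace the virt_to_page()
comparison and the conditional subtraction in both hunks.

	/* Hypothetical helper, not in the patch: buddy 2K fragment of @table */
	static inline struct list_head *pgtable_other_half(unsigned long *table)
	{
		return (struct list_head *)((unsigned long)table ^ 0x800UL);
	}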

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
@ 2023-06-06 19:40       ` Gerald Schaefer
  0 siblings, 0 replies; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-06 19:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Peter Xu, linux-kernel, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Russell King, Matthew Wilcox, Steven Price,
	Christoph Hellwig, Jason Gunthorpe, Aneesh Kumar K.V,
	Axel Rasmussen, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Vasily Gorbik, Anshuman Khandual,
	Heiko Carstens, Qi Zheng, Suren Baghdasaryan, linux-arm-kernel,
	SeongJae Park, Jann Horn, linux-mm, linuxppc-dev,
	Naoya Horiguchi, Zack Rusin, Minchan Kim, Kirill A. Shutemov,
	Andrew Morton, Mel Gorman, David S. Miller, Mike Rapoport,
	Mike Kravetz

On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Sun, 28 May 2023, Hugh Dickins wrote:
> 
> > Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This version is more complicated than others: because page_table_free()
> > needs to know which fragment is being freed, and which mm to link it to.
> > 
> > page_table_free()'s fragment handling is clever, but I could too easily
> > break it: what's done here in pte_free_defer() and pte_free_now() might
> > be better integrated with page_table_free()'s cleverness, but not by me!
> > 
> > By the time that page_table_free() gets called via RCU, it's conceivable
> > that mm would already have been freed: so mmgrab() in pte_free_defer()
> > and mmdrop() in pte_free_now().  No, that is not a good context to call
> > mmdrop() from, so make mmdrop_async() public and use that.  
> 
> But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
> between multiple page tables is tricky: something I knew but had lost
> sight of.  So the powerpc and s390 patches were broken: powerpc fairly
> easily fixed, but s390 more painful.
> 
> In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
> On Thu, 1 Jun 2023 15:57:51 +0200
> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > 
> > Yes, we have 2 pagetables in one 4K page, which could result in same
> > rcu_head reuse. It might be possible to use the cleverness from our
> > page_table_free() function, e.g. to only do the call_rcu() once, for
> > the case where both 2K pagetable fragments become unused, similar to
> > how we decide when to actually call __free_page().
> > 
> > However, it might be much worse, and page->rcu_head from a pagetable
> > page cannot be used at all for s390, because we also use page->lru
> > to keep our list of free 2K pagetable fragments. I always get confused
> > by struct page unions, so not completely sure, but it seems to me that
> > page->rcu_head would overlay with page->lru, right?  
> 
> Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> I'm wrong) I think that s390 could use exactly the same technique for
> its list of free 2K pagetable fragments as it uses for its list of THP
> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

Nice idea, I think that could actually work, since we only need the empty
2K halves on the list. So it should be possible to store the list_head
inside those.

> 
> And while it could use third and fourth longs instead, I don't see any
> need for that: a deposited pagetable has been allocated, so would not
> be on the list of free fragments.

Correct, that should not interfere.

> 
> Below is one of the grossest patches I've ever posted: gross because
> it's a rushed attempt to see whether that is viable, while it would take
> me longer to understand all the s390 cleverness there (even though the
> PP AA commentary above page_table_alloc() is excellent).

Sounds fair, this is also one of the grossest code we have, which is also
why Alexander added the comment. I guess we could need even more comments
inside the code, as it still confuses me more than it should.

Considering that, you did remarkably well. Your patch seems to work fine,
at least it survived some LTP mm tests. I will also add it to our CI runs,
to give it some more testing. Will report tomorrow when it broke something.
See also below for some patch comments.

> 
> I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

cmma_init_nodat() should be disjoint, not only because it is __init,
but also because it explicitly skips pagetable pages, so it should
never touch page->lru of those.

Not very familiar with the gmap code, it does look disjoint, and we should
also use complete 4K pages for pagetables instead of 2K fragments there,
but Christian or Claudio should also have a look.

> 
> Gerald, s390 folk: would it be possible for you to give this
> a try, suggest corrections and improvements, and then I can make it
> a separate patch of the series; and work on avoiding concurrent use
> of the rcu_head by pagetable fragment buddies (ideally fit in with
> the scheme already there, maybe DD bits to go along with the PP AA).

It feels like it could be possible to not only avoid the double
rcu_head, but also avoid passing over the mm via page->pt_mm.
I.e. have pte_free_defer(), which has the mm, do all the checks and
list updates that page_table_free() does, for which we need the mm.
Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
and do call_rcu(pte_free_now) instead. The pte_free_now() could then
just do _dtor/__free_page similar to the generic version.

I must admit that I still have no good overview of the "big picture"
here, and especially if this approach would still fit in. Probably not,
as the to-be-freed pagetables would still be accessible, but not really
valid, if we added them back to the list, with list_heads inside them.
So maybe call_rcu() has to be done always, and not only for the case
where the whole 4K page becomes free, then we probably cannot do w/o
passing over the mm for proper list handling.

Ah, and they could also be re-used, once they are back on the list,
which will probably not go well. Is that what you meant with DD bits,
i.e. mark such fragments to prevent re-use? Smells a bit like the
"pending purge"

> 
> Why am I even asking you to move away from page->lru: why don't I
> thread s390's pte_free_defer() pagetables like THP's deposit does?
> I cannot, because the deferred pagetables have to remain accessible
> as valid pagetables, until the RCU grace period has elapsed - unless
> all the list pointers would appear as pte_none(), which I doubt.

Yes, only empty and invalid PTEs will appear as pte_none(), i.e. entries
that contain only 0x400.

Ok, I guess that also explains why the approach mentioned above,
to avoid passing over the mm and do the list handling already in
pte_free_defer(), will not be so easy or possible at all.

> 
> (That may limit our possibilities with the deposited pagetables in
> future: I can imagine them too wanting to remain accessible as valid
> pagetables.  But that's not needed by this series, and s390 only uses
> deposit/withdraw for anon THP; and some are hoping that we might be
> able to move away from deposit/withdraw altogther - though powerpc's
> special use will make that more difficult.)
> 
> Thanks!
> Hugh
> 
> --- 6.4-rc5/arch/s390/mm/pgalloc.c
> +++ linux/arch/s390/mm/pgalloc.c
> @@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
>   */
>  unsigned long *page_table_alloc(struct mm_struct *mm)
>  {
> +	struct list_head *listed;
>  	unsigned long *table;
>  	struct page *page;
>  	unsigned int mask, bit;
> @@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
>  		table = NULL;
>  		spin_lock_bh(&mm->context.lock);
>  		if (!list_empty(&mm->context.pgtable_list)) {
> -			page = list_first_entry(&mm->context.pgtable_list,
> -						struct page, lru);
> +			listed = mm->context.pgtable_list.next;
> +			page = virt_to_page(listed);
>  			mask = atomic_read(&page->_refcount) >> 24;
>  			/*
>  			 * The pending removal bits must also be checked.
> @@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
>  				bit = mask & 1;		/* =1 -> second 2K */
>  				if (bit)
>  					table += PTRS_PER_PTE;
> +				BUG_ON(table != (unsigned long *)listed);
>  				atomic_xor_bits(&page->_refcount,
>  							0x01U << (bit + 24));
> -				list_del(&page->lru);
> +				list_del(listed);
> +				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
> +				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
>  			}
>  		}
>  		spin_unlock_bh(&mm->context.lock);
> @@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
>  		/* Return the first 2K fragment of the page */
>  		atomic_xor_bits(&page->_refcount, 0x01U << 24);
>  		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
> +		listed = (struct list head *)(table + PTRS_PER_PTE);

Missing "_" in "struct list head"

>  		spin_lock_bh(&mm->context.lock);
> -		list_add(&page->lru, &mm->context.pgtable_list);
> +		list_add(listed, &mm->context.pgtable_list);
>  		spin_unlock_bh(&mm->context.lock);
>  	}
>  	return table;
> @@ -310,6 +315,7 @@ static void page_table_release_check(str
>  
>  void page_table_free(struct mm_struct *mm, unsigned long *table)
>  {
> +	struct list_head *listed;
>  	unsigned int mask, bit, half;
>  	struct page *page;

Not sure if "reverse X-mas" is still part of any style guidelines,
but I still am a big fan of that :-). Although the other code in that
file is also not consistently using it ...

>  
> @@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
>  		 */
>  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  		mask >>= 24;
> -		if (mask & 0x03U)
> -			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> -			list_del(&page->lru);
> +		if (mask & 0x03U) {
> +			listed = (struct list_head *)table;
> +			list_add(listed, &mm->context.pgtable_list);
> +		} else {
> +			/*
> +			 * Get address of the other page table sharing the page.
> +			 * There are sure to be MUCH better ways to do all this!
> +			 * But I'm rushing, while trying to keep to the obvious.
> +			 */
> +			listed = (struct list_head *)(table + PTRS_PER_PTE);
> +			if (virt_to_page(listed) != page) {
> +				/* sizeof(*listed) is twice sizeof(*table) */
> +				listed -= PTRS_PER_PTE;
> +			}

Bitwise XOR with 0x800 should do the trick here, i.e. give you the address
of the other 2K half, like this:

			listed = (struct list_head *)((unsigned long) table ^ 0x800UL);
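
For readability that could even be a tiny helper - an untested sketch,
with a made-up name; it only relies on the 2K fragments being 2K-aligned
within their 4K page:

	static inline struct list_head *pgtable_buddy(unsigned long *table)
	{
		/*
		 * Each 2K fragment is 2K-aligned within its 4K page, so
		 * XOR with 0x800 yields the address of the other half.
		 */
		return (struct list_head *)((unsigned long)table ^ 0x800UL);
	}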

> +			list_del(listed);
> +			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> +			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> +		}
>  		spin_unlock_bh(&mm->context.lock);
>  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>  		mask >>= 24;
> @@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
>  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  			 unsigned long vmaddr)
>  {
> +	struct list_head *listed;
>  	struct mm_struct *mm;
>  	struct page *page;
>  	unsigned int bit, mask;
> @@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> -		list_add_tail(&page->lru, &mm->context.pgtable_list);
> -	else
> -		list_del(&page->lru);
> +	if (mask & 0x03U) {
> +		listed = (struct list_head *)table;
> +		list_add_tail(listed, &mm->context.pgtable_list);
> +	} else {
> +		/*
> +		 * Get address of the other page table sharing the page.
> +		 * There are sure to be MUCH better ways to do all this!
> +		 * But I'm rushing, and trying to keep to the obvious.
> +		 */
> +		listed = (struct list_head *)(table + PTRS_PER_PTE);
> +		if (virt_to_page(listed) != page) {
> +			/* sizeof(*listed) is twice sizeof(*table) */
> +			listed -= PTRS_PER_PTE;
> +		}

Same as above.

> +		list_del(listed);
> +		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> +		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> +	}
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
>  	tlb_remove_table(tlb, table);

Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page
  2023-06-06 19:08               ` Jason Gunthorpe
  (?)
@ 2023-06-07  3:49                 ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-07  3:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Hugh Dickins, Matthew Wilcox, Andrew Morton,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> > On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> > > 
> > > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > > >  		__free_page(page);
> > > >  	}
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
> > > > +
> > > > +static void pte_free_now(struct rcu_head *head)
> > > > +{
> > > > +	struct page *page;
> > > > +	int refcount;
> > > > +
> > > > +	page = container_of(head, struct page, rcu_head);
> > > > +	refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > > +				     &page->pt_frag_refcount);
> > > > +	if (refcount < PTE_FREE_DEFERRED) {
> > > > +		pte_fragment_free((unsigned long *)page_address(page), 0);
> > > > +		return;
> > > > +	}
> > > 
> > > From what I can tell power doesn't recycle the sub fragment into any
> > > kind of free list. It just waits for the last fragment to be unused
> > > and then frees the whole page.

Yes, it's relatively simple in that way: not as sophisticated as s390.

> > > 
> > > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> > > 
> > > 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > > 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > > 		if (!kernel)
> > > 			pgtable_pte_page_dtor(page);
> > > 		call_rcu(&page->rcu_head, free_page_rcu)
> > 
> > We need to be careful on the lock being freed in pgtable_pte_page_dtor(),
> > in Hugh's series IIUC we need the spinlock being there for the rcu section
> > alongside the page itself.  So even if we do so, we'll need to also rcu call
> > pgtable_pte_page_dtor() when needed.

Thanks, Peter, yes that's right.

> 
> Er yes, I botched that, the dtor and the free_page should be in the
> rcu callback function
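
Spelled out, that corrected suggestion would be roughly the following -
an untested sketch, reusing the free_page_rcu name from the snippet
above:

	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/*
		 * Both the dtor and the free happen only after the grace
		 * period has elapsed, so the page table and its ptl remain
		 * usable by any lockless walker until then.
		 */
		pgtable_pte_page_dtor(page);
		__free_page(page);
	}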

But it was just a botched detail, and won't have answered Jason's doubt.

I had three (or perhaps it amounts to two) reasons for doing it this way:
none of which may seem good enough reasons to you.  Certainly I'd agree
that the way it's done seems... arcane.

One, as I've indicated before, I don't actually dare to go all
the way into RCU freeing of all page tables for powerpc (or any other):
I should think it's a good idea that everyone wants in the end, but I'm
limited by my time and competence - and dread of losing my way in the
mmu_gather TLB #ifdef maze.  It's work for someone else not me.

(Could pte_free_defer() do as you suggest, without changing
pte_fragment_free() itself?  No, that doesn't work out when the deferral
does, say, the decrement of pt_frag_refcount from 2 to 1, and then
pte_fragment_free() does the decrement from 1 to 0: the page is freed
without deferral.)
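
The way the patch dodges that race is by biasing the count instead:
pte_free_defer() converts this fragment's single reference into
PTE_FREE_DEFERRED references, so a racing pte_fragment_free() of the
buddy fragment can never bring the count to zero during the grace
period; pte_free_now() above then converts it back and frees normally.
Roughly like this - an untested sketch reconstructed from that
arithmetic, not the exact hunk, and it glosses over the shared rcu_head
complication being discussed here:

	void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
	{
		struct page *page = virt_to_page(pgtable);

		/*
		 * Convert this fragment's one reference into
		 * PTE_FREE_DEFERRED: the count can no longer reach zero
		 * before the grace period, even if every other fragment
		 * of the page is freed meanwhile.
		 */
		atomic_add(PTE_FREE_DEFERRED - 1, &page->pt_frag_refcount);
		call_rcu(&page->rcu_head, pte_free_now);
	}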

Two, this was the code I'd worked out before, and was used in production,
so I had confidence in it - it was just my mistake that I'd forgotten the
single rcu_head issue, and thought I could avoid it in the initial posting.
powerpc has changed around since then, but apparently not in any way that
affects this.  And it's too easy to agree in review that something can be
simpler, without bringing back to mind why the complications are there.

Three (just an explanation of why the old code was like this), powerpc
relies on THP's page table deposit+withdraw protocol, even for shmem/
file THPs.  I've skirted that issue in this series, by sticking with
retract_page_tables(), not attempting to insert huge pmd immediately.
But if huge pmd is inserted to replace ptetable pmd, then ptetable must
be deposited: pte_free_defer() as written protects the deposited ptetable
from then being freed without deferral (rather like in the example above).

But does not protect it from being withdrawn and reused within that
grace period.  Jann has grave doubts whether that can ever be allowed
(or perhaps I should grant him certainty, and examples that it cannot).
I did convince myself, back in the day, that it was safe here: but I'll
have to put in a lot more thought to re-justify it now, and on the way
may instead be completely persuaded by Jann.

Not very good reasons: good enough, or can you supply a better patch?

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-06 18:39       ` Jason Gunthorpe
  (?)
@ 2023-06-08  2:46         ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-08  2:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, Gerald Schaefer, Vasily Gorbik, Andrew Morton,
	Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:
> 
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> It is not RCU anymore if it writes to the page table itself before the
> grace period, so this change seems to break the RCU behavior of
> page_table_free_rcu().. The rcu sync is inside tlb_remove_table()
> called after the stores.

Yes indeed, thanks for pointing that out.

> 
> Maybe something like an xarray on the mm to hold the frags?

I think we can manage without that:
I'll say slightly more in reply to Gerald.

Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-06 19:40       ` Gerald Schaefer
  (?)
@ 2023-06-08  3:35         ` Hugh Dickins
  -1 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-08  3:35 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 1 Jun 2023 15:57:51 +0200
> > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > > 
> > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > rcu_head reuse. It might be possible to use the cleverness from our
> > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > the case where both 2K pagetable fragments become unused, similar to
> > > how we decide when to actually call __free_page().
> > > 
> > > However, it might be much worse, and page->rcu_head from a pagetable
> > > page cannot be used at all for s390, because we also use page->lru
> > > to keep our list of free 2K pagetable fragments. I always get confused
> > > by struct page unions, so not completely sure, but it seems to me that
> > > page->rcu_head would overlay with page->lru, right?  
> > 
> > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > I'm wrong) I think that s390 could use exactly the same technique for
> > its list of free 2K pagetable fragments as it uses for its list of THP
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> Nice idea, I think that could actually work, since we only need the empty
> 2K halves on the list. So it should be possible to store the list_head
> inside those.

Jason quickly pointed out the flaw in my thinking there.

> 
> > 
> > And while it could use third and fourth longs instead, I don't see any
> > need for that: a deposited pagetable has been allocated, so would not
> > be on the list of free fragments.
> 
> Correct, that should not interfere.
> 
> > 
> > Below is one of the grossest patches I've ever posted: gross because
> > it's a rushed attempt to see whether that is viable, while it would take
> > me longer to understand all the s390 cleverness there (even though the
> > PP AA commentary above page_table_alloc() is excellent).
> 
> Sounds fair, this is also some of the grossest code we have, which is also
> why Alexander added the comment. I guess we could need even more comments
> inside the code, as it still confuses me more than it should.
> 
> Considering that, you did remarkably well. Your patch seems to work fine,
> at least it survived some LTP mm tests. I will also add it to our CI runs,
> to give it some more testing. Will report tomorrow when it broke something.
> See also below for some patch comments.

Many thanks for your effort on this patch.  I don't expect the testing
of it to catch Jason's point, that I'm corrupting the page table while
it's on its way through RCU to being freed, but he's right nonetheless.

I'll integrate your fixes below into what I have here, but probably
just archive it as something to refer to later in case it might play
a part; but probably it will not - sorry for wasting your time.

> 
> > 
> > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.
> 
> cmma_init_nodat() should be disjoint, not only because it is __init,
> but also because it explicitly skips pagetable pages, so it should
> never touch page->lru of those.
> 
> Not very familiar with the gmap code, it does look disjoint, and we should
> also use complete 4K pages for pagetables instead of 2K fragments there,
> but Christian or Claudio should also have a look.
> 
> > 
> > Gerald, s390 folk: would it be possible for you to give this
> > a try, suggest corrections and improvements, and then I can make it
> > a separate patch of the series; and work on avoiding concurrent use
> > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > the scheme already there, maybe DD bits to go along with the PP AA).
> 
> It feels like it could be possible to not only avoid the double
> rcu_head, but also avoid passing over the mm via page->pt_mm.
> I.e. have pte_free_defer(), which has the mm, do all the checks and
> list updates that page_table_free() does, for which we need the mm.
> Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> just do _dtor/__free_page similar to the generic version.

I'm not sure: I missed your suggestion there when I first skimmed
through, and today have spent more time getting deeper into how it's
done at present.  I am now feeling more confident of a way forward,
a nicely integrated way forward, than I was yesterday.
Though getting it right may not be so easy.

When Jason pointed out the existing RCU, I initially hoped that it might
already provide the necessary framework: but sadly not, because the
unbatched case (used when additional memory is not available) does not
use RCU at all, but instead the tlb_remove_table_sync_one() IRQ hack.
If I used that, it would cripple the s390 implementation unacceptably.

> 
> I must admit that I still have no good overview of the "big picture"
> here, and especially if this approach would still fit in. Probably not,
> as the to-be-freed pagetables would still be accessible, but not really
> valid, if we added them back to the list, with list_heads inside them.
> So maybe call_rcu() has to be done always, and not only for the case
> where the whole 4K page becomes free, then we probably cannot do w/o
> passing over the mm for proper list handling.

My current thinking (but may be proved wrong) is along the lines of:
why does something on its way to being freed need to be on any list
other than the rcu_head list?  I expect the current answer is that the
other half is allocated, so the page won't be freed; but I hope that
we can put it back on that list once we're through with the rcu_head.

But the less I say now, the less I shall make a fool of myself:
I need to get deeper in.

> 
> Ah, and they could also be re-used, once they are back on the list,
> which will probably not go well. Is that what you meant with DD bits,
> i.e. mark such fragments to prevent re-use? Smells a bit like the
> "pending purge"

Yes, we may not need those DD defer bits at all: the pte_free_defer()
pagetables should fit very well with "pending purge" as it is.  They
will go down an unbatched route, but should be obeying the same rules.

> 
> > 
> > Why am I even asking you to move away from page->lru: why don't I
> > thread s390's pte_free_defer() pagetables like THP's deposit does?
> > I cannot, because the deferred pagetables have to remain accessible
> > as valid pagetables, until the RCU grace period has elapsed - unless
> > all the list pointers would appear as pte_none(), which I doubt.
> 
> Yes, only empty and invalid PTEs will appear as pte_none(), i.e. entries
> that contain only 0x400.
> 
> Ok, I guess that also explains why the approach mentioned above,
> to avoid passing over the mm and do the list handling already in
> pte_free_defer(), will not be so easy or possible at all.
> 
> > 
> > (That may limit our possibilities with the deposited pagetables in
> > future: I can imagine them too wanting to remain accessible as valid
> > pagetables.  But that's not needed by this series, and s390 only uses
> > deposit/withdraw for anon THP; and some are hoping that we might be
> > able to move away from deposit/withdraw altogether - though powerpc's
> > special use will make that more difficult.)
> > 
> > Thanks!
> > Hugh
> > 
> > --- 6.4-rc5/arch/s390/mm/pgalloc.c
> > +++ linux/arch/s390/mm/pgalloc.c
> > @@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
> >   */
> >  unsigned long *page_table_alloc(struct mm_struct *mm)
> >  {
> > +	struct list_head *listed;
> >  	unsigned long *table;
> >  	struct page *page;
> >  	unsigned int mask, bit;
> > @@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
> >  		table = NULL;
> >  		spin_lock_bh(&mm->context.lock);
> >  		if (!list_empty(&mm->context.pgtable_list)) {
> > -			page = list_first_entry(&mm->context.pgtable_list,
> > -						struct page, lru);
> > +			listed = mm->context.pgtable_list.next;
> > +			page = virt_to_page(listed);
> >  			mask = atomic_read(&page->_refcount) >> 24;
> >  			/*
> >  			 * The pending removal bits must also be checked.
> > @@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
> >  				bit = mask & 1;		/* =1 -> second 2K */
> >  				if (bit)
> >  					table += PTRS_PER_PTE;
> > +				BUG_ON(table != (unsigned long *)listed);
> >  				atomic_xor_bits(&page->_refcount,
> >  							0x01U << (bit + 24));
> > -				list_del(&page->lru);
> > +				list_del(listed);
> > +				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
> > +				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
> >  			}
> >  		}
> >  		spin_unlock_bh(&mm->context.lock);
> > @@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
> >  		/* Return the first 2K fragment of the page */
> >  		atomic_xor_bits(&page->_refcount, 0x01U << 24);
> >  		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
> > +		listed = (struct list head *)(table + PTRS_PER_PTE);
> 
> Missing "_" in "struct list head"
> 
> >  		spin_lock_bh(&mm->context.lock);
> > -		list_add(&page->lru, &mm->context.pgtable_list);
> > +		list_add(listed, &mm->context.pgtable_list);
> >  		spin_unlock_bh(&mm->context.lock);
> >  	}
> >  	return table;
> > @@ -310,6 +315,7 @@ static void page_table_release_check(str
> >  
> >  void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  {
> > +	struct list_head *listed;
> >  	unsigned int mask, bit, half;
> >  	struct page *page;
> 
> Not sure if "reverse X-mas" is still part of any style guidelines,
> but I still am a big fan of that :-). Although the other code in that
> file is also not consistently using it ...
> 
> >  
> > @@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
> >  		 */
> >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  		mask >>= 24;
> > -		if (mask & 0x03U)
> > -			list_add(&page->lru, &mm->context.pgtable_list);
> > -		else
> > -			list_del(&page->lru);
> > +		if (mask & 0x03U) {
> > +			listed = (struct list_head *)table;
> > +			list_add(listed, &mm->context.pgtable_list);
> > +		} else {
> > +			/*
> > +			 * Get address of the other page table sharing the page.
> > +			 * There are sure to be MUCH better ways to do all this!
> > +			 * But I'm rushing, while trying to keep to the obvious.
> > +			 */
> > +			listed = (struct list_head *)(table + PTRS_PER_PTE);
> > +			if (virt_to_page(listed) != page) {
> > +				/* sizeof(*listed) is twice sizeof(*table) */
> > +				listed -= PTRS_PER_PTE;
> > +			}
> 
> Bitwise XOR with 0x800 should do the trick here, i.e. give you the address
> of the other 2K half, like this:
> 
> 			listed = (struct list_head *)((unsigned long) table ^ 0x800UL);
> 
> > +			list_del(listed);
> > +			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> > +			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> > +		}
> >  		spin_unlock_bh(&mm->context.lock);
> >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> >  		mask >>= 24;
> > @@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
> >  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  			 unsigned long vmaddr)
> >  {
> > +	struct list_head *listed;
> >  	struct mm_struct *mm;
> >  	struct page *page;
> >  	unsigned int bit, mask;
> > @@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
> >  	 */
> >  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  	mask >>= 24;
> > -	if (mask & 0x03U)
> > -		list_add_tail(&page->lru, &mm->context.pgtable_list);
> > -	else
> > -		list_del(&page->lru);
> > +	if (mask & 0x03U) {
> > +		listed = (struct list_head *)table;
> > +		list_add_tail(listed, &mm->context.pgtable_list);
> > +	} else {
> > +		/*
> > +		 * Get address of the other page table sharing the page.
> > +		 * There are sure to be MUCH better ways to do all this!
> > +		 * But I'm rushing, and trying to keep to the obvious.
> > +		 */
> > +		listed = (struct list_head *)(table + PTRS_PER_PTE);
> > +		if (virt_to_page(listed) != page) {
> > +			/* sizeof(*listed) is twice sizeof(*table) */
> > +			listed -= PTRS_PER_PTE;
> > +		}
> 
> Same as above.
> 
> > +		list_del(listed);
> > +		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> > +		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> > +	}
> >  	spin_unlock_bh(&mm->context.lock);
> >  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> >  	tlb_remove_table(tlb, table);
> 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

Thanks a lot, Gerald, sorry that it now looks like wasted effort.

I'm feeling confident enough of getting into s390 PP-AA-world now, that
I think my top priority should be posting a v2 of the two preliminary
series: get those out before focusing back on s390 mm/pgalloc.c.

Is it too early to wish you a happy reverse Xmas?

Hugh

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
@ 2023-06-08  3:35         ` Hugh Dickins
  0 siblings, 0 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-08  3:35 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Jann Horn, linux-arm-kernel, sparclinux, linuxppc-dev,
	linux-s390, linux-kernel, linux-mm

On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 1 Jun 2023 15:57:51 +0200
> > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:
> > > 
> > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > rcu_head reuse. It might be possible to use the cleverness from our
> > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > the case where both 2K pagetable fragments become unused, similar to
> > > how we decide when to actually call __free_page().
> > > 
> > > However, it might be much worse, and page->rcu_head from a pagetable
> > > page cannot be used at all for s390, because we also use page->lru
> > > to keep our list of free 2K pagetable fragments. I always get confused
> > > by struct page unions, so not completely sure, but it seems to me that
> > > page->rcu_head would overlay with page->lru, right?  
> > 
> > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > I'm wrong) I think that s390 could use exactly the same technique for
> > its list of free 2K pagetable fragments as it uses for its list of THP
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> Nice idea, I think that could actually work, since we only need the empty
> 2K halves on the list. So it should be possible to store the list_head
> inside those.

Jason quickly pointed out the flaw in my thinking there.

> 
> > 
> > And while it could use third and fourth longs instead, I don't see any
> > need for that: a deposited pagetable has been allocated, so would not
> > be on the list of free fragments.
> 
> Correct, that should not interfere.
> 
> > 
> > Below is one of the grossest patches I've ever posted: gross because
> > it's a rushed attempt to see whether that is viable, while it would take
> > me longer to understand all the s390 cleverness there (even though the
> > PP AA commentary above page_table_alloc() is excellent).
> 
> Sounds fair, this is also some of the grossest code we have, which is also
> why Alexander added the comment. I guess we could need even more comments
> inside the code, as it still confuses me more than it should.
> 
> Considering that, you did remarkably well. Your patch seems to work fine,
> at least it survived some LTP mm tests. I will also add it to our CI runs,
> to give it some more testing. Will report tomorrow when it broke something.
> See also below for some patch comments.

Many thanks for your effort on this patch.  I don't expect the testing
of it to catch Jason's point, that I'm corrupting the page table while
it's on its way through RCU to being freed, but he's right nonetheless.

I'll integrate your fixes below into what I have here, but probably
just archive it as something to refer to later in case it might play
a part; but probably it will not - sorry for wasting your time.

> 
> > 
> > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.
> 
> cmma_init_nodat() should be disjoint, not only because it is __init,
> but also because it explicitly skips pagetable pages, so it should
> never touch page->lru of those.
> 
> Not very familiar with the gmap code, it does look disjoint, and we should
> also use complete 4K pages for pagetables instead of 2K fragments there,
> but Christian or Claudio should also have a look.
> 
> > 
> > Gerald, s390 folk: would it be possible for you to give this
> > a try, suggest corrections and improvements, and then I can make it
> > a separate patch of the series; and work on avoiding concurrent use
> > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > the scheme already there, maybe DD bits to go along with the PP AA).
> 
> It feels like it could be possible to not only avoid the double
> rcu_head, but also avoid passing over the mm via page->pt_mm.
> I.e. have pte_free_defer(), which has the mm, do all the checks and
> list updates that page_table_free() does, for which we need the mm.
> Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> just do _dtor/__free_page similar to the generic version.

I'm not sure: I missed your suggestion there when I first skimmed
through, and today have spent more time getting deeper into how it's
done at present.  I am now feeling more confident of a way forward,
a nicely integrated way forward, than I was yesterday.
Though getting it right may not be so easy.

When Jason pointed out the existing RCU, I initially hoped that it might
already provide the necessary framework: but sadly not, because the
unbatched case (used when additional memory is not available) does not
use RCU at all, but instead the tlb_remove_table_sync_one() IRQ hack.
If I used that, it would cripple the s390 implementation unacceptably.

> 
> I must admit that I still have no good overview of the "big picture"
> here, and especially if this approach would still fit in. Probably not,
> as the to-be-freed pagetables would still be accessible, but not really
> valid, if we added them back to the list, with list_heads inside them.
> So maybe call_rcu() has to be done always, and not only for the case
> where the whole 4K page becomes free, then we probably cannot do w/o
> passing over the mm for proper list handling.

My current thinking (but may be proved wrong) is along the lines of:
why does something on its way to being freed need to be on any list other
than the rcu_head list?  I expect the current answer is, that the
other half is allocated, so the page won't be freed; but I hope that
we can put it back on that list once we're through with the rcu_head.

But the less I say now, the less I shall make a fool of myself:
I need to get deeper in.

> 
> Ah, and they could also be re-used, once they are back on the list,
> which will probably not go well. Is that what you meant with DD bits,
> i.e. mark such fragments to prevent re-use? Smells a bit like the
> "pending purge"

Yes, we may not need those DD defer bits at all: the pte_free_defer()
pagetables should fit very well with "pending purge" as it is.  They
will go down an unbatched route, but should be obeying the same rules.

> 
> > 
> > Why am I even asking you to move away from page->lru: why don't I
> > thread s390's pte_free_defer() pagetables like THP's deposit does?
> > I cannot, because the deferred pagetables have to remain accessible
> > as valid pagetables, until the RCU grace period has elapsed - unless
> > all the list pointers would appear as pte_none(), which I doubt.
> 
> Yes, only empty and invalid PTEs will appear as pte_none(), i.e. entries
> that contain only 0x400.
> 
> Ok, I guess that also explains why the approach mentioned above,
> to avoid passing over the mm and do the list handling already in
> pte_free_defer(), will not be so easy or possible at all.
> 
> > 
> > (That may limit our possibilities with the deposited pagetables in
> > future: I can imagine them too wanting to remain accessible as valid
> > pagetables.  But that's not needed by this series, and s390 only uses
> > deposit/withdraw for anon THP; and some are hoping that we might be
> > able to move away from deposit/withdraw altogether - though powerpc's
> > special use will make that more difficult.)
> > 
> > Thanks!
> > Hugh
> > 
> > --- 6.4-rc5/arch/s390/mm/pgalloc.c
> > +++ linux/arch/s390/mm/pgalloc.c
> > @@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
> >   */
> >  unsigned long *page_table_alloc(struct mm_struct *mm)
> >  {
> > +	struct list_head *listed;
> >  	unsigned long *table;
> >  	struct page *page;
> >  	unsigned int mask, bit;
> > @@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
> >  		table = NULL;
> >  		spin_lock_bh(&mm->context.lock);
> >  		if (!list_empty(&mm->context.pgtable_list)) {
> > -			page = list_first_entry(&mm->context.pgtable_list,
> > -						struct page, lru);
> > +			listed = mm->context.pgtable_list.next;
> > +			page = virt_to_page(listed);
> >  			mask = atomic_read(&page->_refcount) >> 24;
> >  			/*
> >  			 * The pending removal bits must also be checked.
> > @@ -259,9 +260,12 @@ unsigned long *page_table_alloc(struct m
> >  				bit = mask & 1;		/* =1 -> second 2K */
> >  				if (bit)
> >  					table += PTRS_PER_PTE;
> > +				BUG_ON(table != (unsigned long *)listed);
> >  				atomic_xor_bits(&page->_refcount,
> >  							0x01U << (bit + 24));
> > -				list_del(&page->lru);
> > +				list_del(listed);
> > +				set_pte((pte_t *)&table[0], __pte(_PAGE_INVALID));
> > +				set_pte((pte_t *)&table[1], __pte(_PAGE_INVALID));
> >  			}
> >  		}
> >  		spin_unlock_bh(&mm->context.lock);
> > @@ -288,8 +292,9 @@ unsigned long *page_table_alloc(struct m
> >  		/* Return the first 2K fragment of the page */
> >  		atomic_xor_bits(&page->_refcount, 0x01U << 24);
> >  		memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
> > +		listed = (struct list head *)(table + PTRS_PER_PTE);
> 
> Missing "_" in "struct list head"
> 
> >  		spin_lock_bh(&mm->context.lock);
> > -		list_add(&page->lru, &mm->context.pgtable_list);
> > +		list_add(listed, &mm->context.pgtable_list);
> >  		spin_unlock_bh(&mm->context.lock);
> >  	}
> >  	return table;
> > @@ -310,6 +315,7 @@ static void page_table_release_check(str
> >  
> >  void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  {
> > +	struct list_head *listed;
> >  	unsigned int mask, bit, half;
> >  	struct page *page;
> 
> Not sure if "reverse X-mas" is still part of any style guidelines,
> but I still am a big fan of that :-). Although the other code in that
> file is also not consistently using it ...
> 
> >  
> > @@ -325,10 +331,24 @@ void page_table_free(struct mm_struct *m
> >  		 */
> >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  		mask >>= 24;
> > -		if (mask & 0x03U)
> > -			list_add(&page->lru, &mm->context.pgtable_list);
> > -		else
> > -			list_del(&page->lru);
> > +		if (mask & 0x03U) {
> > +			listed = (struct list_head *)table;
> > +			list_add(listed, &mm->context.pgtable_list);
> > +		} else {
> > +			/*
> > +			 * Get address of the other page table sharing the page.
> > +			 * There are sure to be MUCH better ways to do all this!
> > +			 * But I'm rushing, while trying to keep to the obvious.
> > +			 */
> > +			listed = (struct list_head *)(table + PTRS_PER_PTE);
> > +			if (virt_to_page(listed) != page) {
> > +				/* sizeof(*listed) is twice sizeof(*table) */
> > +				listed -= PTRS_PER_PTE;
> > +			}
> 
> Bitwise XOR with 0x800 should do the trick here, i.e. give you the address
> of the other 2K half, like this:
> 
> 			listed = (struct list_head *)((unsigned long) table ^ 0x800UL);
> 
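That works because the two 2K fragments are just the two halves of one 4K
page, so their addresses differ only in bit 11; as a quick sketch of the
invariant behind the XOR (illustration only):

	/* the buddy fragment is the other 2K half of the same 4K page */
	unsigned long *buddy = (unsigned long *)((unsigned long)table ^ 0x800UL);
	BUG_ON(virt_to_page(buddy) != virt_to_page(table));
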
> > +			list_del(listed);
> > +			set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> > +			set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> > +		}
> >  		spin_unlock_bh(&mm->context.lock);
> >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> >  		mask >>= 24;
> > @@ -349,6 +369,7 @@ void page_table_free(struct mm_struct *m
> >  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  			 unsigned long vmaddr)
> >  {
> > +	struct list_head *listed;
> >  	struct mm_struct *mm;
> >  	struct page *page;
> >  	unsigned int bit, mask;
> > @@ -370,10 +391,24 @@ void page_table_free_rcu(struct mmu_gath
> >  	 */
> >  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  	mask >>= 24;
> > -	if (mask & 0x03U)
> > -		list_add_tail(&page->lru, &mm->context.pgtable_list);
> > -	else
> > -		list_del(&page->lru);
> > +	if (mask & 0x03U) {
> > +		listed = (struct list_head *)table;
> > +		list_add_tail(listed, &mm->context.pgtable_list);
> > +	} else {
> > +		/*
> > +		 * Get address of the other page table sharing the page.
> > +		 * There are sure to be MUCH better ways to do all this!
> > +		 * But I'm rushing, and trying to keep to the obvious.
> > +		 */
> > +		listed = (struct list_head *)(table + PTRS_PER_PTE);
> > +		if (virt_to_page(listed) != page) {
> > +			/* sizeof(*listed) is twice sizeof(*table) */
> > +			listed -= PTRS_PER_PTE;
> > +		}
> 
> Same as above.
> 
> > +		list_del(listed);
> > +		set_pte((pte_t *)&listed->next, __pte(_PAGE_INVALID));
> > +		set_pte((pte_t *)&listed->prev, __pte(_PAGE_INVALID));
> > +	}
> >  	spin_unlock_bh(&mm->context.lock);
> >  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> >  	tlb_remove_table(tlb, table);
> 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

Thanks a lot, Gerald, sorry that it now looks like wasted effort.

I'm feeling confident enough of getting into s390 PP-AA-world now, that
I think my top priority should be posting a v2 of the two preliminary
series: get those out before focusing back on s390 mm/pgalloc.c.

Is it too early to wish you a happy reverse Xmas?

Hugh

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-08  3:35         ` Hugh Dickins
  (?)
@ 2023-06-08 13:58           ` Jason Gunthorpe
  -1 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-08 13:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Vasily Gorbik, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Jann Horn, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Wed, Jun 07, 2023 at 08:35:05PM -0700, Hugh Dickins wrote:

> My current thinking (but may be proved wrong) is along the lines of:
> why does something on its way to being freed need to be on any list other
> than the rcu_head list?  I expect the current answer is, that the
> other half is allocated, so the page won't be freed; but I hope that
> we can put it back on that list once we're through with the rcu_head.

I was having the same thought. It is pretty tricky, but if this was
made into some core helper then PPC and S390 could both use it and PPC
would get a nice upgrade to have the S390 frag re-use instead of
leaking frags.

Broadly we have three states:

 all frags free
 at least one frag free
 all frags used

'all frags free' should be returned to the allocator
'at least one frag free' should have the struct page on the mm_struct's list
'all frags used' should be on no list.

So if we go from 
  all frags used -> at least one frag free
Then we put it on the RCU, then the RCU callback puts it on the mm_struct list

If we go from 
   at least one frag free -> all frags free
Then we take it off the mm_struct list, put it on the RCU, and RCU
frees it.
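
A rough sketch of those states and transitions, with invented names purely to
restate the above:

	/* Sketch only: the three states described above, names invented here. */
	enum pt_frag_state {
		PT_ALL_FRAGS_FREE,	/* page goes back to the allocator        */
		PT_SOME_FRAGS_FREE,	/* struct page sits on the mm_struct list */
		PT_ALL_FRAGS_USED,	/* page is on no list at all              */
	};

	/*
	 * all used  -> some free: hand the page to RCU; the RCU callback
	 *                         then adds it to the mm_struct list.
	 * some free -> all free : take the page off the mm_struct list,
	 *                         hand it to RCU; the RCU callback frees it.
	 */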

Your trick to put the list_head for the mm_struct list into the frag
memory looks like the right direction. So 'at least one frag free' has
a single already RCU free'd frag hold the list head pointer. Thus we
never use the LRU and the rcu_head is always available.

The struct page itself can contain the actual free frag bitmask.

I think if we split up the memory used for pt_frag_refcount we can get
enough bits to keep track of everything. With only 2-4 frags we should
be OK.

So we track this data in the struct page:
  - Current RCU free TODO bitmask - if non-zero then a RCU is already
    triggered
  - Next RCU TODO bitmask - If an RCU is already triggered then we
    accumulate more free'd frags here
  - Current Free Bits - Only updated by the RCU callback

?

We'd also need to store the mm_struct pointer in the struct page for
the RCU to be able to add/remove from the mm_struct list.
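
Spelled out as data, that per-page tracking might look roughly like the
following; the field names are invented, and in reality the bitmasks would
have to be packed into the bits currently used for pt_frag_refcount:

	/* Sketch only: per-page tracking state for deferred frag freeing. */
	struct pt_frag_track {
		unsigned int	rcu_todo;	/* frags in the in-flight call_rcu()  */
		unsigned int	rcu_next;	/* frags freed while RCU is in flight */
		unsigned int	free_bits;	/* updated only by the RCU callback   */
		struct mm_struct *mm;		/* so the callback can manage the list */
	};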

I'm not sure how much of the work can be done with atomics and how
much would need to rely on spinlock inside the mm_struct.

It feels feasible and not so bad. :)

Figure it out and test it on S390 then make power use the same common
code, and we get full RCU page table freeing using a reliable rcu_head
on both of these previously troublesome architectures :) Yay

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
@ 2023-06-08 15:47           ` Gerald Schaefer
  0 siblings, 0 replies; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-08 15:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Miaohe Lin, David Hildenbrand, Peter Zijlstra, Yang Shi,
	Peter Xu, linux-kernel, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Russell King, Matthew Wilcox, Steven Price,
	Christoph Hellwig, Jason Gunthorpe, Aneesh Kumar K.V,
	Axel Rasmussen, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Vasily Gorbik, Anshuman Khandual,
	Heiko Carstens, Qi Zheng, Suren Baghdasaryan, linux-arm-kernel,
	SeongJae Park, Jann Horn, linux-mm, linuxppc-dev,
	Naoya Horiguchi, Zack Rusin, Minchan Kim, Kirill A. Shutemov,
	Andrew Morton, Mel Gorman, David S. Miller, Mike Rapoport,
	Mike Kravetz

On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> > On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> > > On Thu, 1 Jun 2023 15:57:51 +0200
> > > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:  
> > > > 
> > > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > > rcu_head reuse. It might be possible to use the cleverness from our
> > > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > > the case where both 2K pagetable fragments become unused, similar to
> > > > how we decide when to actually call __free_page().
> > > > 
> > > > However, it might be much worse, and page->rcu_head from a pagetable
> > > > page cannot be used at all for s390, because we also use page->lru
> > > > to keep our list of free 2K pagetable fragments. I always get confused
> > > > by struct page unions, so not completely sure, but it seems to me that
> > > > page->rcu_head would overlay with page->lru, right?    
> > > 
> > > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > > I'm wrong) I think that s390 could use exactly the same technique for
> > > its list of free 2K pagetable fragments as it uses for its list of THP
> > > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > > the first two longs of the page table itself for threading the list.  
> > 
> > Nice idea, I think that could actually work, since we only need the empty
> > 2K halves on the list. So it should be possible to store the list_head
> > inside those.  
> 
> Jason quickly pointed out the flaw in my thinking there.

Yes, while I had the right concerns about "the to-be-freed pagetables would
still be accessible, but not really valid, if we added them back to the list,
with list_heads inside them", when suggesting the approach w/o passing over
the mm, I missed that we would have the very same issue already with the
existing page_table_free_rcu().

Thankfully Jason was watching out!

> 
> >   
> > > 
> > > And while it could use third and fourth longs instead, I don't see any
> > > need for that: a deposited pagetable has been allocated, so would not
> > > be on the list of free fragments.  
> > 
> > Correct, that should not interfere.
> >   
> > > 
> > > Below is one of the grossest patches I've ever posted: gross because
> > > it's a rushed attempt to see whether that is viable, while it would take
> > > me longer to understand all the s390 cleverness there (even though the
> > > PP AA commentary above page_table_alloc() is excellent).  
> > 
> > Sounds fair, this is also some of the grossest code we have, which is also
> > why Alexander added the comment. I guess we could need even more comments
> > inside the code, as it still confuses me more than it should.
> > 
> > Considering that, you did remarkably well. Your patch seems to work fine,
> > at least it survived some LTP mm tests. I will also add it to our CI runs,
> > to give it some more testing. Will report tomorrow when it broke something.
> > See also below for some patch comments.  
> 
> Many thanks for your effort on this patch.  I don't expect the testing
> of it to catch Jason's point, that I'm corrupting the page table while
> it's on its way through RCU to being freed, but he's right nonetheless.

Right, tests ran fine, but we would have introduced subtle issues with
racing gup_fast, I guess.

> 
> I'll integrate your fixes below into what I have here, but probably
> just archive it as something to refer to later in case it might play
> a part; but probably it will not - sorry for wasting your time.

No worries, looking at that s390 code can never be amiss. It seems I need
a regular refresher; at least I'm sure I already understood it better in the
past.

And who knows, with Jason's recent thoughts, that "list_head inside
pagetable" idea might not be dead yet.

> 
> >   
> > > 
> > > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.  
> > 
> > cmma_init_nodat() should be disjoint, not only because it is __init,
> > but also because it explicitly skips pagetable pages, so it should
> > never touch page->lru of those.
> > 
> > Not very familiar with the gmap code, it does look disjoint, and we should
> > also use complete 4K pages for pagetables instead of 2K fragments there,
> > but Christian or Claudio should also have a look.
> >   
> > > 
> > > Gerald, s390 folk: would it be possible for you to give this
> > > a try, suggest corrections and improvements, and then I can make it
> > > a separate patch of the series; and work on avoiding concurrent use
> > > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > > the scheme already there, maybe DD bits to go along with the PP AA).  
> > 
> > It feels like it could be possible to not only avoid the double
> > rcu_head, but also avoid passing over the mm via page->pt_mm.
> > I.e. have pte_free_defer(), which has the mm, do all the checks and
> > list updates that page_table_free() does, for which we need the mm.
> > Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> > and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> > just do _dtor/__free_page similar to the generic version.  
> 
> I'm not sure: I missed your suggestion there when I first skimmed
> through, and today have spent more time getting deeper into how it's
> done at present.  I am now feeling more confident of a way forward,
> a nicely integrated way forward, than I was yesterday.
> Though getting it right may not be so easy.

I think my "feeling" was a déjà vu of the existing logic that we use for
page_table_free_rcu() -> __tlb_remove_table(), where we also have no mm
any more at the end, and use the PP bits magic to find out if the page
can be freed, or if we still have fragments left.

Of course, in that case, we also would not need the mm any more for
list handling, as the to-be-freed fragments were already put back
on the list, but with PP bits set, to prevent re-use. And clearing
those would then make the fragment usable from the list again.

I guess that would also be the major difference here, i.e. your RCU
call-back would need to be able to add fragments back to the list, after
having removed them earlier to make room for page->rcu_head; but with
Jason's thoughts that does not seem so impossible after all.

I do not yet understand if the list_head would then necessarily need
to be inside the pagetable, because page->rcu_head/lru still cannot be
used (again). But you already have a patch for that, so either way
might be possible.

> 
> When Jason pointed out the existing RCU, I initially hoped that it might
> already provide the necessary framework: but sadly not, because the
> unbatched case (used when additional memory is not available) does not
> use RCU at all, but instead the tlb_remove_table_sync_one() IRQ hack.
> If I used that, it would cripple the s390 implementation unacceptably.
> 
> > 
> > I must admit that I still have no good overview of the "big picture"
> > here, and especially if this approach would still fit in. Probably not,
> > as the to-be-freed pagetables would still be accessible, but not really
> > valid, if we added them back to the list, with list_heads inside them.
> > So maybe call_rcu() has to be done always, and not only for the case
> > where the whole 4K page becomes free, then we probably cannot do w/o
> > passing over the mm for proper list handling.  
> 
> My current thinking (but may be proved wrong) is along the lines of:
> why does something on its way to being freed need to be on any list
> other than the rcu_head list?  I expect the current answer is, that the
> other half is allocated, so the page won't be freed; but I hope that
> we can put it back on that list once we're through with the rcu_head.

Yes, that looks promising. Such a fragment would not necessarily need
to be on the list, because while it is on its way, i.e. before the
RCU call-back finished, it cannot be re-used anyway.

page_table_alloc() could currently find such a fragment on the list, but
only to see the PP bits set, so it will not use it. Only after
__tlb_remove_table() in the RCU call-back resets the bits, it would be
usable again.

In your case, that could correspond to adding it back to the list.
That could even be an improvement, because page_table_alloc() would
not be bothered by such unusable fragments.

[...]
> 
> Is it too early to wish you a happy reverse Xmas?

Nice idea, we should make June 24th the reverse Xmas Remembrance Day :-)

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-08 15:47           ` Gerald Schaefer
@ 2023-06-13  6:34           ` Hugh Dickins
  2023-06-14 13:30             ` Gerald Schaefer
  -1 siblings, 1 reply; 158+ messages in thread
From: Hugh Dickins @ 2023-06-13  6:34 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, 8 Jun 2023, Gerald Schaefer wrote:
> On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > 
> > My current thinking (but may be proved wrong) is along the lines of:
> > why does something on its way to being freed need to be on any list
> > other than the rcu_head list?  I expect the current answer is, that the
> > other half is allocated, so the page won't be freed; but I hope that
> > we can put it back on that list once we're through with the rcu_head.
> 
> Yes, that looks promising. Such a fragment would not necessarily need
> to be on the list, because while it is on its way, i.e. before the
> RCU call-back finished, it cannot be re-used anyway.
> 
> page_table_alloc() could currently find such a fragment on the list, but
> only to see the PP bits set, so it will not use it. Only after
> __tlb_remove_table() in the RCU call-back resets the bits, it would be
> usable again.
> 
> In your case, that could correspond to adding it back to the list.
> That could even be an improvement, because page_table_alloc() would
> not be bothered by such unusable fragments.

Cutting down the Ccs for now, to just the most interested parties:
here's what I came up with.  Which is entirely unbuilt and untested,
and I may have got several of those tricky mask conditionals wrong;
but seems a good way forward, except for the admitted unsolved flaw
(I do not want to mmgrab() for every single page table).

I don't think you're at all likely to hit that flaw in practice,
so if you have time, please do try reviewing and building and running
(a wrong mask conditional may stop it from even booting, but I hope
you'll be able to spot what's wrong without wasting too much time).
And maybe someone can come up with a good solution to the flaw.

Thanks!
Hugh

[PATCH 07/12] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule that
a page is not linked to mm_context_t::pgtable_list while either half is
pending free, by either tlb_remove_table() or pte_free_defer(); but is
afterwards either relinked to the list (if other half is allocated), or
freed (if other half is free): by __tlb_remove_table() in both cases.

Set the page->pt_mm field to help with this.  But there is an unsolved
flaw: although reading that the other half is allocated guarantees that
the mm is still valid at that instant, what guarantees that it has not
already been freed before we take its context.lock?

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c          | 185 +++++++++++++++++++++++---------
 include/linux/mm_types.h        |   2 +-
 3 files changed, 142 insertions(+), 49 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..b40b2c0008ca 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -172,7 +172,7 @@ void page_table_free_pgste(struct page *page)
  * When a parent page gets fully allocated it contains 2KB-pgtables in both
  * upper and lower halves and is removed from mm_context_t::pgtable_list.
  *
- * When 2KB-pgtable is freed from to fully allocated parent page that
+ * When 2KB-pgtable is freed from the fully allocated parent page that
  * page turns partially allocated and added to mm_context_t::pgtable_list.
  *
  * If 2KB-pgtable is freed from the partially allocated parent page that
@@ -182,16 +182,24 @@ void page_table_free_pgste(struct page *page)
  * As follows from the above, no unallocated or fully allocated parent
  * pages are contained in mm_context_t::pgtable_list.
  *
+ * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
+ * the new rule is that a page is not linked to mm_context_t::pgtable_list
+ * while either half is pending free by any method; but afterwards is
+ * either relinked to it, or freed, by __tlb_remove_table().  This allows
+ * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
- *   PP  AA
- * 01234567    upper byte (bits 24-31) of struct page::_refcount
- *   ||  ||
- *   ||  |+--- upper 2KB-pgtable is allocated
- *   ||  +---- lower 2KB-pgtable is allocated
- *   |+------- upper 2KB-pgtable is pending for removal
- *   +-------- lower 2KB-pgtable is pending for removal
+ *   PPHHAA
+ * 76543210    upper byte (bits 24-31) of struct page::_refcount
+ *   ||||||
+ *   |||||+--- lower 2KB-pgtable is allocated
+ *   ||||+---- upper 2KB-pgtable is allocated
+ *   |||+----- lower 2KB-pgtable is pending free by page->rcu_head
+ *   ||+------ upper 2KB-pgtable is pending free by page->rcu_head
+ *   |+------- lower 2KB-pgtable is pending free by any method
+ *   +-------- upper 2KB-pgtable is pending free by any method
  *
  * (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
  * using _refcount is possible).
@@ -200,7 +208,7 @@ void page_table_free_pgste(struct page *page)
  * The parent page is either:
  *   - added to mm_context_t::pgtable_list in case the second half of the
  *     parent page is still unallocated;
- *   - removed from mm_context_t::pgtable_list in case both hales of the
+ *   - removed from mm_context_t::pgtable_list in case both halves of the
  *     parent page are allocated;
  * These operations are protected with mm_context_t::lock.
  *
@@ -244,25 +252,15 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			page = list_first_entry(&mm->context.pgtable_list,
 						struct page, lru);
 			mask = atomic_read(&page->_refcount) >> 24;
-			/*
-			 * The pending removal bits must also be checked.
-			 * Failure to do so might lead to an impossible
-			 * value of (i.e 0x13 or 0x23) written to _refcount.
-			 * Such values violate the assumption that pending and
-			 * allocation bits are mutually exclusive, and the rest
-			 * of the code unrails as result. That could lead to
-			 * a whole bunch of races and corruptions.
-			 */
-			mask = (mask | (mask >> 4)) & 0x03U;
-			if (mask != 0x03U) {
-				table = (unsigned long *) page_to_virt(page);
-				bit = mask & 1;		/* =1 -> second 2K */
-				if (bit)
-					table += PTRS_PER_PTE;
-				atomic_xor_bits(&page->_refcount,
-							0x01U << (bit + 24));
-				list_del(&page->lru);
-			}
+			/* Cannot be on this list if either half pending free */
+			WARN_ON_ONCE(mask & ~0x03U);
+			/* One or other half must be available, but not both */
+			WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
+			table = (unsigned long *)page_to_virt(page);
+			bit = mask & 0x01U;	/* =1 -> second 2K available */
+			table += bit * PTRS_PER_PTE;
+			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
+			list_del(&page->lru);
 		}
 		spin_unlock_bh(&mm->context.lock);
 		if (table)
@@ -278,6 +276,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	}
 	arch_set_page_dat(page, 0);
 	/* Initialize page table */
+	page->pt_mm = mm;
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
@@ -295,7 +294,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	return table;
 }
 
-static void page_table_release_check(struct page *page, void *table,
+static void page_table_release_check(struct page *page, unsigned long *table,
 				     unsigned int half, unsigned int mask)
 {
 	char msg[128];
@@ -314,24 +313,22 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	struct page *page;
 
 	page = virt_to_page(table);
+	WARN_ON_ONCE(page->pt_mm != mm);
 	if (!mm_alloc_pgste(mm)) {
 		/* Free 2K page table fragment of a 4K page */
 		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
 		spin_lock_bh(&mm->context.lock);
 		/*
-		 * Mark the page for delayed release. The actual release
-		 * will happen outside of the critical section from this
-		 * function or from __tlb_remove_table()
+		 * Mark the page for release. The actual release will happen
+		 * below from this function, or later from __tlb_remove_table().
 		 */
-		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
+		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if (mask & 0x03U)		/* other half is allocated */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
+		else if (!(mask & 0x30U))	/* other half not pending */
 			list_del(&page->lru);
 		spin_unlock_bh(&mm->context.lock);
-		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
-		mask >>= 24;
 		if (mask != 0x00U)
 			return;
 		half = 0x01U << bit;
@@ -355,6 +352,7 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 
 	mm = tlb->mm;
 	page = virt_to_page(table);
+	WARN_ON_ONCE(page->pt_mm != mm);
 	if (mm_alloc_pgste(mm)) {
 		gmap_unlink(mm, table, vmaddr);
 		table = (unsigned long *) ((unsigned long)table | 0x03U);
@@ -364,15 +362,13 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
 	spin_lock_bh(&mm->context.lock);
 	/*
-	 * Mark the page for delayed release. The actual release will happen
-	 * outside of the critical section from __tlb_remove_table() or from
-	 * page_table_free()
+	 * Mark the page for delayed release.
+	 * The actual release will happen later, from __tlb_remove_table().
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
-		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
+	/* Other half not allocated? Other half not already pending free? */
+	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
 		list_del(&page->lru);
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
@@ -382,17 +378,38 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 void __tlb_remove_table(void *_table)
 {
 	unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
-	void *table = (void *)((unsigned long) _table ^ mask);
+	unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
 	struct page *page = virt_to_page(table);
+	struct mm_struct *mm;
 
 	switch (half) {
 	case 0x00U:	/* pmd, pud, or p4d */
-		free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+		__free_pages(page, CRST_ALLOC_ORDER);
 		return;
 	case 0x01U:	/* lower 2K of a 4K page table */
-	case 0x02U:	/* higher 2K of a 4K page table */
-		mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
-		mask >>= 24;
+	case 0x02U:	/* upper 2K of a 4K page table */
+		/*
+		 * If the other half is marked as allocated, page->pt_mm must
+		 * still be valid, page->rcu_head no longer in use so page->lru
+		 * good for use, so now make the freed half available for reuse.
+		 * But be wary of races with that other half being freed.
+		 */
+		if (atomic_read(&page->_refcount) & (0x03U << 24)) {
+			mm = page->pt_mm;
+			/*
+			 * But what guarantees that mm has not been freed now?!
+			 * It's very unlikely, but we want certainty...
+			 */
+			spin_lock_bh(&mm->context.lock);
+			mask = atomic_xor_bits(&page->_refcount, mask << 28);
+			mask >>= 24;
+			if (mask & 0x03U)
+				list_add(&page->lru, &mm->context.pgtable_list);
+			spin_unlock_bh(&mm->context.lock);
+		} else {
+			mask = atomic_xor_bits(&page->_refcount, mask << 28);
+			mask >>= 24;
+		}
 		if (mask != 0x00U)
 			return;
 		break;
@@ -407,6 +424,78 @@ void __tlb_remove_table(void *_table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+	unsigned long *table;
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	table = (unsigned long *)page_to_virt(page);
+	table = (unsigned long *)((unsigned long)table | 0x03U);
+	__tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+	unsigned long *table;
+	struct page *page;
+	unsigned int mask;
+
+	page = container_of(head, struct page, rcu_head);
+	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+	table = (unsigned long *)page_to_virt(page);
+	table += bit * PTRS_PER_PTE;
+	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+	__tlb_remove_table(table);
+
+	/* If pte_free_defer() of the other half came in, queue it now */
+	if (mask & 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+	pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+	pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	unsigned int bit, mask;
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	WARN_ON_ONCE(page->pt_mm != mm);
+	if (mm_alloc_pgste(mm)) {
+		call_rcu(&page->rcu_head, pte_free_pgste);
+		return;
+	}
+	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
+			(PTRS_PER_PTE * sizeof(pte_t));
+
+	spin_lock_bh(&mm->context.lock);
+	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+	mask >>= 24;
+	/* Other half not allocated? Other half not already pending free? */
+	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
+		list_del(&page->lru);
+	spin_unlock_bh(&mm->context.lock);
+
+	/* Do not relink on rcu_head if other half already linked on rcu_head */
+	if ((mask & 0x0CU) != 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..1667a1bdb8a8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -146,7 +146,7 @@ struct page {
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
-				struct mm_struct *pt_mm; /* x86 pgds only */
+				struct mm_struct *pt_mm; /* x86 pgd, s390 */
 				atomic_t pt_frag_refcount; /* powerpc */
 			};
 #if ALLOC_SPLIT_PTLOCKS
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-13  6:34           ` Hugh Dickins
@ 2023-06-14 13:30             ` Gerald Schaefer
  2023-06-14 21:59               ` Hugh Dickins
  0 siblings, 1 reply; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-14 13:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Mon, 12 Jun 2023 23:34:08 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Thu, 8 Jun 2023, Gerald Schaefer wrote:
> > On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> > > 
> > > My current thinking (but may be proved wrong) is along the lines of:
> > > why does something on its way to being freed need to be on any list
> > > other than the rcu_head list?  I expect the current answer is, that the
> > > other half is allocated, so the page won't be freed; but I hope that
> > > we can put it back on that list once we're through with the rcu_head.  
> > 
> > Yes, that looks promising. Such a fragment would not necessarily need
> > to be on the list, because while it is on its way, i.e. before the
> > RCU call-back finished, it cannot be re-used anyway.
> > 
> > page_table_alloc() could currently find such a fragment on the list, but
> > only to see the PP bits set, so it will not use it. Only after
> > __tlb_remove_table() in the RCU call-back resets the bits, it would be
> > usable again.
> > 
> > In your case, that could correspond to adding it back to the list.
> > That could even be an improvement, because page_table_alloc() would
> > not be bothered by such unusable fragments.  
> 
> Cutting down the Ccs for now, to just the most interested parties:
> here's what I came up with.  Which is entirely unbuilt and untested,
> and I may have got several of those tricky mask conditionals wrong;
> but seems a good way forward, except for the admitted unsolved flaw
> (I do not want to mmgrab() for every single page table).
> 
> I don't think you're at all likely to hit that flaw in practice,
> so if you have time, please do try reviewing and building and running
> (a wrong mask conditional may stop it from even booting, but I hope
> you'll be able to spot what's wrong without wasting too much time).
> And maybe someone can come up with a good solution to the flaw.
> 
> Thanks!
> Hugh
> 
> [PATCH 07/12] s390: add pte_free_defer() for pgtables sharing page
> 
> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule that
> a page is not linked to mm_context_t::pgtable_list while either half is
> pending free, by either tlb_remove_table() or pte_free_defer(); but is
> afterwards either relinked to the list (if other half is allocated), or
> freed (if other half is free): by __tlb_remove_table() in both cases.
> 
> Set the page->pt_mm field to help with this.  But there is an unsolved
> flaw: although reading that the other half is allocated guarantees that
> the mm is still valid at that instant, what guarantees that it has not
> already been freed before we take its context.lock?

I fear this will not work: the stored mm might not only have been freed,
but, in the worst case, even re-used as a different mm. We do hit the
strangest races in Linux when running under the z/VM hypervisor, which can
stop CPUs for quite some time...

It would be much more acceptable to simply not add back such fragments
to the list, and thereby risk some memory waste, than to risk using an
unstable mm in __tlb_remove_table(). The amount of wasted memory in
practice might also not be a lot, depending on whether the fragments
belong to the same, contiguous mapping.

Also, we would not need to use page->pt_mm, and therefore make room for
page->pt_frag_refcount, which for some reason is (still) being used
in the new v4 of Vishal's "Split ptdesc from struct page" series...

See also my modified version of your patch at the end of this somewhat
lengthy mail.

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/s390/include/asm/pgalloc.h |   4 +
>  arch/s390/mm/pgalloc.c          | 185 +++++++++++++++++++++++---------
>  include/linux/mm_types.h        |   2 +-
>  3 files changed, 142 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..b40b2c0008ca 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -172,7 +172,7 @@ void page_table_free_pgste(struct page *page)
>   * When a parent page gets fully allocated it contains 2KB-pgtables in both
>   * upper and lower halves and is removed from mm_context_t::pgtable_list.
>   *
> - * When 2KB-pgtable is freed from to fully allocated parent page that
> + * When 2KB-pgtable is freed from the fully allocated parent page that
>   * page turns partially allocated and added to mm_context_t::pgtable_list.
>   *
>   * If 2KB-pgtable is freed from the partially allocated parent page that
> @@ -182,16 +182,24 @@ void page_table_free_pgste(struct page *page)
>   * As follows from the above, no unallocated or fully allocated parent
>   * pages are contained in mm_context_t::pgtable_list.
>   *
> + * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
> + * the new rule is that a page is not linked to mm_context_t::pgtable_list
> + * while either half is pending free by any method; but afterwards is
> + * either relinked to it, or freed, by __tlb_remove_table().  This allows
> + * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
> + *
>   * The upper byte (bits 24-31) of the parent page _refcount is used
>   * for tracking contained 2KB-pgtables and has the following format:
>   *
> - *   PP  AA
> - * 01234567    upper byte (bits 24-31) of struct page::_refcount
> - *   ||  ||
> - *   ||  |+--- upper 2KB-pgtable is allocated
> - *   ||  +---- lower 2KB-pgtable is allocated
> - *   |+------- upper 2KB-pgtable is pending for removal
> - *   +-------- lower 2KB-pgtable is pending for removal
> + *   PPHHAA
> + * 76543210    upper byte (bits 24-31) of struct page::_refcount

Hmm, big-endian? BTW, the existing (bits 24-31) notation also seems to be
somehow misleading, I guess it should rather read "bits 0-7 of the 32bit
value", but I also often get confused by this, so maybe it is correct.

> + *   ||||||
> + *   |||||+--- lower 2KB-pgtable is allocated
> + *   ||||+---- upper 2KB-pgtable is allocated
> + *   |||+----- lower 2KB-pgtable is pending free by page->rcu_head
> + *   ||+------ upper 2KB-pgtable is pending free by page->rcu_head
> + *   |+------- lower 2KB-pgtable is pending free by any method
> + *   +-------- upper 2KB-pgtable is pending free by any method
>   *
>   * (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
>   * using _refcount is possible).
> @@ -200,7 +208,7 @@ void page_table_free_pgste(struct page *page)
>   * The parent page is either:
>   *   - added to mm_context_t::pgtable_list in case the second half of the
>   *     parent page is still unallocated;
> - *   - removed from mm_context_t::pgtable_list in case both hales of the
> + *   - removed from mm_context_t::pgtable_list in case both halves of the
>   *     parent page are allocated;
>   * These operations are protected with mm_context_t::lock.
>   *
> @@ -244,25 +252,15 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  			page = list_first_entry(&mm->context.pgtable_list,
>  						struct page, lru);
>  			mask = atomic_read(&page->_refcount) >> 24;
> -			/*
> -			 * The pending removal bits must also be checked.
> -			 * Failure to do so might lead to an impossible
> -			 * value of (i.e 0x13 or 0x23) written to _refcount.
> -			 * Such values violate the assumption that pending and
> -			 * allocation bits are mutually exclusive, and the rest
> -			 * of the code unrails as result. That could lead to
> -			 * a whole bunch of races and corruptions.
> -			 */
> -			mask = (mask | (mask >> 4)) & 0x03U;
> -			if (mask != 0x03U) {
> -				table = (unsigned long *) page_to_virt(page);
> -				bit = mask & 1;		/* =1 -> second 2K */
> -				if (bit)
> -					table += PTRS_PER_PTE;
> -				atomic_xor_bits(&page->_refcount,
> -							0x01U << (bit + 24));
> -				list_del(&page->lru);
> -			}
> +			/* Cannot be on this list if either half pending free */
> +			WARN_ON_ONCE(mask & ~0x03U);
> +			/* One or other half must be available, but not both */
> +			WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
> +			table = (unsigned long *)page_to_virt(page);
> +			bit = mask & 0x01U;	/* =1 -> second 2K available */
> +			table += bit * PTRS_PER_PTE;
> +			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
> +			list_del(&page->lru);

I hope we can do w/o changing the page_table_alloc() code, and change as
little of the existing, very fragile and carefully tuned code as possible;
at least when we use the approach of not adding fragments back to the list
from pte_free_defer().

>  		}
>  		spin_unlock_bh(&mm->context.lock);
>  		if (table)
> @@ -278,6 +276,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  	}
>  	arch_set_page_dat(page, 0);
>  	/* Initialize page table */
> +	page->pt_mm = mm;
>  	table = (unsigned long *) page_to_virt(page);
>  	if (mm_alloc_pgste(mm)) {
>  		/* Return 4K page table with PGSTEs */
> @@ -295,7 +294,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  	return table;
>  }
>  
> -static void page_table_release_check(struct page *page, void *table,
> +static void page_table_release_check(struct page *page, unsigned long *table,
>  				     unsigned int half, unsigned int mask)
>  {
>  	char msg[128];
> @@ -314,24 +313,22 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  	struct page *page;
>  
>  	page = virt_to_page(table);
> +	WARN_ON_ONCE(page->pt_mm != mm);
>  	if (!mm_alloc_pgste(mm)) {
>  		/* Free 2K page table fragment of a 4K page */
>  		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
>  		spin_lock_bh(&mm->context.lock);
>  		/*
> -		 * Mark the page for delayed release. The actual release
> -		 * will happen outside of the critical section from this
> -		 * function or from __tlb_remove_table()
> +		 * Mark the page for release. The actual release will happen
> +		 * below from this function, or later from __tlb_remove_table().
>  		 */
> -		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> +		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));

Uh oh, I have a bad feeling about this. It seems as if it would somehow
revert the subtle race fix from commit c2c224932fd0 ("s390/mm: fix 2KB
pgtable release race").

>  		mask >>= 24;
> -		if (mask & 0x03U)
> +		if (mask & 0x03U)		/* other half is allocated */
>  			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> +		else if (!(mask & 0x30U))	/* other half not pending */
>  			list_del(&page->lru);
>  		spin_unlock_bh(&mm->context.lock);
> -		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> -		mask >>= 24;
>  		if (mask != 0x00U)
>  			return;
>  		half = 0x01U << bit;
> @@ -355,6 +352,7 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  
>  	mm = tlb->mm;
>  	page = virt_to_page(table);
> +	WARN_ON_ONCE(page->pt_mm != mm);
>  	if (mm_alloc_pgste(mm)) {
>  		gmap_unlink(mm, table, vmaddr);
>  		table = (unsigned long *) ((unsigned long)table | 0x03U);
> @@ -364,15 +362,13 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  	bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
>  	spin_lock_bh(&mm->context.lock);
>  	/*
> -	 * Mark the page for delayed release. The actual release will happen
> -	 * outside of the critical section from __tlb_remove_table() or from
> -	 * page_table_free()
> +	 * Mark the page for delayed release.
> +	 * The actual release will happen later, from __tlb_remove_table().
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> -		list_add_tail(&page->lru, &mm->context.pgtable_list);

I guess we want to keep this list_add(), if we do not add back fragments in
__tlb_remove_table() instead.

> -	else
> +	/* Other half not allocated? Other half not already pending free? */
> +	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
>  		list_del(&page->lru);
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> @@ -382,17 +378,38 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  void __tlb_remove_table(void *_table)
>  {
>  	unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
> -	void *table = (void *)((unsigned long) _table ^ mask);
> +	unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
>  	struct page *page = virt_to_page(table);
> +	struct mm_struct *mm;
>  
>  	switch (half) {
>  	case 0x00U:	/* pmd, pud, or p4d */
> -		free_pages((unsigned long)table, CRST_ALLOC_ORDER);
> +		__free_pages(page, CRST_ALLOC_ORDER);
>  		return;
>  	case 0x01U:	/* lower 2K of a 4K page table */
> -	case 0x02U:	/* higher 2K of a 4K page table */
> -		mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
> -		mask >>= 24;
> +	case 0x02U:	/* upper 2K of a 4K page table */
> +		/*
> +		 * If the other half is marked as allocated, page->pt_mm must
> +		 * still be valid, page->rcu_head no longer in use so page->lru
> +		 * good for use, so now make the freed half available for reuse.
> +		 * But be wary of races with that other half being freed.
> +		 */
> +		if (atomic_read(&page->_refcount) & (0x03U << 24)) {
> +			mm = page->pt_mm;
> +			/*
> +			 * But what guarantees that mm has not been freed now?!
> +			 * It's very unlikely, but we want certainty...
> +			 */
> +			spin_lock_bh(&mm->context.lock);
> +			mask = atomic_xor_bits(&page->_refcount, mask << 28);
> +			mask >>= 24;
> +			if (mask & 0x03U)
> +				list_add(&page->lru, &mm->context.pgtable_list);
> +			spin_unlock_bh(&mm->context.lock);
> +		} else {
> +			mask = atomic_xor_bits(&page->_refcount, mask << 28);
> +			mask >>= 24;
> +		}

Same as for page_table_alloc(), I hope we can avoid touching the existing
__tlb_remove_table() code, and certainly not also expose the existing
code to the "unstable mm" risk. So page_table_free_rcu() should stick to
its list_add(), and pte_free_defer() should do w/o adding back.

>  		if (mask != 0x00U)
>  			return;
>  		break;
> @@ -407,6 +424,78 @@ void __tlb_remove_table(void *_table)
>  	__free_page(page);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);
> +
> +static void pte_free_pgste(struct rcu_head *head)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	table = (unsigned long *)page_to_virt(page);
> +	table = (unsigned long *)((unsigned long)table | 0x03U);
> +	__tlb_remove_table(table);
> +}
> +
> +static void pte_free_half(struct rcu_head *head, unsigned int bit)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +	unsigned int mask;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
> +
> +	table = (unsigned long *)page_to_virt(page);
> +	table += bit * PTRS_PER_PTE;
> +	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
> +	__tlb_remove_table(table);
> +
> +	/* If pte_free_defer() of the other half came in, queue it now */
> +	if (mask & 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
> +}
> +
> +static void pte_free_now0(struct rcu_head *head)
> +{
> +	pte_free_half(head, 0);
> +}
> +
> > +static void pte_free_now1(struct rcu_head *head)
> +{
> +	pte_free_half(head, 1);
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	unsigned int bit, mask;
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	WARN_ON_ONCE(page->pt_mm != mm);
> +	if (mm_alloc_pgste(mm)) {
> > +		call_rcu(&page->rcu_head, pte_free_pgste);

Hmm, in page_table_free_rcu(), which is somewhat similar to this, we
also do a "gmap_unlink(mm, table, vmaddr)" for the mm_alloc_pgste(mm)
case.

I'm not familiar with this, nor with why we do not care about it in the
non-RCU page_table_free() code. But it might be required here, and we have
no addr here. That could probably be passed along from
retract_page_tables(), I guess, if needed.

Christian, Claudio, any idea about gmap_unlink() usage in
page_table_free_rcu() (and why not in page_table_free()), and if
it would also be required here?

IIUC, this whole series would at least ATM only affect THP collapse,
and for KVM, in particular the qemu process in the host, we seem to somehow
prevent THP usage in s390_enable_sie():

        /* split thp mappings and disable thp for future mappings */
        thp_split_mm(mm);

So mm_alloc_pgste(mm) might not be a valid case here ATM, but in case
this framework would later also be used for other things, it would be
good to be prepared, i.e. by either passing along the addr, or clarifying
that we do not need it...
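
Purely as a sketch of that "pass the addr along" variant (the extra vmaddr
parameter is an assumption, not what the posted patch does; the non-pgste
path would stay as in the patch):

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable, unsigned long vmaddr)
{
	struct page *page = virt_to_page(pgtable);

	if (mm_alloc_pgste(mm)) {
		gmap_unlink(mm, (unsigned long *) pgtable, vmaddr);
		call_rcu(&page->rcu_head, pte_free_pgste);
		return;
	}
	/* ... 2K-fragment handling as in the patch ... */
}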

> +		return;
> +	}
> +	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
> +			(PTRS_PER_PTE * sizeof(pte_t));
> +
> +	spin_lock_bh(&mm->context.lock);
> +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
> +	mask >>= 24;
> +	/* Other half not allocated? Other half not already pending free? */
> +	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
> +		list_del(&page->lru);
> +	spin_unlock_bh(&mm->context.lock);
> +
> +	/* Do not relink on rcu_head if other half already linked on rcu_head */
> +	if ((mask & 0x0CU) != 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);

I do not fully understand if / why we need the new HH bits. While working
on my patch, they seemed useful for sorting out list_add/del in the
various cases. Here they only seem to be used for preventing double rcu_head
usage; is this correct, or am I missing something?

Not needing page->pt_mm any more would make room for page->pt_frag_refcount,
and possibly allow a similar approach to the one on powerpc for the double
rcu_head issue. But of course not if Vishal grabs it in his "Split ptdesc
from struct page" series...

So, here is an alternative approach, w/o the "unstable mm" flaw: it does
not put fragments back on the list for the pte_free_defer() case, at the
risk of (temporarily) wasting an uncertain amount of memory on half-filled
4K pagetable pages.

There is unfortunately also very likely another flaw with getting the
list_add() and list_del() right, in page_table_free[_rcu]() vs.
pte_free_defer(). The problem is that pte_free_defer() can only do a
list_del(), and not also list_add(), because it must make sure that
page->lru will not be used any more. This is now asymmetrical to the
other cases. So page_table_free[_rcu](), and also pte_free_defer(),
must avoid list_del() w/o previous list_add().

My approach with checking the new HH bits (0x0CU) is likely racy, when
pte_free_half() clears its H bit. At least for the case where we have
two allocated fragments, and both should be freed via pte_free_defer(),
we might end up with a list_del() although there never was a list_add().

Instead of using the HH bits for synchronizing, it might be possible
to use list_del_init() instead of list_del(), probably in all places,
and check with list_empty() to avoid such a list_del() w/o previous
list_add().
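
A minimal sketch of that list_del_init()/list_empty() variant, using
page_table_free() as the example site (assumed shape only, untested; it
also assumes page->lru gets INIT_LIST_HEAD() when the 4K page is first
taken into use as a pgtable page, so that list_empty() is meaningful):

	spin_lock_bh(&mm->context.lock);
	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
	mask >>= 24;
	if (mask & 0x03U)			/* other half is allocated */
		list_add(&page->lru, &mm->context.pgtable_list);
	else if (!list_empty(&page->lru))	/* only unlink if ever linked */
		list_del_init(&page->lru);	/* leaves lru safe to re-test */
	spin_unlock_bh(&mm->context.lock);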

Or alternatively, we could stop adding back fragments to the list
in general, risking some more memory waste. It could also allow getting
rid of the list as a whole, and e.g. only keep some pointer to the
most recent free 2K fragment from a freshly allocated 4K page, for
one-time use. Alexander was playing around with such an idea.
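
A rough sketch of that "no list, just one pointer to the most recent free
half" idea, for illustration only (the pgtable_frag field and the helper
name are invented here, nothing of this exists):

static unsigned long *page_table_try_frag(struct mm_struct *mm)
{
	unsigned long *table = NULL;

	spin_lock_bh(&mm->context.lock);
	if (mm->context.pgtable_frag) {
		table = mm->context.pgtable_frag;	/* one-time use only */
		mm->context.pgtable_frag = NULL;
	}
	spin_unlock_bh(&mm->context.lock);
	return table;
}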

The current code is apparently simply too clever for us, and it only
seems to get harder with every change, like in this case. But maybe we
can handle it one more time...

---
 arch/s390/include/asm/pgalloc.h |    4 +
 arch/s390/mm/pgalloc.c          |  108 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 104 insertions(+), 8 deletions(-)

--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct m
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -185,11 +185,13 @@ void page_table_free_pgste(struct page *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
- *   PP  AA
+ *   PPHHAA
  * 01234567    upper byte (bits 24-31) of struct page::_refcount
- *   ||  ||
- *   ||  |+--- upper 2KB-pgtable is allocated
- *   ||  +---- lower 2KB-pgtable is allocated
+ *   ||||||
+ *   |||||+--- upper 2KB-pgtable is allocated
+ *   ||||+---- lower 2KB-pgtable is allocated
+ *   |||+----- upper 2KB-pgtable is pending free by page->rcu_head
+ *   ||+------ lower 2KB-pgtable is pending free by page->rcu_head
  *   |+------- upper 2KB-pgtable is pending for removal
  *   +-------- lower 2KB-pgtable is pending for removal
  *
@@ -229,6 +231,8 @@ void page_table_free_pgste(struct page *
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * TODO: Add comments for HH bits
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -325,13 +329,18 @@ void page_table_free(struct mm_struct *m
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		/*
+		 * If pte_free_defer() marked other half for delayed release,
+		 * w/o list_add(), we must _not_ do list_del()
+		 */
+		if (mask & 0x03U)		/* other half is allocated */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
+		else if (!(mask & 0x0CU))	/* no pte_free_defer() pending */
 			list_del(&page->lru);
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
+		/* Return if other half is allocated, or delayed release pending */
 		if (mask != 0x00U)
 			return;
 		half = 0x01U << bit;
@@ -370,9 +379,13 @@ void page_table_free_rcu(struct mmu_gath
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	/*
+	 * If pte_free_defer() marked other half for delayed release,
+	 * w/o list_add(), we must _not_ do list_del()
+	 */
+	if (mask & 0x03U)		/* other half is allocated */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
+	else if (!(mask & 0x0CU))	/* no pte_free_defer() pending */
 		list_del(&page->lru);
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
@@ -407,6 +420,85 @@ void __tlb_remove_table(void *_table)
 	__free_page(page);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now0(struct rcu_head *head);
+static void pte_free_now1(struct rcu_head *head);
+
+static void pte_free_pgste(struct rcu_head *head)
+{
+	unsigned long *table;
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	table = (unsigned long *)page_to_virt(page);
+	table = (unsigned long *)((unsigned long)table | 0x03U);
+	__tlb_remove_table(table);
+}
+
+static void pte_free_half(struct rcu_head *head, unsigned int bit)
+{
+	unsigned long *table;
+	struct page *page;
+	unsigned int mask;
+
+	page = container_of(head, struct page, rcu_head);
+	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
+
+	table = (unsigned long *)page_to_virt(page);
+	table += bit * PTRS_PER_PTE;
+	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
+	__tlb_remove_table(table);
+
+	/* If pte_free_defer() of the other half came in, queue it now */
+	if (mask & 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
+}
+
+static void pte_free_now0(struct rcu_head *head)
+{
+	pte_free_half(head, 0);
+}
+
+static void pte_free_now1(struct rcu_head *head)
+{
+	pte_free_half(head, 1);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	unsigned int bit, mask;
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	if (mm_alloc_pgste(mm)) {
+		/*
+		 * Need gmap_unlink(mm, pgtable, addr), like in
+		 * page_table_free_rcu() ???
+		 * If yes -> need addr parameter here, like in pte_free_tlb().
+		 */
+		call_rcu(&page->rcu_head, pte_free_pgste);
+		return;
+	}
+	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
+
+	spin_lock_bh(&mm->context.lock);
+	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
+	mask >>= 24;
+	/*
+	 * Other half not allocated?
+	 * If pte_free_defer() marked other half for delayed release,
+	 * w/o list_add(), we must _not_ do list_del()
+	 */
+	if ((mask & 0x03U) == 0x00U && (mask & 0x0CU) != 0x0CU)
+		list_del(&page->lru);
+	spin_unlock_bh(&mm->context.lock);
+
+	/* Do not relink on rcu_head if other half already linked on rcu_head */
+	if ((mask & 0x0CU) != 0x0CU)
+		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.


^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-14 13:30             ` Gerald Schaefer
@ 2023-06-14 21:59               ` Hugh Dickins
  2023-06-15 12:11                 ` Gerald Schaefer
  2023-06-15 12:34                 ` Jason Gunthorpe
  0 siblings, 2 replies; 158+ messages in thread
From: Hugh Dickins @ 2023-06-14 21:59 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Wed, 14 Jun 2023, Gerald Schaefer wrote:
> On Mon, 12 Jun 2023 23:34:08 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > On Thu, 8 Jun 2023, Gerald Schaefer wrote:
> > > On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
> > > Hugh Dickins <hughd@google.com> wrote:  
> > > > 
> > > > My current thinking (but may be proved wrong) is along the lines of:
> > > > why does something on its way to being freed need to be on any list
> > > > other than the rcu_head list?  I expect the current answer is, that the
> > > > other half is allocated, so the page won't be freed; but I hope that
> > > > we can put it back on that list once we're through with the rcu_head.  
> > > 
> > > Yes, that looks promising. Such a fragment would not necessarily need
> > > to be on the list, because while it is on its way, i.e. before the
> > > RCU call-back finished, it cannot be re-used anyway.
> > > 
> > > page_table_alloc() could currently find such a fragment on the list, but
> > > only to see the PP bits set, so it will not use it. Only after
> > > __tlb_remove_table() in the RCU call-back resets the bits, it would be
> > > usable again.
> > > 
> > > In your case, that could correspond to adding it back to the list.
> > > That could even be an improvement, because page_table_alloc() would
> > > not be bothered by such unusable fragments.  
> > 
> > Cutting down the Ccs for now, to just the most interested parties:
> > here's what I came up with.  Which is entirely unbuilt and untested,
> > and I may have got several of those tricky mask conditionals wrong;
> > but seems a good way forward, except for the admitted unsolved flaw
> > (I do not want to mmgrab() for every single page table).
> > 
> > I don't think you're at all likely to hit that flaw in practice,
> > so if you have time, please do try reviewing and building and running
> > (a wrong mask conditional may stop it from even booting, but I hope
> > you'll be able to spot what's wrong without wasting too much time).
> > And maybe someone can come up with a good solution to the flaw.
> > 
> > Thanks!
> > Hugh

Many thanks for getting into it, Gerald.

> > 
> > [PATCH 07/12] s390: add pte_free_defer() for pgtables sharing page
> > 
> > Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This version is more complicated than others: because s390 fits two 2K
> > page tables into one 4K page (so page->rcu_head must be shared between
> > both halves), and already uses page->lru (which page->rcu_head overlays)
> > to list any free halves; with clever management by page->_refcount bits.
> > 
> > Build upon the existing management, adjusted to follow a new rule that
> > a page is not linked to mm_context_t::pgtable_list while either half is
> > pending free, by either tlb_remove_table() or pte_free_defer(); but is
> > afterwards either relinked to the list (if other half is allocated), or
> > freed (if other half is free): by __tlb_remove_table() in both cases.
> > 
> > Set the page->pt_mm field to help with this.  But there is an unsolved
> > flaw: although reading that the other half is allocated guarantees that
> > the mm is still valid at that instant, what guarantees that it has not
> > already been freed before we take its context.lock?
> 
> I fear this will not work, the stored mm might not only have been freed,
> but even re-used as a different mm, in the worst case. We do hit the
> strangest races in Linux when running under z/VM hypervisor, which could
> stop CPUs for quite some time...

I do not doubt it!  But I thought it wouldn't be something that comes up
in the first five minutes of testing.  I was hoping that you might be able
to point to a piece of existing code which actually makes it a non-issue;
but failing that, we can add something.

I guess the best thing would be to modify kernel/fork.c to allow the
architecture to override free_mm(), and have arch/s390 use call_rcu() to
free the mm.  But as a quick and dirty s390-end workaround, how about:

--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -70,6 +70,8 @@ static inline int init_new_context(struct task_struct *tsk,
 	return 0;
 }
 
+#define destroy_context(mm) synchronize_rcu()
+
 static inline void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				      struct task_struct *tsk)
 {

I've always avoided synchronize_rcu(); perhaps it's not so bad over there
in the final __mmdrop(), but I may just be naive to imagine it's ever okay:
freeing the mm by RCU would surely have less impact.
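
A very rough sketch of that alternative, everything in it hypothetical:
mm_rcu would be a new rcu_head added to struct mm_struct, the function
would live in kernel/fork.c (per-arch override plumbing omitted), and
kmem_cache_free(mm_cachep, mm) is what free_mm(mm) expands to today:

static void mm_free_rcu(struct rcu_head *head)
{
	struct mm_struct *mm = container_of(head, struct mm_struct, mm_rcu);

	kmem_cache_free(mm_cachep, mm);
}

	/* ... and in __mmdrop(), instead of the direct free_mm(mm): */
	call_rcu(&mm->mm_rcu, mm_free_rcu);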

(Funnily enough, there's no problem when the stored mm gets re-used for
a different mm, once past its spin_lock_init(&mm->context.lock); because
all that's needed is to recheck page->_refcount AA bits under that lock,
find them clear, and unlock.  But if that piece of memory gets reused for
something else completely, then what's there may freeze the spin_lock()
forever, or be corrupted by it.  I wanted to use SLAB_TYPESAFE_BY_RCU,
but struct mm initialization is much too complicated for that.)

> 
> It would be much more acceptable to simply not add back such fragments
> to the list, and therefore risking some memory waste, than risking to
> use an unstable mm in __tlb_remove_table(). The amount of wasted memory
> in practice might also not be a lot, depending on whether the fragments
> belong to the same and contiguous mapping.

I agree that it's much better to waste a little memory (and only temporarily)
than to freeze or corrupt.  But it's not an insoluble problem, I just didn't
want to get into more change if there was already an answer that covers it.

I assume that the freed mm issue scared you away from testing my patch,
so we don't know whether I got those mask conditionals right or not?

> 
> Also, we would not need to use page->pt_mm, and therefore make room for
> page->pt_frag_refcount, which for some reason is (still) being used
> in new v4 from Vishals "Split ptdesc from struct page" series...

Vishal's ptdesc: I've been ignoring it as far as possible; I'll have to
respond on that later today.  I'm afraid it will be putting this all into
an intolerable straitjacket.  If ptdesc is actually making more space
available by some magic, great: but I don't expect to find that is so.
Anyway, for now, it's impossible (for me anyway) to think of that
at the same time as this.

> 
> See also my modified version of your patch at the end of this somewhat
> lengthy mail.
> 
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  arch/s390/include/asm/pgalloc.h |   4 +
> >  arch/s390/mm/pgalloc.c          | 185 +++++++++++++++++++++++---------
> >  include/linux/mm_types.h        |   2 +-
> >  3 files changed, 142 insertions(+), 49 deletions(-)
> > 
> > diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> > index 17eb618f1348..89a9d5ef94f8 100644
> > --- a/arch/s390/include/asm/pgalloc.h
> > +++ b/arch/s390/include/asm/pgalloc.h
> > @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
> >  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
> >  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
> >  
> > +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> > +#define pte_free_defer pte_free_defer
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> > +
> >  void vmem_map_init(void);
> >  void *vmem_crst_alloc(unsigned long val);
> >  pte_t *vmem_pte_alloc(void);
> > diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> > index 66ab68db9842..b40b2c0008ca 100644
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -172,7 +172,7 @@ void page_table_free_pgste(struct page *page)
> >   * When a parent page gets fully allocated it contains 2KB-pgtables in both
> >   * upper and lower halves and is removed from mm_context_t::pgtable_list.
> >   *
> > - * When 2KB-pgtable is freed from to fully allocated parent page that
> > + * When 2KB-pgtable is freed from the fully allocated parent page that
> >   * page turns partially allocated and added to mm_context_t::pgtable_list.
> >   *
> >   * If 2KB-pgtable is freed from the partially allocated parent page that
> > @@ -182,16 +182,24 @@ void page_table_free_pgste(struct page *page)
> >   * As follows from the above, no unallocated or fully allocated parent
> >   * pages are contained in mm_context_t::pgtable_list.
> >   *
> > + * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
> > + * the new rule is that a page is not linked to mm_context_t::pgtable_list
> > + * while either half is pending free by any method; but afterwards is
> > + * either relinked to it, or freed, by __tlb_remove_table().  This allows
> > + * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
> > + *
> >   * The upper byte (bits 24-31) of the parent page _refcount is used
> >   * for tracking contained 2KB-pgtables and has the following format:
> >   *
> > - *   PP  AA
> > - * 01234567    upper byte (bits 24-31) of struct page::_refcount
> > - *   ||  ||
> > - *   ||  |+--- upper 2KB-pgtable is allocated
> > - *   ||  +---- lower 2KB-pgtable is allocated
> > - *   |+------- upper 2KB-pgtable is pending for removal
> > - *   +-------- lower 2KB-pgtable is pending for removal
> > + *   PPHHAA
> > + * 76543210    upper byte (bits 24-31) of struct page::_refcount
> 
> Hmm, big-endian? BTW, the existing (bits 24-31) notation also seems to be
> somehow misleading, I guess it should rather read "bits 0-7 of the 32bit
> value", but I also often get confused by this, so maybe it is correct.

I did notice, late in the day, that s390 has CPU_BIG_ENDIAN in Kconfig,
and I'm certainly liable to get things wrong there (though did some BE
work on powerpc before).  But searching around, it seemed BIT(0) is 0x01,
so I thought I was right to be changing 01234567 to 76543210 there.
I wouldn't know whether "bits 24-31" is right or not: it made sense
to little-endian me, but that doesn't mean it's right.
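
For what it's worth, the way I keep that PPHHAA byte straight in my head
is by the mask each state contributes before the shift by (bit + 24);
purely an illustration, these macros are not in the patch:

/* bit == 0 selects the lower 2K half, bit == 1 the upper 2K half */
#define PT_AA(bit)	(0x01U << (bit))	/* half is allocated                   */
#define PT_HH(bit)	(0x04U << (bit))	/* half pending free by page->rcu_head */
#define PT_PP(bit)	(0x10U << (bit))	/* half pending free by any method     */

So pte_free_defer()'s 0x15U << (bit + 24) is
(PT_AA(bit) | PT_HH(bit) | PT_PP(bit)) << 24: one atomic_xor_bits() to
clear the allocated bit and set both pending bits for that half.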

> 
> > + *   ||||||
> > + *   |||||+--- lower 2KB-pgtable is allocated
> > + *   ||||+---- upper 2KB-pgtable is allocated
> > + *   |||+----- lower 2KB-pgtable is pending free by page->rcu_head
> > + *   ||+------ upper 2KB-pgtable is pending free by page->rcu_head
> > + *   |+------- lower 2KB-pgtable is pending free by any method
> > + *   +-------- upper 2KB-pgtable is pending free by any method
> >   *
> >   * (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
> >   * using _refcount is possible).
> > @@ -200,7 +208,7 @@ void page_table_free_pgste(struct page *page)
> >   * The parent page is either:
> >   *   - added to mm_context_t::pgtable_list in case the second half of the
> >   *     parent page is still unallocated;
> > - *   - removed from mm_context_t::pgtable_list in case both hales of the
> > + *   - removed from mm_context_t::pgtable_list in case both halves of the
> >   *     parent page are allocated;
> >   * These operations are protected with mm_context_t::lock.
> >   *
> > @@ -244,25 +252,15 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> >  			page = list_first_entry(&mm->context.pgtable_list,
> >  						struct page, lru);
> >  			mask = atomic_read(&page->_refcount) >> 24;
> > -			/*
> > -			 * The pending removal bits must also be checked.
> > -			 * Failure to do so might lead to an impossible
> > -			 * value of (i.e 0x13 or 0x23) written to _refcount.
> > -			 * Such values violate the assumption that pending and
> > -			 * allocation bits are mutually exclusive, and the rest
> > -			 * of the code unrails as result. That could lead to
> > -			 * a whole bunch of races and corruptions.
> > -			 */
> > -			mask = (mask | (mask >> 4)) & 0x03U;
> > -			if (mask != 0x03U) {
> > -				table = (unsigned long *) page_to_virt(page);
> > -				bit = mask & 1;		/* =1 -> second 2K */
> > -				if (bit)
> > -					table += PTRS_PER_PTE;
> > -				atomic_xor_bits(&page->_refcount,
> > -							0x01U << (bit + 24));
> > -				list_del(&page->lru);
> > -			}
> > +			/* Cannot be on this list if either half pending free */
> > +			WARN_ON_ONCE(mask & ~0x03U);
> > +			/* One or other half must be available, but not both */
> > +			WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
> > +			table = (unsigned long *)page_to_virt(page);
> > +			bit = mask & 0x01U;	/* =1 -> second 2K available */
> > +			table += bit * PTRS_PER_PTE;
> > +			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
> > +			list_del(&page->lru);
> 
> I hope we can do w/o changing page_table_alloc() code, and as little of the
> existing, very fragile and carefully tuned, code as possible. At least when
> we use the approach of not adding back fragments from pte_free_defer() to
> the list.

If we didn't add back fragments at all, it should be easy.  And IIUC very
few actually go through page_table_free() itself: don't the vast majority
go through page_table_free_rcu()?  The problem is with those ones wanting
to use page->lru while the pte_free_defer() ones want to use page->rcu_head.
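
Spelled out, since it's the crux (a heavily simplified excerpt, not the
real struct page definition):

/*
 * The two fields overlay each other in struct page, so a page whose
 * rcu_head has been handed to call_rcu() by pte_free_defer() must not
 * also be linked on mm_context_t::pgtable_list through page->lru.
 */
struct pgtable_page_overlay {
	union {
		struct list_head lru;		/* linkage on pgtable_list */
		struct rcu_head rcu_head;	/* queued by call_rcu()    */
	};
};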

> 
> >  		}
> >  		spin_unlock_bh(&mm->context.lock);
> >  		if (table)
> > @@ -278,6 +276,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> >  	}
> >  	arch_set_page_dat(page, 0);
> >  	/* Initialize page table */
> > +	page->pt_mm = mm;
> >  	table = (unsigned long *) page_to_virt(page);
> >  	if (mm_alloc_pgste(mm)) {
> >  		/* Return 4K page table with PGSTEs */
> > @@ -295,7 +294,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
> >  	return table;
> >  }
> >  
> > -static void page_table_release_check(struct page *page, void *table,
> > +static void page_table_release_check(struct page *page, unsigned long *table,
> >  				     unsigned int half, unsigned int mask)
> >  {
> >  	char msg[128];
> > @@ -314,24 +313,22 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  	struct page *page;
> >  
> >  	page = virt_to_page(table);
> > +	WARN_ON_ONCE(page->pt_mm != mm);
> >  	if (!mm_alloc_pgste(mm)) {
> >  		/* Free 2K page table fragment of a 4K page */
> >  		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
> >  		spin_lock_bh(&mm->context.lock);
> >  		/*
> > -		 * Mark the page for delayed release. The actual release
> > -		 * will happen outside of the critical section from this
> > -		 * function or from __tlb_remove_table()
> > +		 * Mark the page for release. The actual release will happen
> > +		 * below from this function, or later from __tlb_remove_table().
> >  		 */
> > -		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > +		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
> 
> Uh oh, I have a bad feeling about this. It seems as it would somehow revert
> the subtle race fix from commit c2c224932fd0 ("s390/mm: fix 2KB pgtable
> release race").

Thanks for the reference: I have that up on another screen, but it'll take
me a while to understand - thought I'd better reply before working out
whether that's an issue for this patch or not.

> 
> >  		mask >>= 24;
> > -		if (mask & 0x03U)
> > +		if (mask & 0x03U)		/* other half is allocated */
> >  			list_add(&page->lru, &mm->context.pgtable_list);
> > -		else
> > +		else if (!(mask & 0x30U))	/* other half not pending */
> >  			list_del(&page->lru);
> >  		spin_unlock_bh(&mm->context.lock);
> > -		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > -		mask >>= 24;
> >  		if (mask != 0x00U)
> >  			return;
> >  		half = 0x01U << bit;
> > @@ -355,6 +352,7 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  
> >  	mm = tlb->mm;
> >  	page = virt_to_page(table);
> > +	WARN_ON_ONCE(page->pt_mm != mm);
> >  	if (mm_alloc_pgste(mm)) {
> >  		gmap_unlink(mm, table, vmaddr);
> >  		table = (unsigned long *) ((unsigned long)table | 0x03U);
> > @@ -364,15 +362,13 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  	bit = ((unsigned long) table & ~PAGE_MASK) / (PTRS_PER_PTE*sizeof(pte_t));
> >  	spin_lock_bh(&mm->context.lock);
> >  	/*
> > -	 * Mark the page for delayed release. The actual release will happen
> > -	 * outside of the critical section from __tlb_remove_table() or from
> > -	 * page_table_free()
> > +	 * Mark the page for delayed release.
> > +	 * The actual release will happen later, from __tlb_remove_table().
> >  	 */
> >  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  	mask >>= 24;
> > -	if (mask & 0x03U)
> > -		list_add_tail(&page->lru, &mm->context.pgtable_list);
> 
> I guess we want to keep this list_add(), if we do not add back fragments in
> __tlb_remove_table() instead.

Quite likely: working on different solutions at the same time is beyond me.
I'd rather go back to the starting point and work it out from there.

> 
> > -	else
> > +	/* Other half not allocated? Other half not already pending free? */
> > +	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
> >  		list_del(&page->lru);
> >  	spin_unlock_bh(&mm->context.lock);
> >  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> > @@ -382,17 +378,38 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> >  void __tlb_remove_table(void *_table)
> >  {
> >  	unsigned int mask = (unsigned long) _table & 0x03U, half = mask;
> > -	void *table = (void *)((unsigned long) _table ^ mask);
> > +	unsigned long *table = (unsigned long *)((unsigned long) _table ^ mask);
> >  	struct page *page = virt_to_page(table);
> > +	struct mm_struct *mm;
> >  
> >  	switch (half) {
> >  	case 0x00U:	/* pmd, pud, or p4d */
> > -		free_pages((unsigned long)table, CRST_ALLOC_ORDER);
> > +		__free_pages(page, CRST_ALLOC_ORDER);
> >  		return;
> >  	case 0x01U:	/* lower 2K of a 4K page table */
> > -	case 0x02U:	/* higher 2K of a 4K page table */
> > -		mask = atomic_xor_bits(&page->_refcount, mask << (4 + 24));
> > -		mask >>= 24;
> > +	case 0x02U:	/* upper 2K of a 4K page table */
> > +		/*
> > +		 * If the other half is marked as allocated, page->pt_mm must
> > +		 * still be valid, page->rcu_head no longer in use so page->lru
> > +		 * good for use, so now make the freed half available for reuse.
> > +		 * But be wary of races with that other half being freed.
> > +		 */
> > +		if (atomic_read(&page->_refcount) & (0x03U << 24)) {
> > +			mm = page->pt_mm;
> > +			/*
> > +			 * But what guarantees that mm has not been freed now?!
> > +			 * It's very unlikely, but we want certainty...
> > +			 */
> > +			spin_lock_bh(&mm->context.lock);
> > +			mask = atomic_xor_bits(&page->_refcount, mask << 28);
> > +			mask >>= 24;
> > +			if (mask & 0x03U)
> > +				list_add(&page->lru, &mm->context.pgtable_list);
> > +			spin_unlock_bh(&mm->context.lock);
> > +		} else {
> > +			mask = atomic_xor_bits(&page->_refcount, mask << 28);
> > +			mask >>= 24;
> > +		}
> 
> Same as for page_table_alloc(), I hope we can avoid touching existing
> __tlb_remove_table() code, and certainly not expose also the existing
> code to the "unstable mm" risk. So page_table_free_rcu() should stick to
> its list_add(), and pte_free_defer() should do w/o adding back.

But aren't you then faced with the page->lru versus page->rcu_head problem?

> 
> >  		if (mask != 0x00U)
> >  			return;
> >  		break;
> > @@ -407,6 +424,78 @@ void __tlb_remove_table(void *_table)
> >  	__free_page(page);
> >  }
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);
> > +
> > +static void pte_free_pgste(struct rcu_head *head)
> > +{
> > +	unsigned long *table;
> > +	struct page *page;
> > +
> > +	page = container_of(head, struct page, rcu_head);
> > +	table = (unsigned long *)page_to_virt(page);
> > +	table = (unsigned long *)((unsigned long)table | 0x03U);
> > +	__tlb_remove_table(table);
> > +}
> > +
> > +static void pte_free_half(struct rcu_head *head, unsigned int bit)
> > +{
> > +	unsigned long *table;
> > +	struct page *page;
> > +	unsigned int mask;
> > +
> > +	page = container_of(head, struct page, rcu_head);
> > +	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
> > +
> > +	table = (unsigned long *)page_to_virt(page);
> > +	table += bit * PTRS_PER_PTE;
> > +	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
> > +	__tlb_remove_table(table);
> > +
> > +	/* If pte_free_defer() of the other half came in, queue it now */
> > +	if (mask & 0x0CU)
> > +		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
> > +}
> > +
> > +static void pte_free_now0(struct rcu_head *head)
> > +{
> > +	pte_free_half(head, 0);
> > +}
> > +
> > +static void pte_free_now1(struct rcu_read *head)
> > +{
> > +	pte_free_half(head, 1);
> > +}
> > +
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	WARN_ON_ONCE(page->pt_mm != mm);
> > +	if (mm_alloc_pgste(mm)) {
> > +		call_rcu(&page->rcu_head, pte_free_pgste)
> 
> Hmm, in page_table_free_rcu(), which is somewhat similar to this, we
> also do a "gmap_unlink(mm, table, vmaddr)" for the mm_alloc_pgste(mm)
> case.
> 
> Not familiar with this, and also e.g. why we do not care about this
> in the non-RCU page_table_free() code. But it might be required here,
> and we have no addr here. However, that could probably be passed along
> from retract_page_tables(), I guess, if needed.
> 
> Christian, Claudio, any idea about gmap_unlink() usage in
> page_table_free_rcu() (and why not in page_table_free()), and if
> it would also be required here?
> 
> IIUC, this whole series would at least ATM only affect THP collapse,
> and for KVM, in particular qemu process in the host, we seem to somehow
> prevent THP usage in s390_enable_sie():
> 
>         /* split thp mappings and disable thp for future mappings */
>         thp_split_mm(mm);
> 
> So mm_alloc_pgste(mm) might not be a valid case here ATM, but in case
> this framework would later also be used for other things, it would be
> good to be prepared, i.e. by either passing along the addr, or clarify
> that we do not need it...

Yes, some light on that would be appreciated.  I was going by what's
in page_table_free(), which was what was used for the pte_free_defer()
page tables before; and only noticed gmap_unlink() in _rcu() very late.

> 
> > +		return;
> > +	}
> > +	bit = ((unsigned long)pgtable & ~PAGE_MASK) /
> > +			(PTRS_PER_PTE * sizeof(pte_t));
> > +
> > +	spin_lock_bh(&mm->context.lock);
> > +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
> > +	mask >>= 24;
> > +	/* Other half not allocated? Other half not already pending free? */
> > +	if ((mask & 0x03U) == 0x00U && (mask & 0x30U) != 0x30U)
> > +		list_del(&page->lru);
> > +	spin_unlock_bh(&mm->context.lock);
> > +
> > +	/* Do not relink on rcu_head if other half already linked on rcu_head */
> > +	if ((mask & 0x0CU) != 0x0CU)
> > +		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
> 
> I do not fully understand if / why we need the new HH bits. While working
> on my patch it seemed to be useful for sorting out list_add/del in the
> various cases. Here it only seems to be used for preventing double rcu_head
> usage, is this correct, or am I missing something?

Correct, I only needed the HH bits for avoiding double rcu_head usage (then
implementing the second usage once the first has completed).  If you want to
distinguish pte_free_defer() tables from page_table_free_rcu() tables, the
HH bits would need to be considered in other places too, I imagine: that
gets more complicated, I fear.

> 
> Not needing the page->pt_mm any more would make room for page->pt_frag_refcount,
> and possibly a similar approach like on powerpc, for the double rcu_head
> issue. But of course not if Vishal grabs it in his "Split ptdesc from struct
> page" series...
> 
> So, here is some alternative approach, w/o the "unstable mm" flaw, by not

I'm sorry you took the "unstable mm" flaw as prohibitive: it was just a
flaw to be solved once the rest was stable.

> putting fragments back on the list for the pte_free_defer() case, but
> with some risk of (temporarily) wasting an uncertain amount of memory with
> half-filled 4K pages for pagetables.
> 
> There is unfortunately also very likely another flaw with getting the
> list_add() and list_del() right, in page_table_free[_rcu]() vs.
> pte_free_defer(). The problem is that pte_free_defer() can only do a
> list_del(), and not also list_add(), because it must make sure that
> page->lru will not be used any more. This is now asymmetrical to the
> other cases. So page_table_free[_rcu](), and also pte_free_defer(),
> must avoid list_del() w/o previous list_add().
> 
> My approach with checking the new HH bits (0x0CU) is likely racy, when
> pte_free_half() clears its H bit. At least for the case where we have
> two allocated fragments, and both should be freed via pte_free_defer(),
> we might end up with a list_del() although there never was a list_add().
> 
> Instead of using the HH bits for synchronizing, it might be possible
> to use list_del_init() instead of list_del(), probably in all places,
> and check with list_empty() to avoid such a list_del() w/o previous
> list_add().
> 
> Or alternatively, we could stop adding back fragments to the list
> in general, risking some more memory waste. It could also allow to
> get rid of the list as a whole, and e.g. only keep some pointer to the
> most recent free 2K fragment from a freshly allocated 4K page, for
> one-time use. Alexander was playing around with such an idea.

Powerpc is like that.  I have no idea how much gets wasted that way.
I was keen not to degrade what s390 does: which is definitely superior,
but possibly not worth the effort.

> 
> The current code is apparently simply too clever for us, and it only
> seems to get harder with every change, like in this case. But maybe we
> can handle it one more time...

I'm getting this reply back to you, before reviewing your patch below.
Probably the only way I can review yours is to try it myself and compare.
I'll look into it, once I understand c2c224932fd0.  But may have to write
to Vishal first, or get the v2 of my series out: if only I could work out
a safe and easy way of unbreaking s390...

Is there any chance of you trying mine?
But please don't let it waste your time.

Thanks,
Hugh

> 
> ---
>  arch/s390/include/asm/pgalloc.h |    4 +
>  arch/s390/mm/pgalloc.c          |  108 +++++++++++++++++++++++++++++++++++++---
>  2 files changed, 104 insertions(+), 8 deletions(-)
> 
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct m
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -185,11 +185,13 @@ void page_table_free_pgste(struct page *
>   * The upper byte (bits 24-31) of the parent page _refcount is used
>   * for tracking contained 2KB-pgtables and has the following format:
>   *
> - *   PP  AA
> + *   PPHHAA
>   * 01234567    upper byte (bits 24-31) of struct page::_refcount
> - *   ||  ||
> - *   ||  |+--- upper 2KB-pgtable is allocated
> - *   ||  +---- lower 2KB-pgtable is allocated
> + *   ||||||
> + *   |||||+--- upper 2KB-pgtable is allocated
> + *   ||||+---- lower 2KB-pgtable is allocated
> + *   |||+----- upper 2KB-pgtable is pending free by page->rcu_head
> + *   ||+------ lower 2KB-pgtable is pending free by page->rcu_head
>   *   |+------- upper 2KB-pgtable is pending for removal
>   *   +-------- lower 2KB-pgtable is pending for removal
>   *
> @@ -229,6 +231,8 @@ void page_table_free_pgste(struct page *
>   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
>   * while the PP bits are never used, nor such a page is added to or removed
>   * from mm_context_t::pgtable_list.
> + *
> + * TODO: Add comments for HH bits
>   */
>  unsigned long *page_table_alloc(struct mm_struct *mm)
>  {
> @@ -325,13 +329,18 @@ void page_table_free(struct mm_struct *m
>  		 */
>  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  		mask >>= 24;
> -		if (mask & 0x03U)
> +		/*
> +		 * If pte_free_defer() marked other half for delayed release,
> +		 * w/o list_add(), we must _not_ do list_del()
> +		 */
> +		if (mask & 0x03U)		/* other half is allocated */
>  			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> +		else if (!(mask & 0x0CU))	/* no pte_free_defer() pending */
>  			list_del(&page->lru);
>  		spin_unlock_bh(&mm->context.lock);
>  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>  		mask >>= 24;
> +		/* Return if other half is allocated, or delayed release pending */
>  		if (mask != 0x00U)
>  			return;
>  		half = 0x01U << bit;
> @@ -370,9 +379,13 @@ void page_table_free_rcu(struct mmu_gath
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> +	/*
> +	 * If pte_free_defer() marked other half for delayed release,
> +	 * w/o list_add(), we must _not_ do list_del()
> +	 */
> +	if (mask & 0x03U)		/* other half is allocated */
>  		list_add_tail(&page->lru, &mm->context.pgtable_list);
> -	else
> +	else if (!(mask & 0x0CU))	/* no pte_free_defer() pending */
>  		list_del(&page->lru);
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
> @@ -407,6 +420,85 @@ void __tlb_remove_table(void *_table)
>  	__free_page(page);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now0(struct rcu_head *head);
> +static void pte_free_now1(struct rcu_head *head);
> +
> +static void pte_free_pgste(struct rcu_head *head)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	table = (unsigned long *)page_to_virt(page);
> +	table = (unsigned long *)((unsigned long)table | 0x03U);
> +	__tlb_remove_table(table);
> +}
> +
> +static void pte_free_half(struct rcu_head *head, unsigned int bit)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +	unsigned int mask;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	mask = atomic_xor_bits(&page->_refcount, 0x04U << (bit + 24));
> +
> +	table = (unsigned long *)page_to_virt(page);
> +	table += bit * PTRS_PER_PTE;
> +	table = (unsigned long *)((unsigned long)table | (0x01U << bit));
> +	__tlb_remove_table(table);
> +
> +	/* If pte_free_defer() of the other half came in, queue it now */
> +	if (mask & 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now0 : pte_free_now1);
> +}
> +
> +static void pte_free_now0(struct rcu_head *head)
> +{
> +	pte_free_half(head, 0);
> +}
> +
> +static void pte_free_now1(struct rcu_head *head)
> +{
> +	pte_free_half(head, 1);
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	unsigned int bit, mask;
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	if (mm_alloc_pgste(mm)) {
> +		/*
> +		 * Need gmap_unlink(mm, pgtable, addr), like in
> +		 * page_table_free_rcu() ???
> +		 * If yes -> need addr parameter here, like in pte_free_tlb().
> +		 */
> +		call_rcu(&page->rcu_head, pte_free_pgste);
> +		return;
> +	}
> +	bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * sizeof(pte_t));
> +
> +	spin_lock_bh(&mm->context.lock);
> +	mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));
> +	mask >>= 24;
> +	/*
> +	 * Other half not allocated?
> +	 * If pte_free_defer() marked other half for delayed release,
> +	 * w/o list_add(), we must _not_ do list_del()
> +	 */
> +	if ((mask & 0x03U) == 0x00U && (mask & 0x0CU) != 0x0CU)
> +		list_del(&page->lru);
> +	spin_unlock_bh(&mm->context.lock);
> +
> +	/* Do not relink on rcu_head if other half already linked on rcu_head */
> +	if ((mask & 0x0CU) != 0x0CU)
> +		call_rcu(&page->rcu_head, bit ? pte_free_now1 : pte_free_now0);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.


* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-14 21:59               ` Hugh Dickins
@ 2023-06-15 12:11                 ` Gerald Schaefer
  2023-06-15 20:06                   ` Hugh Dickins
  2023-06-15 12:34                 ` Jason Gunthorpe
  1 sibling, 1 reply; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-15 12:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Wed, 14 Jun 2023 14:59:33 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

[...]
> > 
> > It would be much more acceptable to simply not add back such fragments
> > to the list, and therefore risking some memory waste, than risking to
> > use an unstable mm in __tlb_remove_table(). The amount of wasted memory
> > in practice might also not be a lot, depending on whether the fragments
> > belong to the same and contiguous mapping.  
> 
> I agree that it's much better to waste a little memory (and only temporarily)
> than to freeze or corrupt.  But it's not an insoluble problem, I just didn't
> want to get into more change if there was already an answer that covers it.
> 
> I assume that the freed mm issue scared you away from testing my patch,
> so we don't know whether I got those mask conditionals right or not?

Correct, that scared me a lot :-). On the one hand, I do not feel familiar
enough with the common code logic that might need to be changed, or at
least understood, in order to judge if this is a problem and how it could
be addressed.

On the other hand, I am scared of subtle bugs that would not show
immediately, and hit us by surprise later.

Your thoughts about using RCU to free mm, in order to address this
"unstable mm" in __tlb_remove_table(), sound like a reasonable approach.
But again, with my lack of understanding, I am not sure if I can cope
with that.

So at least be prepared for call backs on that issue, not by RCU but
by mail :-)

> 
> > 
> > Also, we would not need to use page->pt_mm, and therefore make room for
> > page->pt_frag_refcount, which for some reason is (still) being used
> > in new v4 from Vishal's "Split ptdesc from struct page" series...
> 
> Vishal's ptdesc: I've been ignoring as far as possible, I'll have to
> respond on that later today, I'm afraid it will be putting this all into
> an intolerable straitjacket.  If ptdesc is actually making more space
> available by some magic, great: but I don't expect to find that is so.
> Anyway, for now, there it's impossible (for me anyway) to think of that
> at the same time as this.

I can totally relate to that. And I also had the feeling and hope that
ptdesc would give some relief on complex struct page (mis-)use, but did
not yet get into investigating further.

[...]
> > I do not fully understand if / why we need the new HH bits. While working
> > on my patch it seemed to be useful for sorting out list_add/del in the
> > various cases. Here it only seems to be used for preventing double rcu_head
> > usage, is this correct, or am I missing something?  
> 
> Correct, I only needed the HH bits for avoiding double rcu_head usage (then
> implementing the second usage once the first has completed).  If you want to
> distinguish pte_free_defer() tables from page_table_free_rcu() tables, the
> HH bits would need to be considered in other places too, I imagine: that
> gets more complicated, I fear.

Yes, I have the same impression. My approach would prevent scary "unstable mm"
issues in __tlb_remove_table(), but probably introduce other subtle issues.
Or not so subtle, like potential double list_free(), as mentioned in my last
reply.

So it seems we have no completely safe approach so far, but I would agree
on going further with your approach for now. See below for patch comments.

[...]
> 
> I'm getting this reply back to you, before reviewing your patch below.
> Probably the only way I can review yours is to try it myself and compare.
> I'll look into it, once I understand c2c224932fd0.  But may have to write
> to Vishal first, or get the v2 of my series out: if only I could work out
> a safe and easy way of unbreaking s390...
> 
> Is there any chance of you trying mine?
> But please don't let it waste your time.

I have put it to some LTP tests now, and good news is that it does not show
any obvious issues. Only some deadlocks on mm->context.lock, but that can
easily be solved. Problem is that we have some users of that lock, who do
spin_lock() and not spin_lock_bh(). In the past, we had 3 different locks
in mm->context, and then combined them to use the same. Only the pagetable
list locks were taken with spin_lock_bh(), the others used spin_lock().

Of course, after combining them to use the same lock, it would have been
required to change the others to also use spin_lock_bh(), at least if there
was any good reason for using _bh in the pagetable list lock.
It seems there was not, which is why that mismatch was not causing any
issues so far, probably we had some reason which got removed in one of
the various reworks of that code...

With your patch, we do now have a reason, because __tlb_remove_table()
will usually be called in _bh context as RCU callback, and now also
takes that lock. So we also need this change (and two compile fixes,
marked below):

--- a/arch/s390/include/asm/tlbflush.h
+++ b/arch/s390/include/asm/tlbflush.h
@@ -79,12 +79,12 @@ static inline void __tlb_flush_kernel(vo
 
 static inline void __tlb_flush_mm_lazy(struct mm_struct * mm)
 {
-	spin_lock(&mm->context.lock);
+	spin_lock_bh(&mm->context.lock);
 	if (mm->context.flush_mm) {
 		mm->context.flush_mm = 0;
 		__tlb_flush_mm(mm);
 	}
-	spin_unlock(&mm->context.lock);
+	spin_unlock_bh(&mm->context.lock);
 }
 
 /*
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -102,14 +102,14 @@ struct gmap *gmap_create(struct mm_struc
 	if (!gmap)
 		return NULL;
 	gmap->mm = mm;
-	spin_lock(&mm->context.lock);
+	spin_lock_bh(&mm->context.lock);
 	list_add_rcu(&gmap->list, &mm->context.gmap_list);
 	if (list_is_singular(&mm->context.gmap_list))
 		gmap_asce = gmap->asce;
 	else
 		gmap_asce = -1UL;
 	WRITE_ONCE(mm->context.gmap_asce, gmap_asce);
-	spin_unlock(&mm->context.lock);
+	spin_unlock_bh(&mm->context.lock);
 	return gmap;
 }
 EXPORT_SYMBOL_GPL(gmap_create);
@@ -250,7 +250,7 @@ void gmap_remove(struct gmap *gmap)
 		spin_unlock(&gmap->shadow_lock);
 	}
 	/* Remove gmap from the pre-mm list */
-	spin_lock(&gmap->mm->context.lock);
+	spin_lock_bh(&gmap->mm->context.lock);
 	list_del_rcu(&gmap->list);
 	if (list_empty(&gmap->mm->context.gmap_list))
 		gmap_asce = 0;
@@ -260,7 +260,7 @@ void gmap_remove(struct gmap *gmap)
 	else
 		gmap_asce = -1UL;
 	WRITE_ONCE(gmap->mm->context.gmap_asce, gmap_asce);
-	spin_unlock(&gmap->mm->context.lock);
+	spin_unlock_bh(&gmap->mm->context.lock);
 	synchronize_rcu();
 	/* Put reference */
 	gmap_put(gmap);

These are the compile fixes:

> +static void pte_free_now1(struct rcu_read *head)

rcu_read -> rcu_head

> +{
> +	pte_free_half(head, 1);
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	unsigned int bit, mask;
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	WARN_ON_ONCE(page->pt_mm != mm);
> +	if (mm_alloc_pgste(mm)) {
> +		call_rcu(&page->rcu_head, pte_free_pgste)

Missing ";" at the end

> +		return;
> +	}


* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-14 21:59               ` Hugh Dickins
  2023-06-15 12:11                 ` Gerald Schaefer
@ 2023-06-15 12:34                 ` Jason Gunthorpe
  2023-06-15 21:09                   ` Hugh Dickins
  1 sibling, 1 reply; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-15 12:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:

> I guess the best thing would be to modify kernel/fork.c to allow the
> architecture to override free_mm(), and arch/s390 call_rcu to free mm.
> But as a quick and dirty s390-end workaround, how about:

RCU callbacks are not ordered so that doesn't seem like it helps..

synchronize_rcu would do the job since it is ordered, but I think the
performance cost is too great to just call it from mmdrop

rcu_barrier() followed by call_rcu on the mm struct might work, but I
don't know the cost

A per-cpu refcount scheme might also do the job reasonably

Making the page frag pool global (per-cpu global I guess) would also
remove the need to reach back to the freeable mm_struct and reduce the
need for struct page memory. This views it as a special kind of
kmemcache.

Another approach is to not use a rcu_head in the ptdesc at all.

With a global kmemcache-like-thing we could probably also organize
something where you don't use a rcu_head in the ptdesc, but instead
just a naked 'next' pointer. This would give enough space to have two
next pointers and the next pointers can be re-used for the normal free
list as well.

In this flow you'd thread the free'd frags onto a waterfall of global
per-cpu lists:
 - RCU free the next cycle
 - RCU free this cycle
 - Actually free

Where a single rcu_head and single call_rcu frees the entire 2nd list
to the 3rd list and then schedules the 1st list to be RCU'd next. This
eliminates the need to store a function pointer in the ptdesc at
all.

It requires some global per-cpu lock on the free/alloc paths however,
but this is basically what every other arch does as it frees the page
back to the page allocator.

I suspect that two next pointers would also eliminate pt_frag_refcount
entirely as we can encode that information in the low bits of the next
pointers.
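
Roughly the shape I have in mind, names invented, and using list_head
rather than naked next pointers just to keep the sketch short:

struct pt_frag_pool {			/* one instance per cpu */
	spinlock_t lock;		/* process-context users would take it with spin_lock_bh() */
	struct list_head next_cycle;	/* RCU free the next cycle */
	struct list_head this_cycle;	/* RCU free this cycle     */
	struct list_head free_now;	/* actually free           */
	struct rcu_head rcu;
};

static void pt_frag_advance(struct rcu_head *head)
{
	struct pt_frag_pool *pool = container_of(head, struct pt_frag_pool, rcu);

	spin_lock(&pool->lock);
	list_splice_tail_init(&pool->this_cycle, &pool->free_now);
	list_splice_tail_init(&pool->next_cycle, &pool->this_cycle);
	if (!list_empty(&pool->this_cycle))
		call_rcu(&pool->rcu, pt_frag_advance);	/* next waterfall step */
	spin_unlock(&pool->lock);
	/* everything now on free_now can go back to the page allocator */
}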

> (Funnily enough, there's no problem when the stored mm gets re-used for
> a different mm, once past its spin_lock_init(&mm->context.lock);
> because

We do have that really weird "type safe by rcu" thing in the
allocators, but I don't quite know how it works.

> Powerpc is like that.  I have no idea how much gets wasted that way.
> I was keen not to degrade what s390 does: which is definitely superior,
> but possibly not worth the effort.

Yeah, it would be good to understand if this is really sufficiently
beneficial..

> I'll look into it, once I understand c2c224932fd0.  But may have to write
> to Vishal first, or get the v2 of my series out: if only I could work out
> a safe and easy way of unbreaking s390...

Can arches opt in to RCU freeing page table support and still keep
your series sane?

Honestly, I feel like trying to RCU enable page tables should be its
own series. It is a sufficiently tricky subject on its own right.

Jason


* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-15 12:11                 ` Gerald Schaefer
@ 2023-06-15 20:06                   ` Hugh Dickins
  2023-06-16  8:38                     ` Gerald Schaefer
  0 siblings, 1 reply; 158+ messages in thread
From: Hugh Dickins @ 2023-06-15 20:06 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Hugh Dickins, Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, 15 Jun 2023, Gerald Schaefer wrote:
> On Wed, 14 Jun 2023 14:59:33 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
...
> > > 
> > > Also, we would not need to use page->pt_mm, and therefore make room for
> > > page->pt_frag_refcount, which for some reason is (still) being used
> > > in new v4 from Vishals "Split ptdesc from struct page" series...  
> > 
> > Vishal's ptdesc: I've been ignoring as far as possible, I'll have to
> > respond on that later today, I'm afraid it will be putting this all into
> > an intolerable straitjacket.  If ptdesc is actually making more space
> > available by some magic, great: but I don't expect to find that is so.
> > Anyway, for now, there it's impossible (for me anyway) to think of that
> > at the same time as this.
> 
> I can totally relate to that. And I also had the feeling and hope that
> ptdesc would give some relief on complex struct page (mis-)use, but did
> not yet get into investigating further.

ptdesc doesn't give any relief, just codifies some existing practices
under new names, and tends to force one architecture to conform to
another's methods.  As I warned Vishal earlier, s390 may need to go
its own way: we can update ptdesc to meet whatever are s390's needs.

> 
> [...]
> > > I do not fully understand if / why we need the new HH bits. While working
> > > on my patch it seemed to be useful for sorting out list_add/del in the
> > > various cases. Here it only seems to be used for preventing double rcu_head
> > > usage, is this correct, or am I missing something?  
> > 
> > Correct, I only needed the HH bits for avoiding double rcu_head usage (then
> > implementing the second usage once the first has completed).  If you want to
> > distinguish pte_free_defer() tables from page_table_free_rcu() tables, the
> > HH bits would need to be considered in other places too, I imagine: that
> > gets more complicated, I fear.
> 
> Yes, I have the same impression. My approach would prevent scary "unstable mm"
> issues in __tlb_remove_table(), but probably introduce other subtle issues.
> Or not so subtle, like potential double list_free(), as mentioned in my last
> reply.

A more urgent broken MIPS issue (with current linux-next) came up, so I
didn't get to look at your patch at all yesterday (nor the interesting
commit you pointed me to, still on my radar).  I take it from your words
above and below, that you've gone off your patch, and I shouldn't spend
time on it now - holding multiple approaches in mind gets me confused!

> 
> So it seems we have no completely safe approach so far, but I would agree
> on going further with your approach for now. See below for patch comments.
> 
> [...]
> > 
> > I'm getting this reply back to you, before reviewing your patch below.
> > Probably the only way I can review yours is to try it myself and compare.
> > I'll look into it, once I understand c2c224932fd0.  But may have to write
> > to Vishal first, or get the v2 of my series out: if only I could work out
> > a safe and easy way of unbreaking s390...
> > 
> > Is there any chance of you trying mine?
> > But please don't let it waste your time.
> 
> I have put it to some LTP tests now, and good news is that it does not show
> any obvious issues.

Oh that's very encouraging news, thanks a lot.

> Only some deadlocks on mm->context.lock,

I assume lockdep reports of risk of deadlock, rather than actual deadlock
seen?  I had meant to ask you to include lockdep (CONFIG_PROVE_LOCKING=y),
but it sounds like you rightly did so anyway.

> but that can
> easily be solved. Problem is that we have some users of that lock, who do
> spin_lock() and not spin_lock_bh(). In the past, we had 3 different locks
> in mm->context, and then combined them to use the same. Only the pagetable
> list locks were taken with spin_lock_bh(), the others used spin_lock().

I'd noticed that discrepancy, and was a little surprised that it wasn't
already causing problems (not being a driver person, I rarely come across
spin_lock_bh(); but by coincidence had to look into it very recently, to
fix a 6.4-rc iwlwifi regression on this laptop - and lockdep didn't like
me mixing spin_lock() and spin_lock_bh() there).

> 
> Of course, after combining them to use the same lock, it would have been
> required to change the others to also use spin_lock_bh(), at least if there
> was any good reason for using _bh in the pagetable list lock.
> It seems there was not, which is why that mismatch was not causing any
> issues so far, probably we had some reason which got removed in one of
> the various reworks of that code...
> 
> With your patch, we do now have a reason, because __tlb_remove_table()
> will usually be called in _bh context as RCU callback, and now also
> takes that lock. So we also need this change (and two compile fixes,
> marked below):

Right.  Though with my latest idea, we can use a separate lock for the
page table list, and leave mm->context.lock with spin_lock() as is.

> 
> --- a/arch/s390/include/asm/tlbflush.h
> +++ b/arch/s390/include/asm/tlbflush.h
> @@ -79,12 +79,12 @@ static inline void __tlb_flush_kernel(vo
>  
>  static inline void __tlb_flush_mm_lazy(struct mm_struct * mm)
>  {
> -	spin_lock(&mm->context.lock);
> +	spin_lock_bh(&mm->context.lock);
>  	if (mm->context.flush_mm) {
>  		mm->context.flush_mm = 0;
>  		__tlb_flush_mm(mm);
>  	}
> -	spin_unlock(&mm->context.lock);
> +	spin_unlock_bh(&mm->context.lock);
>  }
>  
>  /*
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -102,14 +102,14 @@ struct gmap *gmap_create(struct mm_struc
>  	if (!gmap)
>  		return NULL;
>  	gmap->mm = mm;
> -	spin_lock(&mm->context.lock);
> +	spin_lock_bh(&mm->context.lock);
>  	list_add_rcu(&gmap->list, &mm->context.gmap_list);
>  	if (list_is_singular(&mm->context.gmap_list))
>  		gmap_asce = gmap->asce;
>  	else
>  		gmap_asce = -1UL;
>  	WRITE_ONCE(mm->context.gmap_asce, gmap_asce);
> -	spin_unlock(&mm->context.lock);
> +	spin_unlock_bh(&mm->context.lock);
>  	return gmap;
>  }
>  EXPORT_SYMBOL_GPL(gmap_create);
> @@ -250,7 +250,7 @@ void gmap_remove(struct gmap *gmap)
>  		spin_unlock(&gmap->shadow_lock);
>  	}
>  	/* Remove gmap from the pre-mm list */
> -	spin_lock(&gmap->mm->context.lock);
> +	spin_lock_bh(&gmap->mm->context.lock);
>  	list_del_rcu(&gmap->list);
>  	if (list_empty(&gmap->mm->context.gmap_list))
>  		gmap_asce = 0;
> @@ -260,7 +260,7 @@ void gmap_remove(struct gmap *gmap)
>  	else
>  		gmap_asce = -1UL;
>  	WRITE_ONCE(gmap->mm->context.gmap_asce, gmap_asce);
> -	spin_unlock(&gmap->mm->context.lock);
> +	spin_unlock_bh(&gmap->mm->context.lock);
>  	synchronize_rcu();
>  	/* Put reference */
>  	gmap_put(gmap);

So we won't need to include those changes above...

> 
> These are the compile fixes:
> 
> > +static void pte_free_now1(struct rcu_read *head)
> 
> rcu_read -> rcu_head
> 
> > +{
> > +	pte_free_half(head, 1);
> > +}
> > +
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +	unsigned int bit, mask;
> > +	struct page *page;
> > +
> > +	page = virt_to_page(pgtable);
> > +	WARN_ON_ONCE(page->pt_mm != mm);
> > +	if (mm_alloc_pgste(mm)) {
> > +		call_rcu(&page->rcu_head, pte_free_pgste)
> 
> Missing ";" at the end
> 
> > +		return;
> > +	}

... but of course I must add these in: many thanks.
And read up on the interesting commit.

You don't mention whether you were running with the
#define destroy_context synchronize_rcu
patch in.  And I was going to ask you to clarify on that,
but there's no need: I found this morning that it was a bad idea.

Of course x86 doesn't tell a lot about s390 down at this level, and
what's acceptable on one may be unacceptable on the other; but when
I tried a synchronize_rcu() in x86's destroy_context(), the machines
were not happy under load, warning messages, freeze: it looks like
final __mmdrop() can sometimes be called from a context which is
not at all suited for synchronize_rcu().

So then as another experiment, I tried adding synchronize_rcu() into
the safer context at the end of exit_mmap(): that ran okay, but
significantly slower (added 12% on kernel builds) than before.

My latest idea: we keep a SLAB_TYPESAFE_BY_RCU kmem cache for the
spinlock, and most probably the pgtable_list head, and back pointer
to the mm; and have each page table page use the pt_mm field to
point to that structure, instead of to the mm itself.  Then
__tlb_remove_table() can safely take the lock, even when the
mm itself has gone away and been reused, even when that structure
has gone away and been reused.  Hmm, I don't think it will even
need to contain a backpointer to the mm.
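
In code it would be roughly this shape (names invented, and glossing over
how each page table page's pt_mm gets repointed at the new structure):

/* The object that outlives, type-safely, any mm which pointed to it */
struct pgtable_ctx {
	spinlock_t lock;
	struct list_head pgtable_list;
};

static struct kmem_cache *pgtable_ctx_cache;

static void pgtable_ctx_ctor(void *object)
{
	struct pgtable_ctx *ctx = object;

	/*
	 * With SLAB_TYPESAFE_BY_RCU the constructor runs when the object is
	 * first set up, not on each re-allocation: the spinlock is initialized
	 * here, once, and may still be taken by a late __tlb_remove_table()
	 * after the object has been freed and recycled.
	 */
	spin_lock_init(&ctx->lock);
	INIT_LIST_HEAD(&ctx->pgtable_list);
}

static int __init pgtable_ctx_cache_init(void)
{
	pgtable_ctx_cache = kmem_cache_create("pgtable_ctx",
					      sizeof(struct pgtable_ctx), 0,
					      SLAB_TYPESAFE_BY_RCU,
					      pgtable_ctx_ctor);
	return pgtable_ctx_cache ? 0 : -ENOMEM;
}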

But I see that as the right way forward, rather than as something
needed today or tomorrow: in the meanwhile, to get v2 of my patchset
out without breaking s390, I'm contemplating (the horror!) a global
spinlock.

Many thanks, Gerald, I feel much better about it today.
Hugh


* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-15 12:34                 ` Jason Gunthorpe
@ 2023-06-15 21:09                   ` Hugh Dickins
  2023-06-16 12:35                     ` Jason Gunthorpe
  0 siblings, 1 reply; 158+ messages in thread
From: Hugh Dickins @ 2023-06-15 21:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, Gerald Schaefer, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, 15 Jun 2023, Jason Gunthorpe wrote:
> On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:
> 
> > I guess the best thing would be to modify kernel/fork.c to allow the
> > architecture to override free_mm(), and arch/s390 call_rcu to free mm.
> > But as a quick and dirty s390-end workaround, how about:
> 
> RCU callbacks are not ordered so that doesn't seem like it helps..

Thanks, that's an interesting and important point, which I need to knock
into my head better.

But can you show me where that's handled in the existing mm/mmu_gather.c
include/asm-generic/tlb.h framework?  I don't see any rcu_barrier()s
there, yet don't the pmd_huge_pte pointers point into pud page tables
freed shortly afterwards also by RCU?

> 
> synchronize_rcu would do the job since it is ordered, but I think the
> performance cost is too great to just call it from mmdrop

Yes, on x86 it proved to be a non-starter; maybe s390 doesn't have the
same limitation, but it was clear I was naive to hope that a slowdown
on the exit mm path might not be noticeable.

> 
> rcu_barrier() followed by call_rcu on the mm struct might work, but I
> don't know the cost

SLAB_TYPESAFE_BY_RCU handling has the rcu_barrier() built in,
when the slab is destroyed.

> 
> A per-cpu refcount scheme might also do the job reasonably
> 
> Making the page frag pool global (per-cpu global I guess) would also
> remove the need to reach back to the freeable mm_struct and reduce the
> need for struct page memory. This views it as a special kind of
> kmemcache.

I haven't thought in that direction at all.  Hmm.  Or did I think of
it once, but discarded for accounting reasons - IIRC (haven't rechecked)
page table pages are charged to memcg, and counted for meminfo and other(?)
purposes: if the fragments are all lumped into a global pool, we lose that.
I think I decided: maybe a good idea, but not a change I should make to
get me out of this particular hole.

> 
> Another approach is to not use a rcu_head in the ptdesc at all.
> 
> With a global kmemcache-like-thing we could probably also organize
> something where you don't use a rcu_head in the ptdesc, but instead
> just a naked 'next' pointer. This would give enough space to have two
> next pointers and the next pointers can be re-used for the normal free
> list as well.
> 
> In this flow you'd thread the free'd frags onto a waterfall of global
> per-cpu lists:
>  - RCU free the next cycle
>  - RCU free this cycle
>  - Actually free
> 
> Where a single rcu_head and single call_rcu frees the entire 2nd list
> to the 3rd list and then schedules the 1st list to be RCU'd next. This
> eliminates the need to store a function pointer in the ptdesc at
> all.
> 
> It requires some global per-cpu lock on the free/alloc paths however,
> but this is basically what every other arch does as it frees the page
> back to the page allocator.
> 
> I suspect that two next pointers would also eliminate pt_frag_refcount
> entirely as we can encode that information in the low bits of the next
> pointers.

This scheme is clearer in your head than it is in mine.  It may be the
best solution, but I don't see it clearly enough to judge.  I'll carry
on with my way, then you can replace it later on.
> 
> > (Funnily enough, there's no problem when the stored mm gets re-used for
> > a different mm, once past its spin_lock_init(&mm->context.lock);
> > because
> 
> We do have that really weird "type safe by rcu" thing in the
> allocators, but I don't quite know how it works.

I'm quite familiar with it, since I invented it (SLAB_DESTROY_BY_RCU
in 2.6.9 to solve the locking for anon_vma): so it does tend to be my
tool of choice when appropriate.  It's easy: but you cannot reinitialize
the structure on each kmem_cache_alloc(), in particular the spinlocks of
the new allocation may have to serve a tail of use from a previous
allocation at the same address.

> 
> > Powerpc is like that.  I have no idea how much gets wasted that way.
> > I was keen not to degrade what s390 does: which is definitely superior,
> > but possibly not worth the effort.
> 
> Yeah, it would be good to understand if this is really sufficiently
> beneficial..
> 
> > I'll look into it, once I understand c2c224932fd0.  But may have to write
> > to Vishal first, or get the v2 of my series out: if only I could work out
> > a safe and easy way of unbreaking s390...

My latest notion is, just for getting v2 series out, a global spinlock:
to be replaced before reaching an actual release.

> 
> Can arches opt in to RCU freeing page table support and still keep
> your series sane?

Yes, or perhaps we mean different things: I thought most architectures
are already freeing page tables by RCU.  s390 included.
"git grep MMU_GATHER_RCU_TABLE_FREE" shows plenty of selects.

> 
> Honestly, I feel like trying to RCU enable page tables should be its
> own series. It is a sufficiently tricky subject on its own right.

Puzzled,
Hugh


* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-15 20:06                   ` Hugh Dickins
@ 2023-06-16  8:38                     ` Gerald Schaefer
  0 siblings, 0 replies; 158+ messages in thread
From: Gerald Schaefer @ 2023-06-16  8:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vasily Gorbik, Jason Gunthorpe, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, 15 Jun 2023 13:06:24 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Thu, 15 Jun 2023, Gerald Schaefer wrote:
> > On Wed, 14 Jun 2023 14:59:33 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> ...
> > > > 
> > > > Also, we would not need to use page->pt_mm, and therefore make room for
> > > > page->pt_frag_refcount, which for some reason is (still) being used
> > > > in new v4 from Vishal's "Split ptdesc from struct page" series...
> > > 
> > > Vishal's ptdesc: I've been ignoring as far as possible, I'll have to
> > > respond on that later today, I'm afraid it will be putting this all into
> > > an intolerable straitjacket.  If ptdesc is actually making more space
> > > available by some magic, great: but I don't expect to find that is so.
> > > Anyway, for now, there it's impossible (for me anyway) to think of that
> > > at the same time as this.  
> > 
> > I can totally relate to that. And I also had the feeling and hope that
> > ptdesc would give some relief on complex struct page (mis-)use, but did
> > not yet get into investigating further.  
> 
> ptdesc doesn't give any relief, just codifies some existing practices
> under new names, and tends to force one architecture to conform to
> another's methods.  As I warned Vishal earlier, s390 may need to go
> its own way: we can update ptdesc to meet whatever are s390's needs.
> 
> > 
> > [...]  
> > > > I do not fully understand if / why we need the new HH bits. While working
> > > > on my patch it seemed to be useful for sorting out list_add/del in the
> > > > various cases. Here it only seems to be used for preventing double rcu_head
> > > > usage, is this correct, or am I missing something?    
> > > 
> > > Correct, I only needed the HH bits for avoiding double rcu_head usage (then
> > > implementing the second usage once the first has completed).  If you want to
> > > distinguish pte_free_defer() tables from page_table_free_rcu() tables, the
> > > HH bits would need to be considered in other places too, I imagine: that
> > > gets more complicated, I fear.  
> > 
> > Yes, I have the same impression. My approach would prevent scary "unstable mm"
> > issues in __tlb_remove_table(), but probably introduce other subtle issues.
> > Or not so subtle, like potential double list_free(), as mentioned in my last
> > reply.  
> 
> A more urgent broken MIPS issue (with current linux-next) came up, so I
> didn't get to look at your patch at all yesterday (nor the interesting
> commit you pointed me to, still on my radar).  I take it from your words
> above and below, that you've gone off your patch, and I shouldn't spend
> time on it now - holding multiple approaches in mind gets me confused!

Correct, I think we should stick to your approach first: after all, it
would allow us to keep all the cleverness, and the fragment re-use.

In case the "unstable mm" issue turns out to be unsolvable, instead
of my also flawed approach, I guess we would rather take more drastic
measures, and turn to some fundamental change by removing the list
and need for page->lru completely, and not adding back fragments
at all. This is something we already considered before, when working
on the last race fix, in order to reduce complexity, which might
not have such a huge benefit anyway. We probably want to do some
measurements to get a better feeling on possible "memory waste"
effects.

> 
> > 
> > So it seems we have no completely safe approach so far, but I would agree
> > on going further with your approach for now. See below for patch comments.
> > 
> > [...]  
> > > 
> > > I'm getting this reply back to you, before reviewing your patch below.
> > > Probably the only way I can review yours is to try it myself and compare.
> > > I'll look into it, once I understand c2c224932fd0.  But may have to write
> > > to Vishal first, or get the v2 of my series out: if only I could work out
> > > a safe and easy way of unbreaking s390...
> > > 
> > > Is there any chance of you trying mine?
> > > But please don't let it waste your time.  
> > 
> > I have put it to some LTP tests now, and good news is that it does not show
> > any obvious issues.  
> 
> Oh that's very encouraging news, thanks a lot.
> 
> > Only some deadlocks on mm->context.lock,  
> 
> I assume lockdep reports of risk of deadlock, rather than actual deadlock
> seen?  I had meant to ask you to include lockdep (CONFIG_PROVE_LOCKING=y),
> but it sounds like you rightly did so anyway.

I actually hit the deadlock; CONFIG_PROVE_LOCKING was off, but I will
also give it a try with our debug config, which certainly would not be
wrong.

This was the deadlock, on the spin_lock() in __tlb_flush_mm_lazy(),
triggered from finish_arch_post_lock_switch():

[ 1275.753548] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1275.753554] rcu:     8-....: (6000 ticks this GP) idle=918c/1/0x4000000000000000 softirq=15306/15306 fqs=1965
[ 1275.753582] rcu:     (t=6000 jiffies g=47925 q=27525 ncpus=16)
[ 1275.753584] Task dump for CPU 4:
[ 1275.753585] task:mmap1           state:R  running task     stack:0     pid:12815 ppid:12787  flags:0x00000005
[ 1275.753589] Call Trace:
[ 1275.753591]  [<00000000d79cec28>] __alloc_pages+0x158/0x310 
[ 1275.753598]  [<00000000d84a5288>] __func__.3+0x70/0x80 
[ 1275.753604]  [<00000000d83738d0>] do_ext_irq+0xe8/0x160 
[ 1275.753608]  [<00000000d83839ec>] ext_int_handler+0xc4/0xf0 
[ 1275.753610]  [<00000000d79b3574>] tlb_flush_mmu+0x14c/0x1c8 
[ 1275.753615]  [<00000000d79b3722>] tlb_finish_mmu+0x3a/0xc8 
[ 1275.753617]  [<00000000d79ace30>] unmap_region+0xf8/0x118 
[ 1275.753619]  [<00000000d79af98a>] do_vmi_align_munmap+0x2da/0x4b0 
[ 1275.753621]  [<00000000d79afc5a>] do_vmi_munmap+0xba/0xf0 
[ 1275.753623]  [<00000000d79afd34>] __vm_munmap+0xa4/0x168 
[ 1275.753625]  [<00000000d79afee2>] __s390x_sys_munmap+0x32/0x40 
[ 1275.753627]  [<00000000d8373654>] __do_syscall+0x1d4/0x200 
[ 1275.753629]  [<00000000d8383768>] system_call+0x70/0x98 
[ 1275.753632] CPU: 8 PID: 12816 Comm: mmap1 Not tainted 6.4.0-rc6-00037-gb6dad5178cea-dirty #6
[ 1275.753635] Hardware name: IBM 3931 A01 704 (LPAR)
[ 1275.753636] Krnl PSW : 0704e00180000000 00000000d777d12c (finish_arch_post_lock_switch+0x34/0x170)
[ 1275.753643]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1275.753645] Krnl GPRS: 00000000000000f4 0000000000000001 000000009d5915c8 0000000096be9b4e
[ 1275.753647]            0000000000000000 000000000000000b 0000000091f5a100 0000000000000000
[ 1275.753687]            000000009d590600 00000000803d4200 00000003f898d100 000000009d591200
[ 1275.753689]            000000000000021a 00000000d891a930 0000038003e1fd20 0000038003e1fce0
[ 1275.753696] Krnl Code: 00000000d777d11e: 4120b3c8            la      %r2,968(%r11)
[ 1275.753696]            00000000d777d122: 5810b3c8            l       %r1,968(%r11)
[ 1275.753696]           #00000000d777d126: ec180008007e        cij     %r1,0,8,00000000d777d136
[ 1275.753696]           >00000000d777d12c: 58102000            l       %r1,0(%r2)
[ 1275.753696]            00000000d777d130: ec16fffe007e        cij     %r1,0,6,00000000d777d12c
[ 1275.753696]            00000000d777d136: 583003a0            l       %r3,928
[ 1275.753696]            00000000d777d13a: 41a0b4a8            la      %r10,1192(%r11)
[ 1275.753696]            00000000d777d13e: ec2323bc3d59        risbgn  %r2,%r3,35,188,61
[ 1275.753711] Call Trace:
[ 1275.753712]  [<00000000d777d12c>] finish_arch_post_lock_switch+0x34/0x170 
[ 1275.753715]  [<00000000d777ec66>] finish_task_switch.isra.0+0x8e/0x238 
[ 1275.753718]  [<00000000d837bb2a>] __schedule+0x2f2/0x770 
[ 1275.753721]  [<00000000d837c00a>] schedule+0x62/0x108 
[ 1275.753724]  [<00000000d77eea46>] exit_to_user_mode_prepare+0xde/0x1a8 
[ 1275.753728]  [<00000000d8373cf4>] irqentry_exit_to_user_mode+0x1c/0x70 
[ 1275.753731]  [<00000000d83838d6>] pgm_check_handler+0x116/0x168 
[ 1275.753733] Last Breaking-Event-Address:
[ 1275.753733]  [<00000000d777d130>] finish_arch_post_lock_switch+0x38/0x170

> 
> > but that can
> > easily be solved. The problem is that we have some users of that lock who
> > do spin_lock() and not spin_lock_bh(). In the past, we had 3 different locks
> > in mm->context, and then combined them into the same one. Only the pagetable
> > list locks were taken with spin_lock_bh(), the others used spin_lock().  
> 
> I'd noticed that discrepancy, and was a little surprised that it wasn't
> already causing problems (not being a driver person, I rarely come across
> spin_lock_bh(); but by coincidence had to look into it very recently, to
> fix a 6.4-rc iwlwifi regression on this laptop - and lockdep didn't like
> me mixing spin_lock() and spin_lock_bh() there).
> 
> > 
> > Of course, after combining them into the same lock, the others should also
> > have been changed to use spin_lock_bh(), at least if there was any good
> > reason for using _bh for the pagetable list lock.
> > It seems there was not, which is why that mismatch has not caused any
> > issues so far; probably there once was a reason, which got removed in one
> > of the various reworks of that code...
> > 
> > With your patch, we do now have a reason, because __tlb_remove_table()
> > will usually be called in _bh context as RCU callback, and now also
> > takes that lock. So we also need this change (and two compile fixes,
> > marked below):  
> 
> Right.  Though with my latest idea, we can use a separate lock for the
> page table list, and leave mm->context.lock with spin_lock() as is.

I think we want that change anyway; it might also make sense to put it
in a separate patch. Simply for consistency: even if we are fine ATM,
the current mix of spin_lock() and spin_lock_bh() only causes confusion.
I also see no negative impact from that change, and it surely feels
better than a new global lock.
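
To make the concern concrete, here is a minimal sketch (not the actual
s390 code; all names are invented) of why the _bh variant matters once
__tlb_remove_table() takes the lock from an RCU callback in softirq
context:

#include <linux/spinlock.h>
#include <linux/rcupdate.h>

/* Stand-in for mm->context.lock; illustrative only. */
static DEFINE_SPINLOCK(frag_list_lock);

static void process_context_path(void)
{
	/*
	 * Softirqs must be disabled while holding the lock: if the RCU
	 * callback below ran as a softirq on this CPU while we held the
	 * lock with a plain spin_lock(), it would spin forever on a lock
	 * its own CPU already owns.
	 */
	spin_lock_bh(&frag_list_lock);
	/* ... add or remove a page table fragment on the list ... */
	spin_unlock_bh(&frag_list_lock);
}

static void rcu_callback_path(struct rcu_head *head)
{
	/* Runs in softirq context via call_rcu(): plain spin_lock() is fine. */
	spin_lock(&frag_list_lock);
	/* ... remove the deferred fragment from the list ... */
	spin_unlock(&frag_list_lock);
}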

> 
> > 
> > --- a/arch/s390/include/asm/tlbflush.h
> > +++ b/arch/s390/include/asm/tlbflush.h
> > @@ -79,12 +79,12 @@ static inline void __tlb_flush_kernel(vo
> >  
> >  static inline void __tlb_flush_mm_lazy(struct mm_struct * mm)
> >  {
> > -	spin_lock(&mm->context.lock);
> > +	spin_lock_bh(&mm->context.lock);
> >  	if (mm->context.flush_mm) {
> >  		mm->context.flush_mm = 0;
> >  		__tlb_flush_mm(mm);
> >  	}
> > -	spin_unlock(&mm->context.lock);
> > +	spin_unlock_bh(&mm->context.lock);
> >  }
> >  
> >  /*
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -102,14 +102,14 @@ struct gmap *gmap_create(struct mm_struc
> >  	if (!gmap)
> >  		return NULL;
> >  	gmap->mm = mm;
> > -	spin_lock(&mm->context.lock);
> > +	spin_lock_bh(&mm->context.lock);
> >  	list_add_rcu(&gmap->list, &mm->context.gmap_list);
> >  	if (list_is_singular(&mm->context.gmap_list))
> >  		gmap_asce = gmap->asce;
> >  	else
> >  		gmap_asce = -1UL;
> >  	WRITE_ONCE(mm->context.gmap_asce, gmap_asce);
> > -	spin_unlock(&mm->context.lock);
> > +	spin_unlock_bh(&mm->context.lock);
> >  	return gmap;
> >  }
> >  EXPORT_SYMBOL_GPL(gmap_create);
> > @@ -250,7 +250,7 @@ void gmap_remove(struct gmap *gmap)
> >  		spin_unlock(&gmap->shadow_lock);
> >  	}
> >  	/* Remove gmap from the pre-mm list */
> > -	spin_lock(&gmap->mm->context.lock);
> > +	spin_lock_bh(&gmap->mm->context.lock);
> >  	list_del_rcu(&gmap->list);
> >  	if (list_empty(&gmap->mm->context.gmap_list))
> >  		gmap_asce = 0;
> > @@ -260,7 +260,7 @@ void gmap_remove(struct gmap *gmap)
> >  	else
> >  		gmap_asce = -1UL;
> >  	WRITE_ONCE(gmap->mm->context.gmap_asce, gmap_asce);
> > -	spin_unlock(&gmap->mm->context.lock);
> > +	spin_unlock_bh(&gmap->mm->context.lock);
> >  	synchronize_rcu();
> >  	/* Put reference */
> >  	gmap_put(gmap);  
> 
> So we won't need to include those changes above...
> 
> > 
> > These are the compile fixes:
> >   
> > > +static void pte_free_now1(struct rcu_read *head)  
> > 
> > rcu_read -> rcu_head
> >   
> > > +{
> > > +	pte_free_half(head, 1);
> > > +}
> > > +
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > +	unsigned int bit, mask;
> > > +	struct page *page;
> > > +
> > > +	page = virt_to_page(pgtable);
> > > +	WARN_ON_ONCE(page->pt_mm != mm);
> > > +	if (mm_alloc_pgste(mm)) {
> > > +		call_rcu(&page->rcu_head, pte_free_pgste)  
> > 
> > Missing ";" at the end
> >   
> > > +		return;
> > > +	}  
> 
> ... but of course I must add these in: many thanks.
> And read up on the interesting commit.
> 
> You don't mention whether you were running with the
> #define destroy_context synchronize_rcu
> patch in.  And I was going to ask you to clarify on that,
> but there's no need: I found this morning that it was a bad idea.

I wasn't, only your original patch, with compile fixes and spin_lock_bh()
changes, and our defconfig.

> 
> Of course x86 doesn't tell a lot about s390 down at this level, and
> what's acceptable on one may be unacceptable on the other; but when
> I tried a synchronize_rcu() in x86's destroy_context(), the machines
> were not happy under load (warning messages, then a freeze): it looks like
> final __mmdrop() can sometimes be called from a context which is
> not at all suited for synchronize_rcu().
> 
> So then as another experiment, I tried adding synchronize_rcu() into
> the safer context at the end of exit_mmap(): that ran okay, but
> significantly slower (added 12% on kernel builds) than before.
> 
> My latest idea: we keep a SLAB_TYPESAFE_BY_RCU kmem cache for the
> spinlock, and most probably the pgtable_list head, and back pointer
> to the mm; and have each page table page using the pt_mm field to
> point to that structure, instead of to the mm itself.  Then
> __tlb_remove_table() can safely take the lock, even when the
> mm itself has gone away and been reused, even when that structure
> has gone away and been reused.  Hmm, I don't think it will even
> need to contain a backpointer to the mm.
> 
> But I see that as the right way forward, rather than as something
> needed today or tomorrow: in the meanwhile, to get v2 of my patchset
> out without breaking s390, I'm contemplating (the horror!) a global
> spinlock.

I am not sure I understood that right: would you want that global lock
as a replacement for the current spin_lock() vs. spin_lock_bh()
confusion, or to address the "unstable mm" issue? In the first case, I
don't think it would be necessary; using spin_lock_bh() in all places
should be fine.
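
As for the SLAB_TYPESAFE_BY_RCU idea quoted above, a rough sketch of how
I picture it (all names are hypothetical, nothing from a posted patch):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/list.h>

/*
 * A small object owning the lock and the fragment list, allocated from a
 * SLAB_TYPESAFE_BY_RCU cache: even if the mm (and this object) have been
 * freed and reused, __tlb_remove_table() can still safely take the
 * spinlock it reaches through the page's back pointer.
 */
struct pgtable_frag_ctx {
	spinlock_t lock;		/* would replace mm->context.lock for the list */
	struct list_head pgtable_list;	/* 4K pages with a free 2K half */
};

static struct kmem_cache *pgtable_frag_ctx_cache;

static void pgtable_frag_ctx_ctor(void *obj)
{
	struct pgtable_frag_ctx *ctx = obj;

	/* Run once per slab object; stays valid across RCU-typesafe reuse. */
	spin_lock_init(&ctx->lock);
	INIT_LIST_HEAD(&ctx->pgtable_list);
}

static int __init pgtable_frag_ctx_cache_init(void)
{
	pgtable_frag_ctx_cache = kmem_cache_create("pgtable_frag_ctx",
			sizeof(struct pgtable_frag_ctx), 0,
			SLAB_TYPESAFE_BY_RCU, pgtable_frag_ctx_ctor);
	return pgtable_frag_ctx_cache ? 0 : -ENOMEM;
}

Each page table page would then point at such an object from its
pt_mm-style field, instead of at the mm itself.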

> 
> Many thanks, Gerald, I feel much better about it today.

Pleased to hear that :-)

Taking a closer look at your patch, I also have some more questions,
related to the page_table_alloc() changes, and to possibly breaking
commit c2c224932fd0:

> @@ -244,25 +252,15 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  			page = list_first_entry(&mm->context.pgtable_list,
>  						struct page, lru);
>  			mask = atomic_read(&page->_refcount) >> 24;
> -			/*
> -			 * The pending removal bits must also be checked.
> -			 * Failure to do so might lead to an impossible
> -			 * value of (i.e 0x13 or 0x23) written to _refcount.
> -			 * Such values violate the assumption that pending and
> -			 * allocation bits are mutually exclusive, and the rest
> -			 * of the code unrails as result. That could lead to
> -			 * a whole bunch of races and corruptions.
> -			 */
> -			mask = (mask | (mask >> 4)) & 0x03U;
> -			if (mask != 0x03U) {
> -				table = (unsigned long *) page_to_virt(page);
> -				bit = mask & 1;		/* =1 -> second 2K */
> -				if (bit)
> -					table += PTRS_PER_PTE;
> -				atomic_xor_bits(&page->_refcount,
> -							0x01U << (bit + 24));
> -				list_del(&page->lru);
> -			}
> +			/* Cannot be on this list if either half pending free */
> +			WARN_ON_ONCE(mask & ~0x03U);
> +			/* One or other half must be available, but not both */
> +			WARN_ON_ONCE(mask == 0x00U || mask == 0x03U);
> +			table = (unsigned long *)page_to_virt(page);
> +			bit = mask & 0x01U;	/* =1 -> second 2K available */
> +			table += bit * PTRS_PER_PTE;
> +			atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));
> +			list_del(&page->lru);
>  		}
>  		spin_unlock_bh(&mm->context.lock);
>  		if (table)

I do not really see why this is needed, and it also does not seem to
change anything relevant w.r.t. the bits; it only makes the check less
restrictive by removing the PP bits part. Of course, removing that
important-looking comment also does not feel very good.

I guess it is related either to what you do in pte_free_defer(), or in
__tlb_remove_table(), but both places are under the spin_lock, and I am
not sure what this change here is good for (I think I thought I
understood it on first review, but not any more; unfortunately that
often happens with this code...)
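
For reference, this is how I read the upper byte of page->_refcount for
a 4K page carrying two 2K fragments; a sketch of the existing convention
only, with an invented helper name:

#include <linux/types.h>

/*
 *   0x01 / 0x02  "AA" bits: lower / upper 2K half is allocated
 *   0x10 / 0x20  "PP" bits: lower / upper 2K half is pending free
 *
 * The hunk removed above folds PP onto AA, so that a half which is
 * pending removal is never handed out again from pgtable_list.
 */
static bool pick_free_half(unsigned int refcount_high, unsigned int *bit)
{
	unsigned int mask = refcount_high;

	mask = (mask | (mask >> 4)) & 0x03U;	/* busy if allocated or pending */
	if (mask == 0x03U)
		return false;			/* both halves busy */
	*bit = mask & 0x01U;			/* =1 -> lower half busy, use upper 2K */
	return true;
}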

> @@ -314,24 +313,22 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  	struct page *page;
>  
>  	page = virt_to_page(table);
> +	WARN_ON_ONCE(page->pt_mm != mm);
>  	if (!mm_alloc_pgste(mm)) {
>  		/* Free 2K page table fragment of a 4K page */
>  		bit = ((unsigned long) table & ~PAGE_MASK)/(PTRS_PER_PTE*sizeof(pte_t));
>  		spin_lock_bh(&mm->context.lock);
>  		/*
> -		 * Mark the page for delayed release. The actual release
> -		 * will happen outside of the critical section from this
> -		 * function or from __tlb_remove_table()
> +		 * Mark the page for release. The actual release will happen
> +		 * below from this function, or later from __tlb_remove_table().
>  		 */
> -		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> +		mask = atomic_xor_bits(&page->_refcount, 0x01U << (bit + 24));

I guess you did this ...

>  		mask >>= 24;
> -		if (mask & 0x03U)
> +		if (mask & 0x03U)		/* other half is allocated */
>  			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> +		else if (!(mask & 0x30U))	/* other half not pending */

... to be able to add this check. But pte_free_defer() and page_table_free()
should never be called for the very same fragment, so (temporarily) setting
the P bit above, and clearing it below, should not cause any conflict with
the P bit set from pte_free_defer(), right?

Of course, the "if (!(mask & 0x30U))" check could not work as intended
if you did not remove the temporary P bit set above, but I guess that
could also be solved by explicitly checking for the other half's P bit,
instead of checking for any P bit with 0x30U.

>  			list_del(&page->lru);
>  		spin_unlock_bh(&mm->context.lock);
> -		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> -		mask >>= 24;

This reset of the temporary P bit might need to be moved inside the
spin_lock(), so that pte_free_defer() would observe a consistent state.

The race fix from commit c2c224932fd0 wants to temporarily set the P
bit for this fragment, until after list_del(), to prevent
__tlb_remove_table() from observing zero too early (it currently has no
spin_lock()!) and doing its free_page() in conflict with the list_del().

I am not sure whether your changes to __tlb_remove_table() change
anything in that regard, now that it has access to the spin_lock(). But
I also do not see why the above change would necessarily be needed, and
it risks reintroducing that race.
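
To spell out what I mean, here is a sketch (not a proposed patch) of the
page_table_free() fragment with the temporary P bit kept across
list_del() and only cleared while still under the lock:

	spin_lock_bh(&mm->context.lock);
	/* Clear the A bit and set the temporary P bit for this 2K half. */
	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
	mask >>= 24;
	if (mask & 0x03U)		/* other half still allocated */
		list_add(&page->lru, &mm->context.pgtable_list);
	else
		list_del(&page->lru);
	/*
	 * Clear the temporary P bit only now, still under the lock, so a
	 * concurrent __tlb_remove_table() or pte_free_defer() can never
	 * see all bits zero while the page is being unlinked.
	 */
	mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
	mask >>= 24;
	spin_unlock_bh(&mm->context.lock);
	if (mask != 0x00U)
		return;
	/* Both halves free, nothing pending: release the 4K page as before. */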

^ permalink raw reply	[flat|nested] 158+ messages in thread

* Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()
  2023-06-15 21:09                   ` Hugh Dickins
@ 2023-06-16 12:35                     ` Jason Gunthorpe
  0 siblings, 0 replies; 158+ messages in thread
From: Jason Gunthorpe @ 2023-06-16 12:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Gerald Schaefer, Vasily Gorbik, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	linux-s390

On Thu, Jun 15, 2023 at 02:09:30PM -0700, Hugh Dickins wrote:
> On Thu, 15 Jun 2023, Jason Gunthorpe wrote:
> > On Wed, Jun 14, 2023 at 02:59:33PM -0700, Hugh Dickins wrote:
> > 
> > > I guess the best thing would be to modify kernel/fork.c to allow the
> > > architecture to override free_mm(), and have arch/s390 use call_rcu to
> > > free the mm.  But as a quick and dirty s390-end workaround, how about:
> > 
> > RCU callbacks are not ordered so that doesn't seem like it helps..
> 
> Thanks, that's an interesting and important point, which I need to knock
> into my head better.
> 
> But can you show me where that's handled in the existing mm/mmu_gather.c
> include/asm-generic/tlb.h framework?  I don't see any rcu_barrier()s
> there, yet don't the pmd_huge_pte pointers point into pud page tables
> freed shortly afterwards also by RCU?

I don't know anything about the pmd_huge_pte stuff... I was expecting it
to be cleaned up explicitly before things reached the call_rcu? Where is
it touched from a call_rcu callback?

> > Making the page frag pool global (per-cpu global I guess) would also
> > remove the need to reach back to the freeable mm_struct and reduce the
> > need for struct page memory. This views it as a special kind of
> > kmemcache.
> 
> I haven't thought in that direction at all.  Hmm.  Or perhaps I thought
> of it once, but discarded it for accounting reasons - IIRC (I haven't
> rechecked) page table pages are charged to memcg, and counted for meminfo
> and other(?) purposes: if the fragments are all lumped into a global
> pool, we lose that.

You'd have to search the free list for fragments that match the current
memcg to avoid creating mismatches :\, or rework how memcg accounting
works for page tables - e.g. move the memcg from the struct page to the
mm_struct so that each frag can be accounted differently.

> > Can arches opt in to RCU freeing page table support and still keep
> > your series sane?
> 
> Yes, or perhaps we mean different things: I thought most architectures
> are already freeing page tables by RCU.  s390 included.
> "git grep MMU_GATHER_RCU_TABLE_FREE" shows plenty of selects.

MMU_GATHER_RCU_TABLE_FREE is a very confusing option. What it really
says is that the architecture doesn't do an IPI, so we sometimes use RCU
as a replacement for the IPI - but not always.

Specifically, this means it does not allow RCU reading of the page
tables. You still have to take the IPI-blocking, interrupt-disabling
"lock" to read page tables, even if MMU_GATHER_RCU_TABLE_FREE is set.
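
In other words, a lockless reader today looks roughly like GUP-fast and
relies on disabling interrupts, not on rcu_read_lock(); a sketch (the
walker itself is hypothetical):

#include <linux/irqflags.h>
#include <linux/mm_types.h>

int do_lockless_walk(struct mm_struct *mm, unsigned long addr); /* hypothetical */

static int walk_pagetables_lockless(struct mm_struct *mm, unsigned long addr)
{
	unsigned long flags;
	int ret;

	/*
	 * Disabling IRQs blocks the TLB-flush IPI on IPI-based
	 * architectures, and also delays the (sched-flavoured) RCU grace
	 * period used by MMU_GATHER_RCU_TABLE_FREE, so the page tables
	 * cannot be freed underneath us while we walk them.
	 */
	local_irq_save(flags);
	ret = do_lockless_walk(mm, addr);
	local_irq_restore(flags);

	return ret;
}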

IMHO I would be a lot happier with what you were trying to do here if it
came along with full RCU enabling of page tables, so that we could say
that rcu_read_lock() is sufficient locking to read page tables *always*.

I hadn't really put together how this series works such that we can
introduce rcu_read_lock() in only one specific place...

My query was simpler - if we could find enough space to put an rcu_head
in the ptdesc for many architectures, and thus *always* RCU-free on many
architectures, could you do what you want but disable it on S390 and
POWER, which would still have to rely on an rcu_head allocation and a
backup IPI?

Jason

^ permalink raw reply	[flat|nested] 158+ messages in thread

end of thread, other threads:[~2023-06-16 12:35 UTC | newest]

Thread overview: 158+ messages
2023-05-29  6:11 [PATCH 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-05-29  6:14 ` [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-05-31 17:06   ` Jann Horn
2023-06-02  2:50     ` Hugh Dickins
2023-06-02 14:21       ` Jann Horn
2023-05-29  6:16 ` [PATCH 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-05-29 13:56   ` Matthew Wilcox
     [not found]   ` <ZHeg3oRljRn6wlLX@ziepe.ca>
2023-06-02  5:35     ` Hugh Dickins
2023-05-29  6:17 ` [PATCH 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-05-29  6:18 ` [PATCH 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-05-29  6:20 ` [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-05-29 14:02   ` Matthew Wilcox
2023-05-29 14:36     ` Hugh Dickins
2023-06-01 13:57       ` Gerald Schaefer
2023-06-02  6:38         ` Hugh Dickins
2023-06-02 14:20     ` Jason Gunthorpe
2023-06-06  3:40       ` Hugh Dickins
2023-06-06 18:23         ` Jason Gunthorpe
2023-06-06 19:03           ` Peter Xu
2023-06-06 19:08             ` Jason Gunthorpe
2023-06-07  3:49               ` Hugh Dickins
2023-05-29  6:21 ` [PATCH 06/12] sparc: " Hugh Dickins
2023-06-06  3:46   ` Hugh Dickins
2023-05-29  6:22 ` [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async() Hugh Dickins
2023-06-06  5:11   ` Hugh Dickins
2023-06-06 18:39     ` Jason Gunthorpe
2023-06-08  2:46       ` Hugh Dickins
2023-06-06 19:40     ` Gerald Schaefer
2023-06-08  3:35       ` Hugh Dickins
2023-06-08 13:58         ` Jason Gunthorpe
2023-06-08 15:47         ` Gerald Schaefer
2023-06-13  6:34           ` Hugh Dickins
2023-06-14 13:30             ` Gerald Schaefer
2023-06-14 21:59               ` Hugh Dickins
2023-06-15 12:11                 ` Gerald Schaefer
2023-06-15 20:06                   ` Hugh Dickins
2023-06-16  8:38                     ` Gerald Schaefer
2023-06-15 12:34                 ` Jason Gunthorpe
2023-06-15 21:09                   ` Hugh Dickins
2023-06-16 12:35                     ` Jason Gunthorpe
2023-05-29  6:23 ` [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-06-01 13:31   ` Jann Horn
     [not found]   ` <ZHekpAKJ05cr/GLl@ziepe.ca>
2023-06-02  6:03     ` Hugh Dickins
2023-06-02 12:15       ` Jason Gunthorpe
2023-05-29  6:25 ` [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-05-29 23:26   ` Peter Xu
2023-05-31  0:38     ` Hugh Dickins
2023-05-31 15:34   ` Jann Horn
     [not found]     ` <ZHe0A079X9B8jWlH@x1n>
2023-05-31 22:18       ` Jann Horn
2023-06-01 14:06         ` Jason Gunthorpe
2023-06-06  6:18     ` Hugh Dickins
2023-05-29  6:26 ` [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-05-31 17:25   ` Jann Horn
2023-06-02  5:11     ` Hugh Dickins
2023-05-29  6:28 ` [PATCH 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-05-29  6:30 ` [PATCH 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
2023-05-31 17:59 ` [PATCH 00/12] mm: free retracted page table by RCU Jann Horn
2023-06-02  4:37   ` Hugh Dickins
2023-06-02 15:26     ` Jann Horn
2023-06-06  6:28       ` Hugh Dickins
