* [PATCH v3 00/13] mm: free retracted page table by RCU
@ 2023-07-12  4:27 Hugh Dickins
  2023-07-12  4:30 ` [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
                   ` (12 more replies)
  0 siblings, 13 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Here is v3 of the series of patches to mm (and a few architectures), in
which khugepaged takes advantage of pte_offset_map[_lock]() allowing for
pmd transitions.  It is based on v6.5-rc1, which includes the preceding
two series (thank you!).  Differences from v1 and v2 are noted patch by
patch below.

This replaces the v2 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8ebfb0@google.com/
series of 12 posted on 2023-06-20.

What is it all about?  Some mmap_lock avoidance, i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by the
mmap_write_lock it currently requires.
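
For anyone unfamiliar with MADV_COLLAPSE, here is a minimal userspace
sketch of the call whose latency this series aims to reduce.  It is
illustrative only: it assumes a v6.1+ kernel with THP enabled, collapses
one PMD-sized extent of a MAP_SHARED|MAP_ANONYMOUS (i.e. shmem-backed)
mapping, and trims error handling.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic/mman-common.h, v6.1+ */
#endif

int main(void)
{
	size_t pmd_size = 2UL << 20;	/* PMD size on x86_64 */
	char *map, *aligned;

	map = mmap(NULL, 2 * pmd_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);	/* shmem-backed */
	if (map == MAP_FAILED)
		return 1;
	/* round up to a PMD boundary, so one whole extent is covered */
	aligned = (char *)(((unsigned long)map + pmd_size - 1) &
			   ~(pmd_size - 1));
	memset(aligned, 0, pmd_size);	/* populate the small pages */
	if (madvise(aligned, pmd_size, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");
	return 0;
}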

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on v6.5-rc1; and applies almost cleanly to current mm-unstable or
current linux-next - just one patch conflicts, 12/13: I'll reply to that
one with its mm-unstable or linux-next equivalent (vma_assert_locked()
has been added next to where vma_try_start_write() is being removed).

01/13 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
      v3: same as v1
02/13 mm/pgtable: add PAE safety to __pte_offset_map()
      v3: same as v2
      v2: rename to pmdp_get_lockless_start/end() per Matthew;
          so use inlines without _irq_save(flags) macro oddity;
          add pmdp_get_lockless_sync() for use later in 09/13.
03/13 arm: adjust_pte() use pte_offset_map_nolock()
      v3: same as v1
04/13 powerpc: assert_pte_locked() use pte_offset_map_nolock()
      v3: same as v1
05/13 powerpc: add pte_free_defer() for pgtables sharing page
      v3: much simpler version, following suggestion by Jason
      v2: fix rcu_head usage to cope with concurrent deferrals;
          add para to commit message explaining rcu_head issue.
06/13 sparc: add pte_free_defer() for pte_t *pgtable_t
      v3: same as v2
      v2: use page_address() instead of less common page_to_virt();
          add para to commit message explaining simple conversion;
          changed title since sparc64 pgtables do not share page.
07/13 s390: add pte_free_defer() for pgtables sharing page
      v3: much simpler version, following suggestion by Gerald
      v2: complete rewrite, integrated with s390's existing pgtable
          management; temporarily using a global mm_pgtable_list_lock,
          to be restored to per-mm spinlock in a later followup patch.
08/13 mm/pgtable: add pte_free_defer() for pgtable as page
      v3: same as v2
      v2: add comment on rcu_head to "Page table pages", per JannH
09/13 mm/khugepaged: retract_page_tables() without mmap or vma lock
      v3: same as v2
      v2: repeat checks under ptl because UFFD, per PeterX and JannH;
          bring back mmu_notifier calls for PMD, per JannH and Jason;
          pmdp_get_lockless_sync() to issue missing interrupt if PAE.
10/13 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
      v3: updated to using ptent instead of *pte
      v2: first check VMA, in case page tables torn down, per JannH;
          pmdp_get_lockless_sync() to issue missing interrupt if PAE;
          moved mmu_notifier after step 1, reworked final goto labels.
11/13 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
      v3: rediffed
      v2: same as v1
12/13 mm: delete mmap_write_trylock() and vma_try_start_write()
      v3: rediffed (different diff needed for mm-unstable or linux-next)
      v2: same as v1
13/13 mm/pgtable: notes on pte_offset_map[_lock]()
      v3: new: JannH asked for more helpful comment, this is my attempt;
          could be moved to be the first in the series.

 arch/arm/mm/fault-armv.c            |   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c      |  29 +-
 arch/powerpc/mm/pgtable.c           |  16 +-
 arch/s390/include/asm/pgalloc.h     |   4 +
 arch/s390/mm/pgalloc.c              |  80 ++++-
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c             |  16 +
 include/linux/mm.h                  |  17 --
 include/linux/mm_types.h            |   4 +
 include/linux/mmap_lock.h           |  10 -
 include/linux/pgtable.h             |  10 +-
 mm/khugepaged.c                     | 481 +++++++++++-------------------
 mm/pgtable-generic.c                |  97 +++++-
 14 files changed, 404 insertions(+), 371 deletions(-)

Hugh


* [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
@ 2023-07-12  4:30 ` Hugh Dickins
  2023-07-12  4:32 ` [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
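
For reference, the balance this relies on is the usual caller pattern,
sketched here in kernel context (no new API, just the existing calls):

	pte_t *pte = pte_offset_map(pmd, addr);	/* takes rcu_read_lock() */

	if (!pte)
		return;	/* failed: __pte_offset_map() already unlocked RCU */
	/* ... examine or update the page table entries ... */
	pte_unmap(pte);				/* calls rcu_read_unlock() */

An unmatched pte_offset_map() would now leave the task in an RCU
read-side critical section, and an unmatched pte_unmap() would underflow
it: exactly the imbalances this commit is positioned to expose.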

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..5134edcec668 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte)	do {	\
 	kunmap_local((pte));	\
-	/* rcu_read_unlock() to be added later */	\
+	rcu_read_unlock();	\
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 }
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4d454953046f..400e5a045848 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
 	pmd_t pmdval;
 
-	/* rcu_read_lock() to be added later */
+	rcu_read_lock();
 	pmdval = pmdp_get_lockless(pmd);
 	if (pmdvalp)
 		*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	}
 	return __pte_map(&pmdval, addr);
 nomap:
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 	return NULL;
 }
 
-- 
2.35.3



* [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
  2023-07-12  4:30 ` [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
@ 2023-07-12  4:32 ` Hugh Dickins
  2023-07-12  4:33 ` [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
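
To illustrate the hazard with a stand-alone sketch (made-up types, not
the kernel's pmd_t or its accessors): a 64-bit entry read as two 32-bit
halves can pair a low half from one entry with a high half from another.

	#include <stdint.h>

	struct fake_pmd {		/* stand-in for a PAE-style entry */
		volatile uint32_t low;
		volatile uint32_t high;
	};

	static uint64_t racy_read(const struct fake_pmd *p)
	{
		uint32_t lo = p->low;
		/* a concurrent pmd transition can complete right here */
		uint32_t hi = p->high;

		/* the combined value may never have existed in the table */
		return ((uint64_t)hi << 32) | lo;
	}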

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting
the lock would just send it back to __pte_offset_map() again.

Complement the pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but arc's pmd_t is 32-bit even when pte_t is 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but that would
need a little more work: to retry if pmd_low looks good for a page table,
but pmd_high is non-zero from a THP (and that might be making
x86-specific assumptions).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pgtable.h |  4 ++++
 mm/pgtable-generic.c    | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 	return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
 	return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+	unsigned long irqflags;
+
+	local_irq_save(irqflags);
+	return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+	local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long irqflags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	irqflags = pmdp_get_lockless_start();
 	pmdval = pmdp_get_lockless(pmd);
+	pmdp_get_lockless_end(irqflags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3



* [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
  2023-07-12  4:30 ` [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
  2023-07-12  4:32 ` [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
@ 2023-07-12  4:33 ` Hugh Dickins
  2023-07-12  4:34 ` [PATCH v3 04/13] powerpc: assert_pte_locked() " Hugh Dickins
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.
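
The caller pattern being adopted here (and in the next patch) is, in
sketch form - kernel context, using only the already-merged
pte_offset_map_nolock() interface:

	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);

	if (!pte)
		return 0;	/* no page table here (any more) */
	spin_lock(ptl);		/* ptl matches the table actually mapped */
	/* ... operate on the ptes ... */
	spin_unlock(ptl);
	pte_unmap(pte);

(adjust_pte() itself takes the lock through its do_pte_lock() helper, to
handle the nested-locking case, as the hunk below shows.)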

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
 	if (!pte)
 		return 0;
 
-	ptl = pte_lockptr(vma->vm_mm, pmd);
 	do_pte_lock(ptl);
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3



* [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (2 preceding siblings ...)
  2023-07-12  4:33 ` [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
@ 2023-07-12  4:34 ` Hugh Dickins
  2023-07-18 10:41   ` Aneesh Kumar K.V
  2023-07-12  4:35 ` [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: these would either expose my ignorance,
or indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/mm/pgtable.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (mm == &init_mm)
 		return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
-	/*
-	 * khugepaged to collapse normal pages to hugepage, first set
-	 * pmd to none to force page fault/gup to take mmap_lock. After
-	 * pmd is set to none, we do a pte_clear which does this assertion
-	 * so if we find pmd none, return.
-	 */
-	if (pmd_none(*pmd))
-		return;
-	BUG_ON(!pmd_present(*pmd));
-	assert_spin_locked(pte_lockptr(mm, pmd));
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	BUG_ON(!pte);
+	assert_spin_locked(ptl);
+	pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
2.35.3



* [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (3 preceding siblings ...)
  2023-07-12  4:34 ` [PATCH v3 04/13] powerpc: assert_pte_locked() " Hugh Dickins
@ 2023-07-12  4:35 ` Hugh Dickins
  2023-07-12  4:37 ` [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t Hugh Dickins
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when the last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3



* [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (4 preceding siblings ...)
  2023-07-12  4:35 ` [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
@ 2023-07-12  4:37 ` Hugh Dickins
  2023-07-12  4:38 ` [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/sparc/include/asm/pgalloc_64.h |  4 ++++
 arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-- 
2.35.3



* [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (5 preceding siblings ...)
  2023-07-12  4:37 ` [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t Hugh Dickins
@ 2023-07-12  4:38 ` Hugh Dickins
  2023-07-13  4:47   ` Alexander Gordeev
  2023-07-19 14:25   ` Claudio Imbrenda
  2023-07-12  4:39 ` [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than the others: because s390 fits two
2K page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves, with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.
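
Condensed from the hunks below, the decision at the final free of a half
becomes simply:

	if (TestClearPageActive(page))
		call_rcu(&page->rcu_head, pte_free_now);
	else
		pte_free_now(&page->rcu_head);

so RCU is only involved when some pte_free_defer() marked the page
Active, and only once both halves are gone.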

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 80 +++++++++++++++++++++++++++++------
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 					table += PTRS_PER_PTE;
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del_init(&page->lru);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];
 
-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
 	unsigned int mask, bit, half;
@@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 		 */
 		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 		mask >>= 24;
-		if (mask & 0x03U)
+		if ((mask & 0x03U) && !PageActive(page)) {
+			/*
+			 * Other half is allocated, and neither half has had
+			 * its free deferred: add page to head of list, to make
+			 * this freed half available for immediate reuse.
+			 */
 			list_add(&page->lru, &mm->context.pgtable_list);
-		else
-			list_del(&page->lru);
+		} else {
+			/* If page is on list, now remove it. */
+			list_del_init(&page->lru);
+		}
 		spin_unlock_bh(&mm->context.lock);
 		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
 		mask >>= 24;
@@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
 void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
@@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
 	 */
 	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
 	mask >>= 24;
-	if (mask & 0x03U)
+	if ((mask & 0x03U) && !PageActive(page)) {
+		/*
+		 * Other half is allocated, and neither half has had
+		 * its free deferred: add page to end of list, to make
+		 * this freed half available for reuse once its pending
+		 * bit has been cleared by __tlb_remove_table().
+		 */
 		list_add_tail(&page->lru, &mm->context.pgtable_list);
-	else
-		list_del(&page->lru);
+	} else {
+		/* If page is on list, now remove it. */
+		list_del_init(&page->lru);
+	}
 	spin_unlock_bh(&mm->context.lock);
 	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
 	tlb_remove_table(tlb, table);
@@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
 	}
 
 	page_table_release_check(page, table, half, mask);
-	pgtable_pte_page_dtor(page);
-	__free_page(page);
+	if (TestClearPageActive(page))
+		call_rcu(&page->rcu_head, pte_free_now);
+	else
+		pte_free_now(&page->rcu_head);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 /*
  * Base infrastructure required to generate basic asces, region, segment,
  * and page tables that do not make use of enhanced features like EDAT1.
-- 
2.35.3



* [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (6 preceding siblings ...)
  2023-07-12  4:38 ` [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
@ 2023-07-12  4:39 ` Hugh Dickins
  2023-07-12  4:41 ` [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).
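
The intended caller-side usage, as seen later in this series (09/13 and
10/13), is just to hand over the detached page table once its pmd entry
has been cleared under the page table locks - sketched:

	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	/* ... unlock, mmu_notifier_invalidate_range_end(), etc. ... */
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));

The page table is then freed only after an RCU grace period, so lockless
walkers still inside rcu_read_lock() continue to see a valid page table
until they pte_unmap().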

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm_types.h |  4 ++++
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c     | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			/*
+			 * A PTE page table page might be freed by use of
+			 * rcu_head: which overlays those two fields above.
+			 */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b9a0c2137cc1..fa9d4d084291 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = pgtable;
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3



* [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (7 preceding siblings ...)
  2023-07-12  4:39 ` [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
@ 2023-07-12  4:41 ` Hugh Dickins
  2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().
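
The reader-side contract being relied on here, in sketch form (kernel
context, existing interfaces only): pte_offset_map_lock() may now fail
once the pmd has been retracted, but a table already mapped remains a
good page table until the unlock.

	pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	if (!pte)
		return false;	/* pmd changed: no page table to examine */
	/* the page table cannot be freed or reused while ptl is held */
	/* ... scan or modify the entries ... */
	pte_unmap_unlock(pte, ptl);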

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
on PAE configs where pmdp_collapse_flush() did not already do so: to make
sure that the start, pmdp_get_lockless(), end sequence in
__pte_offset_map() cannot pick up a pmd entry with mismatched pmd_low and
pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement
does raise some more questions: leave it until a later release.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 184 ++++++++++++++++++++------------------------------
 1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..3bb05147961b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		break;
 	case SCAN_PMD_NONE:
 		/*
-		 * In MADV_COLLAPSE path, possible race with khugepaged where
-		 * all pte entries have been removed and pmd cleared.  If so,
-		 * skip all the pte checks and just update the pmd mapping.
+		 * All pte entries have been removed and pmd cleared.
+		 * Skip all the pte checks and just update the pmd mapping.
 		 */
 		goto maybe_install_pmd;
 	default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 	mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mmu_notifier_range range;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
+		bool skipped_uffd = false;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
-		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl. It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably require locking
-		 * the anon_vma.
+		 * got written to. These VMAs are likely not worth removing
+		 * page tables from, as PMD-mapping is likely to be split later.
 		 */
-		if (READ_ONCE(vma->anon_vma)) {
-			result = SCAN_PAGE_ANON;
-			goto next;
-		}
+		if (READ_ONCE(vma->anon_vma))
+			continue;
+
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
-		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
-			result = SCAN_VMA_CHECK;
-			goto next;
-		}
-		mm = vma->vm_mm;
-		is_target = mm == target_mm && addr == target_addr;
-		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-		if (result != SCAN_SUCCEED)
-			goto next;
-		/*
-		 * We need exclusive mmap_lock to retract page table.
-		 *
-		 * We use trylock due to lock inversion: we need to acquire
-		 * mmap_lock while holding page lock. Fault path does it in
-		 * reverse order. Trylock is a way to avoid deadlock.
-		 *
-		 * Also, it's not MADV_COLLAPSE's job to collapse other
-		 * mappings - let khugepaged take care of them later.
-		 */
-		result = SCAN_PTE_MAPPED_HUGEPAGE;
-		if ((cc->is_khugepaged || is_target) &&
-		    mmap_write_trylock(mm)) {
-			/* trylock for the same lock inversion as above */
-			if (!vma_try_start_write(vma))
-				goto unlock_next;
-
-			/*
-			 * Re-check whether we have an ->anon_vma, because
-			 * collapse_and_free_pmd() requires that either no
-			 * ->anon_vma exists or the anon_vma is locked.
-			 * We already checked ->anon_vma above, but that check
-			 * is racy because ->anon_vma can be populated under the
-			 * mmap lock in read mode.
-			 */
-			if (vma->anon_vma) {
-				result = SCAN_PAGE_ANON;
-				goto unlock_next;
-			}
-			/*
-			 * When a vma is registered with uffd-wp, we can't
-			 * recycle the pmd pgtable because there can be pte
-			 * markers installed.  Skip it only, so the rest mm/vma
-			 * can still have the same file mapped hugely, however
-			 * it'll always mapped in small page size for uffd-wp
-			 * registered ranges.
-			 */
-			if (hpage_collapse_test_exit(mm)) {
-				result = SCAN_ANY_PROCESS;
-				goto unlock_next;
-			}
-			if (userfaultfd_wp(vma)) {
-				result = SCAN_PTE_UFFD_WP;
-				goto unlock_next;
-			}
-			collapse_and_free_pmd(mm, vma, addr, pmd);
-			if (!cc->is_khugepaged && is_target)
-				result = set_huge_pmd(vma, addr, pmd, hpage);
-			else
-				result = SCAN_SUCCEED;
-
-unlock_next:
-			mmap_write_unlock(mm);
-			goto next;
-		}
-		/*
-		 * Calling context will handle target mm/addr. Otherwise, let
-		 * khugepaged try again later.
-		 */
-		if (!is_target) {
-			khugepaged_add_pte_mapped_thp(mm, addr);
+		    vma->vm_end < addr + HPAGE_PMD_SIZE)
 			continue;
+
+		mm = vma->vm_mm;
+		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
+			continue;
+
+		if (hpage_collapse_test_exit(mm))
+			continue;
+		/*
+		 * When a vma is registered with uffd-wp, we cannot recycle
+		 * the page table because there may be pte markers installed.
+		 * Other vmas can still have the same file mapped hugely, but
+		 * skip this one: it will always be mapped in small page size
+		 * for uffd-wp registered ranges.
+		 */
+		if (userfaultfd_wp(vma))
+			continue;
+
+		/* PTEs were notified when unmapped; but now for the PMD? */
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+					addr, addr + HPAGE_PMD_SIZE);
+		mmu_notifier_invalidate_range_start(&range);
+
+		pml = pmd_lock(mm, pmd);
+		ptl = pte_lockptr(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+		/*
+		 * Huge page lock is still held, so normally the page table
+		 * must remain empty; and we have already skipped anon_vma
+		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
+		 * held, it is still possible for a racing userfaultfd_ioctl()
+		 * to have inserted ptes or markers.  Now that we hold ptlock,
+		 * repeating the anon_vma check protects from one category,
+		 * and repeating the userfaultfd_wp() check from another.
+		 */
+		if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
+			skipped_uffd = true;
+		} else {
+			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
+			pmdp_get_lockless_sync();
+		}
+
+		if (ptl != pml)
+			spin_unlock(ptl);
+		spin_unlock(pml);
+
+		mmu_notifier_invalidate_range_end(&range);
+
+		if (!skipped_uffd) {
+			mm_dec_nr_ptes(mm);
+			page_table_check_pte_clear_range(mm, addr, pgt_pmd);
+			pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 		}
-next:
-		if (is_target)
-			target_result = result;
 	}
-	i_mmap_unlock_write(mapping);
-	return target_result;
+	i_mmap_unlock_read(mapping);
 }
 
 /**
@@ -2259,9 +2223,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
+	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
 	 */
-	result = retract_page_tables(mapping, start, mm, addr, hpage,
-				     cc);
+	retract_page_tables(mapping, start);
+	if (cc && !cc->is_khugepaged)
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
 	unlock_page(hpage);
 
 	/*
-- 
2.35.3



* [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (8 preceding siblings ...)
  2023-07-12  4:41 ` [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
@ 2023-07-12  4:42 ` Hugh Dickins
  2023-07-23 22:32   ` [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix Hugh Dickins
                     ` (2 more replies)
  2023-07-12  4:43 ` [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
                   ` (2 subsequent siblings)
  12 siblings, 3 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	mmap_assert_locked(vma->vm_mm);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to another
- *    page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	struct mmu_notifier_range range;
-
-	mmap_assert_write_locked(mm);
-	if (vma->vm_file)
-		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-	/*
-	 * All anon_vmas attached to the VMA have the same root and are
-	 * therefore locked by the same lock.
-	 */
-	if (vma->anon_vma)
-		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + HPAGE_PMD_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	tlb_remove_table_sync_one();
-	mmu_notifier_invalidate_range_end(&range);
-	mm_dec_nr_ptes(mm);
-	page_table_check_pte_clear_range(mm, addr, pmd);
-	pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	struct mmu_notifier_range range;
+	bool notified = false;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd;
-	spinlock_t *ptl;
-	int count = 0, result = SCAN_FAIL;
+	pmd_t *pmd, pgt_pmd;
+	spinlock_t *pml, *ptl;
+	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
-	mmap_assert_write_locked(mm);
+	mmap_assert_locked(mm);
+
+	/* First check VMA found, in case page tables are being torn down */
+	if (!vma || !vma->vm_file ||
+	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+		return SCAN_VMA_CHECK;
 
 	/* Fast check before locking page if already PMD-mapped */
 	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 	if (result == SCAN_PMD_MAPPED)
 		return result;
 
-	if (!vma || !vma->vm_file ||
-	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-		return SCAN_VMA_CHECK;
-
 	/*
 	 * If we are here, we've succeeded in replacing all the native pages
 	 * in the page cache with a single hugepage. If a mm were to fault-in
@@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
@@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
-	/* Lock the vma before taking i_mmap and page table locks */
-	vma_start_write(vma);
-
-	/*
-	 * We need to lock the mapping so that from here on, only GUP-fast and
-	 * hardware page walks can access the parts of the page tables that
-	 * we're operating on.
-	 * See collapse_and_free_pmd().
-	 */
-	i_mmap_lock_write(vma->vm_file->f_mapping);
-
-	/*
-	 * This spinlock should be unnecessary: Nobody else should be accessing
-	 * the page tables under spinlock protection here, only
-	 * lockless_pages_from_mm() and the hardware page walker can access page
-	 * tables while all the high-level locks are held in write mode.
-	 */
 	result = SCAN_FAIL;
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)
-		goto drop_immap;
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto drop_hpage;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		 */
 		if (hpage + i != page)
 			goto abort;
-		count++;
 	}
 
-	/* step 2: adjust rmap */
+	pte_unmap_unlock(start_pte, ptl);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+	notified = true;
+	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto abort;
+
+	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
 		struct page *page;
@@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 
 		if (pte_none(ptent))
 			continue;
-		page = vm_normal_page(vma, addr, ptent);
-		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+		/*
+		 * We dropped ptl after the first scan, to do the mmu_notifier:
+		 * page lock stops more PTEs of the hpage being faulted in, but
+		 * does not stop write faults COWing anon copies from existing
+		 * PTEs; and does not stop those being swapped out or migrated.
+		 */
+		if (!pte_present(ptent)) {
+			result = SCAN_PTE_NON_PRESENT;
 			goto abort;
+		}
+		page = vm_normal_page(vma, addr, ptent);
+		if (hpage + i != page)
+			goto abort;
+
+		/*
+		 * Must clear entry, or a racing truncate may re-remove it.
+		 * TLB flush can be left until pmdp_collapse_flush() does it.
+		 * PTE dirty? Shmem page is already dirty; file is read-only.
+		 */
+		pte_clear(mm, addr, pte);
 		page_remove_rmap(page, vma, false);
+		nr_ptes++;
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
-	if (count) {
-		page_ref_sub(hpage, count);
-		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	if (nr_ptes) {
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove pte entries */
-	/* we make no change to anon, but protect concurrent anon page lookup */
-	if (vma->anon_vma)
-		anon_vma_lock_write(vma->anon_vma);
+	/* step 4: remove page table */
 
-	collapse_and_free_pmd(mm, vma, haddr, pmd);
+	/* Huge page lock is still held, so page table must remain empty */
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	pmdp_get_lockless_sync();
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
 
-	if (vma->anon_vma)
-		anon_vma_unlock_write(vma->anon_vma);
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	mmu_notifier_invalidate_range_end(&range);
+
+	mm_dec_nr_ptes(mm);
+	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 
 maybe_install_pmd:
 	/* step 5: install pmd entry */
 	result = install_pmd
 			? set_huge_pmd(vma, haddr, pmd, hpage)
 			: SCAN_SUCCEED;
-
+	goto drop_hpage;
+abort:
+	if (nr_ptes) {
+		flush_tlb_mm(mm);
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+	}
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (notified)
+		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
 	return result;
-
-abort:
-	pte_unmap_unlock(start_pte, ptl);
-drop_immap:
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	goto drop_hpage;
 }
 
 static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
@@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		case SCAN_PTE_MAPPED_HUGEPAGE:
 			BUG_ON(mmap_locked);
 			BUG_ON(*prev);
-			mmap_write_lock(mm);
+			mmap_read_lock(mm);
 			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_write_unlock(mm);
+			mmap_locked = true;
 			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (9 preceding siblings ...)
  2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
@ 2023-07-12  4:43 ` Hugh Dickins
  2023-07-23 22:35   ` [PATCH v3 11/13 fix] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps(): fix Hugh Dickins
  2023-07-12  4:44 ` [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
  2023-07-12  4:46 ` [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]() Hugh Dickins
  12 siblings, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).
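
In outline, the direct call now made from khugepaged_scan_mm_slot()
(a sketch of the hunk below):

	*result = hpage_collapse_scan_file(mm,
		khugepaged_scan.address, file, pgoff, cc);
	if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
		mmap_read_lock(mm);
		mmap_locked = true;
		if (hpage_collapse_test_exit(mm)) {
			fput(file);
			goto breakouterloop;
		}
		*result = collapse_pte_mapped_thp(mm,
			khugepaged_scan.address, false);
		if (*result == SCAN_PMD_MAPPED)
			*result = SCAN_SUCCEED;
	}
	fput(file);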

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 125 +++++++-------------------------------------------
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46986eb4eebb..7c7aaddbe130 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
 	struct mm_slot slot;
-
-	/* pte-mapped THP in this mm */
-	int nr_pte_mapped_thp;
-	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- *     emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- *     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- *     (at virtual address X) and adds an entry (for X) into mm_struct A's
- *     ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- *     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- *     (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	struct khugepaged_mm_slot *mm_slot;
-	struct mm_slot *slot;
-	bool ret = false;
-
-	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-	spin_lock(&khugepaged_mm_lock);
-	slot = mm_slot_lookup(mm_slots_hash, mm);
-	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
-		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-		ret = true;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-	return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -2370,16 +2295,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 {
 	BUILD_BUG();
 }
-
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-}
-
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	return false;
-}
 #endif
 
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
@@ -2409,7 +2324,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		khugepaged_scan.mm_slot = mm_slot;
 	}
 	spin_unlock(&khugepaged_mm_lock);
-	khugepaged_collapse_pte_mapped_thps(mm_slot);
 
 	mm = slot->mm;
 	/*
@@ -2462,36 +2376,29 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 						khugepaged_scan.address);
 
 				mmap_read_unlock(mm);
-				*result = hpage_collapse_scan_file(mm,
-								   khugepaged_scan.address,
-								   file, pgoff, cc);
 				mmap_locked = false;
+				*result = hpage_collapse_scan_file(mm,
+					khugepaged_scan.address, file, pgoff, cc);
+				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
+					mmap_read_lock(mm);
+					mmap_locked = true;
+					if (hpage_collapse_test_exit(mm)) {
+						fput(file);
+						goto breakouterloop;
+					}
+					*result = collapse_pte_mapped_thp(mm,
+						khugepaged_scan.address, false);
+					if (*result == SCAN_PMD_MAPPED)
+						*result = SCAN_SUCCEED;
+				}
 				fput(file);
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-								  khugepaged_scan.address,
-								  &mmap_locked,
-								  cc);
+					khugepaged_scan.address, &mmap_locked, cc);
 			}
-			switch (*result) {
-			case SCAN_PTE_MAPPED_HUGEPAGE: {
-				pmd_t *pmd;
 
-				*result = find_pmd_or_thp_or_none(mm,
-								  khugepaged_scan.address,
-								  &pmd);
-				if (*result != SCAN_SUCCEED)
-					break;
-				if (!khugepaged_add_pte_mapped_thp(mm,
-								   khugepaged_scan.address))
-					break;
-			} fallthrough;
-			case SCAN_SUCCEED:
+			if (*result == SCAN_SUCCEED)
 				++khugepaged_pages_collapsed;
-				break;
-			default:
-				break;
-			}
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (10 preceding siblings ...)
  2023-07-12  4:43 ` [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
@ 2023-07-12  4:44 ` Hugh Dickins
  2023-07-12  4:48   ` [PATCH mm " Hugh Dickins
  2023-07-12  4:46 ` [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]() Hugh Dickins
  12 siblings, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..b7b45be616ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]()
  2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
                   ` (11 preceding siblings ...)
  2023-07-12  4:44 ` [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
@ 2023-07-12  4:46 ` Hugh Dickins
  12 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/pgtable-generic.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index fa9d4d084291..4fcd959dcc4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+/*
+ * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
+ * __pte_offset_map_lock() below, is usually called with the pmd pointer for
+ * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
+ * holding mmap_lock or vma lock for read or for write; or in truncate or rmap
+ * context, while holding file's i_mmap_lock or anon_vma lock for read (or for
+ * write). In a few cases, it may be used with pmd pointing to a pmd_t already
+ * copied to or constructed on the stack.
+ *
+ * When successful, it returns the pte pointer for addr, with its page table
+ * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
+ * modification by software, with a pointer to that spinlock in ptlp (in some
+ * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
+ * struct page).  pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
+ *
+ * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
+ * page table at *pmd: if, for example, the page table has just been removed,
+ * or replaced by the huge pmd of a THP.  (When successful, *pmd is rechecked
+ * after acquiring the ptlock, and retried internally if it changed: so that a
+ * page table can be safely removed or replaced by THP while holding its lock.)
+ *
+ * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
+ * just returns the pte pointer for addr, its page table kmapped if necessary;
+ * or NULL if there is no page table at *pmd.  It does not attempt to lock the
+ * page table, so cannot normally be used when the page table is to be updated,
+ * or when entries read must be stable.  But it does take rcu_read_lock(): so
+ * that even when page table is racily removed, it remains a valid though empty
+ * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
+ * afterwards.
+ *
+ * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
+ * pointer for the page table that it returns.  In principle, the caller should
+ * recheck *pmd once the lock is taken; in practice, no callsite needs that -
+ * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ *
+ * Note that free_pgtables(), used after unmapping detached vmas, or when
+ * exiting the whole mm, does not take page table lock before freeing a page
+ * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
+ * pte_offset_map() and co once the vma is detached from mm or mm_users is zero.
+ */
 pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 			     unsigned long addr, spinlock_t **ptlp)
 {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread
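
As a usage sketch of the interfaces documented in the patch above (not part
of the patch; mm, pmd and addr are assumed to have been reached by walking
pgd, p4d, pud for addr, as those comments describe):

	spinlock_t *ptl;
	pte_t *pte, entry;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (pte) {
		/* *pmd was rechecked under ptl: the table cannot go away now */
		entry = ptep_get(pte);
		/* ... read or modify the entry for addr under ptl ... */
		pte_unmap_unlock(pte, ptl);
	}	/* NULL: no page table at *pmd - removed, or replaced by THP */

	pte = pte_offset_map(pmd, addr);	/* takes rcu_read_lock() */
	if (pte) {
		entry = ptep_get(pte);		/* may be racily stale: a hint only */
		pte_unmap(pte);			/* rcu_read_unlock() */
	}

	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);	/* ptl found, not taken */
	if (pte) {
		spin_lock(ptl);
		/* ... recheck *pmd, or rely on pte_same(), as advised above ... */
		spin_unlock(ptl);
		pte_unmap(pte);
	}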

* [PATCH mm 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
  2023-07-12  4:44 ` [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
@ 2023-07-12  4:48   ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-12  4:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
This is the version which applies to mm-unstable or linux-next.

 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struc
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -758,8 +743,6 @@ static inline bool vma_start_read(struct
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killab
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
  2023-07-12  4:38 ` [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
@ 2023-07-13  4:47   ` Alexander Gordeev
  2023-07-19 14:25   ` Claudio Imbrenda
  1 sibling, 0 replies; 34+ messages in thread
From: Alexander Gordeev @ 2023-07-13  4:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Lorenzo Stoakes, Huang Ying,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Tue, Jul 11, 2023 at 09:38:35PM -0700, Hugh Dickins wrote:
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive).  And for simplicity, delay calling RCU until
> both halves are freed.
> 
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> ---
>  arch/s390/include/asm/pgalloc.h |  4 ++
>  arch/s390/mm/pgalloc.c          | 80 +++++++++++++++++++++++++++++------
>  2 files changed, 72 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..760b4ace475e 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
>   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
>   * while the PP bits are never used, nor such a page is added to or removed
>   * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> + * But for simplicity, because page->rcu_head overlays page->lru, and because
> + * the RCU callback might not be called before the mm_context_t has been freed,
> + * pte_free_defer() in this implementation prevents both fragments from being
> + * reused, and delays making the call to RCU until both fragments are freed.
>   */
>  unsigned long *page_table_alloc(struct mm_struct *mm)
>  {
> @@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  					table += PTRS_PER_PTE;
>  				atomic_xor_bits(&page->_refcount,
>  							0x01U << (bit + 24));
> -				list_del(&page->lru);
> +				list_del_init(&page->lru);
>  			}
>  		}
>  		spin_unlock_bh(&mm->context.lock);
> @@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>  	table = (unsigned long *) page_to_virt(page);
>  	if (mm_alloc_pgste(mm)) {
>  		/* Return 4K page table with PGSTEs */
> +		INIT_LIST_HEAD(&page->lru);
>  		atomic_xor_bits(&page->_refcount, 0x03U << 24);
>  		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
>  		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
> @@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
>  {
>  	char msg[128];
>  
> -	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
> +	if (!IS_ENABLED(CONFIG_DEBUG_VM))
> +		return;
> +	if (!mask && list_empty(&page->lru))
>  		return;
>  	snprintf(msg, sizeof(msg),
>  		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
> @@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
>  	dump_page(page, msg);
>  }
>  
> +static void pte_free_now(struct rcu_head *head)
> +{
> +	struct page *page;
> +
> +	page = container_of(head, struct page, rcu_head);
> +	pgtable_pte_page_dtor(page);
> +	__free_page(page);
> +}
> +
>  void page_table_free(struct mm_struct *mm, unsigned long *table)
>  {
>  	unsigned int mask, bit, half;
> @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  		 */
>  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  		mask >>= 24;
> -		if (mask & 0x03U)
> +		if ((mask & 0x03U) && !PageActive(page)) {
> +			/*
> +			 * Other half is allocated, and neither half has had
> +			 * its free deferred: add page to head of list, to make
> +			 * this freed half available for immediate reuse.
> +			 */
>  			list_add(&page->lru, &mm->context.pgtable_list);
> -		else
> -			list_del(&page->lru);
> +		} else {
> +			/* If page is on list, now remove it. */
> +			list_del_init(&page->lru);
> +		}
>  		spin_unlock_bh(&mm->context.lock);
>  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
>  		mask >>= 24;
> @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
>  	}
>  
>  	page_table_release_check(page, table, half, mask);
> -	pgtable_pte_page_dtor(page);
> -	__free_page(page);
> +	if (TestClearPageActive(page))
> +		call_rcu(&page->rcu_head, pte_free_now);
> +	else
> +		pte_free_now(&page->rcu_head);
>  }
>  
>  void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
> @@ -370,10 +400,18 @@ void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table,
>  	 */
>  	mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
>  	mask >>= 24;
> -	if (mask & 0x03U)
> +	if ((mask & 0x03U) && !PageActive(page)) {
> +		/*
> +		 * Other half is allocated, and neither half has had
> +		 * its free deferred: add page to end of list, to make
> +		 * this freed half available for reuse once its pending
> +		 * bit has been cleared by __tlb_remove_table().
> +		 */
>  		list_add_tail(&page->lru, &mm->context.pgtable_list);
> -	else
> -		list_del(&page->lru);
> +	} else {
> +		/* If page is on list, now remove it. */
> +		list_del_init(&page->lru);
> +	}
>  	spin_unlock_bh(&mm->context.lock);
>  	table = (unsigned long *) ((unsigned long) table | (0x01U << bit));
>  	tlb_remove_table(tlb, table);
> @@ -403,10 +441,28 @@ void __tlb_remove_table(void *_table)
>  	}
>  
>  	page_table_release_check(page, table, half, mask);
> -	pgtable_pte_page_dtor(page);
> -	__free_page(page);
> +	if (TestClearPageActive(page))
> +		call_rcu(&page->rcu_head, pte_free_now);
> +	else
> +		pte_free_now(&page->rcu_head);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	SetPageActive(page);
> +	page_table_free(mm, (unsigned long *)pgtable);
> +	/*
> +	 * page_table_free() does not do the pgste gmap_unlink() which
> +	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
> +	 */
> +	WARN_ON_ONCE(mm_alloc_pgste(mm));
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.

Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
  2023-07-12  4:34 ` [PATCH v3 04/13] powerpc: assert_pte_locked() " Hugh Dickins
@ 2023-07-18 10:41   ` Aneesh Kumar K.V
  2023-07-19  5:04     ` Hugh Dickins
  0 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-18 10:41 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

Hugh Dickins <hughd@google.com> writes:

> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
> stricter than the previous implementation, which skipped when pmd_none()
> (with a comment on khugepaged collapse transitions): but wouldn't we want
> to know, if an assert_pte_locked() caller can be racing such transitions?
>

The reason we had that pmd_none check there was to handle khugepaged. In
case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
ppc64 had the assert_pte_locked check inside that ptep_clear.

_pmd = pmdp_collapse_flush(vma, address, pmd);
..
ptep_clear()
-> assert_pte_locked()
---> pmd_none
-----> BUG


The problem is how assert_pte_locked() verifies whether we are holding
ptl. It does that by walking the page table again, and in this specific
case, by the time we call the function we have already cleared the pmd.
>
> This mod might cause new crashes: which either expose my ignorance, or
> indicate issues to be fixed, or limit the usage of assert_pte_locked().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  arch/powerpc/mm/pgtable.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index cb2dcdb18f8e..16b061af86d7 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  	p4d_t *p4d;
>  	pud_t *pud;
>  	pmd_t *pmd;
> +	pte_t *pte;
> +	spinlock_t *ptl;
>  
>  	if (mm == &init_mm)
>  		return;
> @@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  	pud = pud_offset(p4d, addr);
>  	BUG_ON(pud_none(*pud));
>  	pmd = pmd_offset(pud, addr);
> -	/*
> -	 * khugepaged to collapse normal pages to hugepage, first set
> -	 * pmd to none to force page fault/gup to take mmap_lock. After
> -	 * pmd is set to none, we do a pte_clear which does this assertion
> -	 * so if we find pmd none, return.
> -	 */
> -	if (pmd_none(*pmd))
> -		return;
> -	BUG_ON(!pmd_present(*pmd));
> -	assert_spin_locked(pte_lockptr(mm, pmd));
> +	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> +	BUG_ON(!pte);
> +	assert_spin_locked(ptl);
> +	pte_unmap(pte);
>  }
>  #endif /* CONFIG_DEBUG_VM */
>  
> -- 
> 2.35.3

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
  2023-07-18 10:41   ` Aneesh Kumar K.V
@ 2023-07-19  5:04     ` Hugh Dickins
  2023-07-19  5:24       ` Aneesh Kumar K V
  0 siblings, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-07-19  5:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Gerald Schaefer,
	Vasily Gorbik, Jann Horn, Vishal Moola, Vlastimil Babka, Zi Yan,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
> Hugh Dickins <hughd@google.com> writes:
> 
> > Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> > in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
> > stricter than the previous implementation, which skipped when pmd_none()
> > (with a comment on khugepaged collapse transitions): but wouldn't we want
> > to know, if an assert_pte_locked() caller can be racing such transitions?
> >
> 
> The reason we had that pmd_none check there was to handle khugepaged. In
> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
> ppc64 had the assert_pte_locked check inside that ptep_clear.
> 
> _pmd = pmdp_collapse_flush(vma, address, pmd);
> ..
> ptep_clear()
> -> assert_pte_locked()
> ---> pmd_none
> -----> BUG
> 
> 
> The problem is how assert_pte_locked() verifies whether we are holding
> ptl. It does that by walking the page table again, and in this specific
> case, by the time we call the function we have already cleared the pmd.

Aneesh, please clarify, I've spent hours on this.

From all your use of past tense ("had"), I thought you were Acking my
patch; but now, after looking again at v3.11 source and today's,
I think you are NAKing my patch in its present form.

You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
*pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
Is that your point?

I can easily restore that khugepaged comment (which had appeared to me
out of date at the time, but now looks still relevant) and pmd_none(*pmd)
check: but please clarify.

Thanks,
Hugh

> >
> > This mod might cause new crashes: which either expose my ignorance, or
> > indicate issues to be fixed, or limit the usage of assert_pte_locked().
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  arch/powerpc/mm/pgtable.c | 16 ++++++----------
> >  1 file changed, 6 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> > index cb2dcdb18f8e..16b061af86d7 100644
> > --- a/arch/powerpc/mm/pgtable.c
> > +++ b/arch/powerpc/mm/pgtable.c
> > @@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> >  	p4d_t *p4d;
> >  	pud_t *pud;
> >  	pmd_t *pmd;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> >  
> >  	if (mm == &init_mm)
> >  		return;
> > @@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
> >  	pud = pud_offset(p4d, addr);
> >  	BUG_ON(pud_none(*pud));
> >  	pmd = pmd_offset(pud, addr);
> > -	/*
> > -	 * khugepaged to collapse normal pages to hugepage, first set
> > -	 * pmd to none to force page fault/gup to take mmap_lock. After
> > -	 * pmd is set to none, we do a pte_clear which does this assertion
> > -	 * so if we find pmd none, return.
> > -	 */
> > -	if (pmd_none(*pmd))
> > -		return;
> > -	BUG_ON(!pmd_present(*pmd));
> > -	assert_spin_locked(pte_lockptr(mm, pmd));
> > +	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> > +	BUG_ON(!pte);
> > +	assert_spin_locked(ptl);
> > +	pte_unmap(pte);
> >  }
> >  #endif /* CONFIG_DEBUG_VM */
> >  
> > -- 
> > 2.35.3

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
  2023-07-19  5:04     ` Hugh Dickins
@ 2023-07-19  5:24       ` Aneesh Kumar K V
  2023-07-21 13:13         ` Jay Patel
  0 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K V @ 2023-07-19  5:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Lorenzo Stoakes, Huang Ying,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On 7/19/23 10:34 AM, Hugh Dickins wrote:
> On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
>> Hugh Dickins <hughd@google.com> writes:
>>
>>> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
>>> in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
>>> stricter than the previous implementation, which skipped when pmd_none()
>>> (with a comment on khugepaged collapse transitions): but wouldn't we want
>>> to know, if an assert_pte_locked() caller can be racing such transitions?
>>>
>>
>> The reason we had that pmd_none check there was to handle khugepaged. In
>> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
>> ppc64 had the assert_pte_locked check inside that ptep_clear.
>>
>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>> ..
>> ptep_clear()
>> -> assert_pte_locked()
>> ---> pmd_none
>> -----> BUG
>>
>>
>> The problem is how assert_pte_locked() verifies whether we are holding
>> ptl. It does that by walking the page table again, and in this specific
>> case, by the time we call the function we have already cleared the pmd.
> 
> Aneesh, please clarify, I've spent hours on this.
> 
> From all your use of past tense ("had"), I thought you were Acking my
> patch; but now, after looking again at v3.11 source and today's,
> I think you are NAKing my patch in its present form.
> 

Sorry for the confusion my reply created. 

> You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
> uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
> *pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
> Is that your point?
> 

Yes. I haven't tested this yet to verify that it is indeed hitting that BUG.
But a code inspection tells me we will hit that BUG on powerpc because of
the above details.

> I can easily restore that khugepaged comment (which had appeared to me
> out of date at the time, but now looks still relevant) and pmd_none(*pmd)
> check: but please clarify.
> 

That is correct. If we add that pmd_none check back we should be good here.


-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread
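
For reference, restoring that check on top of the patch would look roughly
like this (a sketch only; the actual fix is posted later in this thread as
"[PATCH v3 04/13 fix]"):

	pmd = pmd_offset(pud, addr);
	/*
	 * khugepaged collapse clears *pmd (pmdp_collapse_flush()) before the
	 * ptep_clear() which ends up here: so, as the old code did, return
	 * quietly rather than BUG when the pmd is none.
	 */
	if (pmd_none(*pmd))
		return;
	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
	BUG_ON(!pte);
	assert_spin_locked(ptl);
	pte_unmap(pte);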

* Re: [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
  2023-07-12  4:38 ` [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
  2023-07-13  4:47   ` Alexander Gordeev
@ 2023-07-19 14:25   ` Claudio Imbrenda
  2023-07-23 22:29     ` [PATCH v3 07/13 fix] s390: add pte_free_defer() for pgtables sharing page: fix Hugh Dickins
  1 sibling, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2023-07-19 14:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Lorenzo Stoakes, Huang Ying,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Tue, 11 Jul 2023 21:38:35 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

[...]

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> +	struct page *page;
> +
> +	page = virt_to_page(pgtable);
> +	SetPageActive(page);
> +	page_table_free(mm, (unsigned long *)pgtable);
> +	/*
> +	 * page_table_free() does not do the pgste gmap_unlink() which
> +	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
> +	 */
> +	WARN_ON_ONCE(mm_alloc_pgste(mm));

it seems I have overlooked something when we previously discussed
this...

mm_alloc_pgste() is true for all processes that have PGSTEs, not only
for processes that can run guests.

There are two ways to enable PGSTEs: an ELF header bit, and a sysctl
knob.

The ELF bit is only used by qemu: it enables PGSTE allocation only for
that single process. This is a strong indication that the process wants
to run guests.

The sysctl knob enables PGSTE allocation for every process in the system
from that moment on. In that case, the WARN_ON_ONCE would be triggered
when not necessary.

There is however another way to check if a process is actually
__using__ the PGSTEs, i.e. whether the process is actually capable of
running guests.

Confusingly, the name of that function is mm_has_pgste(). This confused
me as well, which is why I didn't notice it when we discussed this
previously :)


in short: can you please use mm_has_pgste() instead of mm_alloc_pgste()
in the WARN_ON_ONCE ?

> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  /*
>   * Base infrastructure required to generate basic asces, region, segment,
>   * and page tables that do not make use of enhanced features like EDAT1.


^ permalink raw reply	[flat|nested] 34+ messages in thread
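
In code, the change being requested above amounts to (a sketch):

	/* warn only if the mm is actually using PGSTEs, i.e. able to run guests */
	WARN_ON_ONCE(mm_has_pgste(mm));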

* Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
  2023-07-19  5:24       ` Aneesh Kumar K V
@ 2023-07-21 13:13         ` Jay Patel
  2023-07-23 22:26           ` [PATCH v3 04/13 fix] powerpc: assert_pte_locked() use pte_offset_map_nolock(): fix Hugh Dickins
  0 siblings, 1 reply; 34+ messages in thread
From: Jay Patel @ 2023-07-21 13:13 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Hugh Dickins, Miaohe Lin, David Hildenbrand, Peter Zijlstra,
	Yang Shi, Peter Xu, linux-kernel, Song Liu, sparclinux,
	Alexander Gordeev, Claudio Imbrenda, Will Deacon, linux-s390,
	Yu Zhao, Ira Weiny, Alistair Popple, Russell King,
	Matthew Wilcox, Steven Price, Christoph Hellwig, Jason Gunthorpe,
	linux-arm-kernel, Zi Yan, Huang Ying, Axel Rasmussen,
	Gerald Schaefer, Christian Borntraeger, Thomas Hellstrom,
	Ralph Campbell, Pasha Tatashin, Vasily Gorbik,
	Anshuman Khandual, Heiko Carstens, Qi Zheng, Suren Baghdasaryan,
	Vlastimil Babka, SeongJae Park, Lorenzo Stoakes, Jann Horn,
	linux-mm, linuxppc-dev, Naoya Horiguchi, Zack Rusin,
	Vishal Moola, Minchan Kim, Kirill A. Shutemov, Andrew Morton,
	Mel Gorman, David S. Miller, Mike Rapoport, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 2359 bytes --]

On Jul 19 2023, Aneesh Kumar K V wrote:
> On 7/19/23 10:34 AM, Hugh Dickins wrote:
> > On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
> >> Hugh Dickins <hughd@google.com> writes:
> >>
> >>> Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> >>> in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
> >>> stricter than the previous implementation, which skipped when pmd_none()
> >>> (with a comment on khugepaged collapse transitions): but wouldn't we want
> >>> to know, if an assert_pte_locked() caller can be racing such transitions?
> >>>
> >>
> >> The reason we had that pmd_none check there was to handle khugepaged. In
> >> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
> >> ppc64 had the assert_pte_locked check inside that ptep_clear.
> >>
> >> _pmd = pmdp_collapse_flush(vma, address, pmd);
> >> ..
> >> ptep_clear()
> >> -> assert_pte_locked()
> >> ---> pmd_none
> >> -----> BUG
> >>
> >>
> >> The problem is how assert_pte_locked() verifies whether we are holding
> >> ptl. It does that by walking the page table again, and in this specific
> >> case, by the time we call the function we have already cleared the pmd.
> > 
> > Aneesh, please clarify, I've spent hours on this.
> > 
> > From all your use of past tense ("had"), I thought you were Acking my
> > patch; but now, after looking again at v3.11 source and today's,
> > I think you are NAKing my patch in its present form.
> > 
> 
> Sorry for the confusion my reply created. 
> 
> > You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
> > uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
> > *pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
> > Is that your point?
> > 
> 
> Yes. I haven't tested this yet to verify that it is indeed hitting that BUG.
> But a code inspection tells me we will hit that BUG on powerpc because of
> the above details.
>
Hi Aneesh,

After testing it, I can confirm that it encountered a BUG on powerpc.
Log report attached.

Thanks,
Jay Patel 
> > I can easily restore that khugepaged comment (which had appeared to me
> > out of date at the time, but now looks still relevant) and pmd_none(*pmd)
> > check: but please clarify.
> > 
> 
> That is correct. If we add that pmd_none check back we should be good here.
> 
> 
> -aneesh

[-- Attachment #2: report.txt --]
[-- Type: text/plain, Size: 3720 bytes --]

[   53.513058][  T105] ------------[ cut here ]------------
[   53.513080][  T105] kernel BUG at arch/powerpc/mm/pgtable.c:327!
[   53.513090][  T105] Oops: Exception in kernel mode, sig: 5 [#1]
[   53.513099][  T105] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[   53.513109][  T105] Modules linked in: bonding pseries_rng rng_core vmx_crypto gf128mul ibmveth crc32c_vpmsum fuse autofs4                                                                              
[   53.513135][  T105] CPU: 3 PID: 105 Comm: khugepaged Not tainted 6.5.0-rc1-gebfaf626e99f-dirty #1
[   53.513146][  T105] Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries                                                                             
[   53.513156][  T105] NIP:  c000000000079478 LR: c00000000007946c CTR: 0000000000000000
[   53.513165][  T105] REGS: c000000008e9b930 TRAP: 0700   Not tainted  (6.5.0-rc1-gebfaf626e99f-dirty)                                                                                                    
[   53.513175][  T105] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002882  XER: 20040000                                                                                               
[   53.513202][  T105] CFAR: c000000000412a30 IRQMASK: 0
[   53.513202][  T105] GPR00: c00000000007946c c000000008e9bbd0 c0000000012d3500 0000000000000001
[   53.513202][  T105] GPR04: 0000000011000000 c000000008e9bbb0 c000000008e9bbf0 ffffffffffffffff
[   53.513202][  T105] GPR08: 00000000000003ff 0000000000000000 0000000000000000 000000000000000a
[   53.513202][  T105] GPR12: 00000000497b0000 c00000001ec9d480 c00c00000016fe00 c000000051455000
[   53.513202][  T105] GPR16: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000001
[   53.513202][  T105] GPR20: c000000002912430 c000000051455000 0000000000000000 c00000000946e650
[   53.513202][  T105] GPR24: c0000000029800e8 0000000011000000 c00c000000145168 c000000002980180
[   53.513202][  T105] GPR28: 0000000011000000 8603f85b000080c0 c000000008e9bc70 c00000001bd0d680
[   53.513304][  T105] NIP [c000000000079478] assert_pte_locked.part.18+0x168/0x1b0
[   53.513318][  T105] LR [c00000000007946c] assert_pte_locked.part.18+0x15c/0x1b0
[   53.513329][  T105] Call Trace:
[   53.513335][  T105] [c000000008e9bbd0] [c00000000007946c] assert_pte_locked.part.18+0x15c/0x1b0 (unreliable)                                                                                            
[   53.513350][  T105] [c000000008e9bc00] [c00000000048e10c] collapse_huge_page+0x11dc/0x1700
[   53.513362][  T105] [c000000008e9bd40] [c00000000048ed18] hpage_collapse_scan_pmd+0x6e8/0x850
[   53.513374][  T105] [c000000008e9be30] [c000000000492544] khugepaged+0x7e4/0xb70
[   53.513386][  T105] [c000000008e9bf90] [c00000000015f038] kthread+0x138/0x140
[   53.513399][  T105] [c000000008e9bfe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
[   53.513411][  T105] Code: 7c852378 7c844c36 794a1564 7c894038 794af082 79241f24 78eaf00e 7c8a2214 48399541 60000000 7c630074 7863d182 <0b030000> e9210020 81290000 7d290074                             
[   53.513448][  T105] ---[ end trace 0000000000000000 ]---
[   53.516544][  T105]
[   53.516551][  T105] note: khugepaged[105] exited with irqs disabled
[  182.648447][ T1952] mconf[1952]: segfault (11) at 110efa38 nip 1001e97c lr 1001e8a8 code 1
[  182.648482][ T1952] mconf[1952]: code: 60420000 4bfffd59 4bffffd0 60000000 60420000 e93f0070 2fa90000 409e0014
[  182.648494][ T1952] mconf[1952]: code: 480000cc e9290000 2fa90000 419e00c0 <81490008> 2f8a0005 409effec e9490028
[  962.694079][T39811] sda2: Can't mount, would change RO state

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 04/13 fix] powerpc: assert_pte_locked() use pte_offset_map_nolock(): fix
  2023-07-21 13:13         ` Jay Patel
@ 2023-07-23 22:26           ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-23 22:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jay Patel, Aneesh Kumar K V, Hugh Dickins, Miaohe Lin,
	David Hildenbrand, Peter Zijlstra, Yang Shi, Peter Xu,
	linux-kernel, Song Liu, sparclinux, Alexander Gordeev,
	Claudio Imbrenda, Will Deacon, linux-s390, Yu Zhao, Ira Weiny,
	Alistair Popple, Russell King, Matthew Wilcox, Steven Price,
	Christoph Hellwig, Jason Gunthorpe, linux-arm-kernel, Zi Yan,
	Huang Ying, Axel Rasmussen, Gerald Schaefer,
	Christian Borntraeger, Thomas Hellstrom, Ralph Campbell,
	Pasha Tatashin, Vasily Gorbik, Anshuman Khandual, Heiko Carstens,
	Qi Zheng, Suren Baghdasaryan, Vlastimil Babka, SeongJae Park,
	Lorenzo Stoakes, Jann Horn, linux-mm, linuxppc-dev,
	Naoya Horiguchi, Zack Rusin, Vishal Moola, Minchan Kim,
	Kirill A. Shutemov, Mel Gorman, David S. Miller, Mike Rapoport,
	Mike Kravetz

Aneesh points out that assert_pte_locked() still needs the pmd_none()
check, to stop crashing in khugepaged: restore that comment and check.

Andrew, when merging with original commit, please edit its 1st para to:

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/mm/pgtable.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 16b061af86d7..a3dcdb2d5b4b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -323,6 +323,14 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
+	/*
+	 * khugepaged to collapse normal pages to hugepage, first set
+	 * pmd to none to force page fault/gup to take mmap_lock. After
+	 * pmd is set to none, we do a pte_clear which does this assertion
+	 * so if we find pmd none, return.
+	 */
+	if (pmd_none(*pmd))
+		return;
 	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 07/13 fix] s390: add pte_free_defer() for pgtables sharing page: fix
  2023-07-19 14:25   ` Claudio Imbrenda
@ 2023-07-23 22:29     ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-23 22:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Claudio Imbrenda, Hugh Dickins, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Alexander Gordeev, Gerald Schaefer,
	Vasily Gorbik, Jann Horn, Vishal Moola, Vlastimil Babka, Zi Yan,
	linux-arm-kernel, sparclinux, linuxppc-dev, linux-s390,
	linux-kernel, linux-mm

Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/mm/pgalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 760b4ace475e..d7374add7820 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -459,7 +459,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	 * page_table_free() does not do the pgste gmap_unlink() which
 	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
 	 */
-	WARN_ON_ONCE(mm_alloc_pgste(mm));
+	WARN_ON_ONCE(mm_has_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix
  2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
@ 2023-07-23 22:32   ` Hugh Dickins
  2023-08-03  9:17   ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Qi Zheng
  2023-08-14 20:36   ` [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Jann Horn
  2 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-23 22:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

madvise_collapse() setting "mmap_locked = true" after calling
collapse_pte_mapped_thp() looked good but was wrong.  If the loop then
moves on to the next extent, mmap_locked assures it that "vma" has been
revalidated under mmap_lock, which was not the case: and led to UAFs,
crashes in __fput() or task_work_run(), even collapse_file()'s
VM_BUG_ON(start & (HPAGE_PMD_NR - 1)) - all detected by syzbot.

(collapse_pte_mapped_thp() does validate the vma that it works on:
but it's not passed in as an argument, collapse_pte_mapped_thp() finds
the vma for mm and addr by itself - which may by this time have changed
from the vma saved in madvise_collapse().)
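
To illustrate the invariant (a paraphrased skeleton of madvise_collapse()'s
loop, from memory - names like hugepage_vma_revalidate() and the exact
arguments may differ in detail):

	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
		if (!mmap_locked) {
			mmap_read_lock(mm);
			mmap_locked = true;
			/* promise: vma revalidated under mmap_lock */
			result = hugepage_vma_revalidate(mm, addr, false,
							 &vma, cc);
			...
		}
		...
	}

Setting mmap_locked = true after collapse_pte_mapped_thp() made that promise
without any revalidation; the fix below just drops mmap_lock instead.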

Reported-by: syzbot+fe7b1487405295d29268@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000f9de430600ae05db@google.com/
Reported-by: syzbot+173cc8cfdfbbef6dd755@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/000000000000e4b0f0060123ca40@google.com/
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6bad69c0e4bd..1c773db26e88 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2747,7 +2747,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			BUG_ON(*prev);
 			mmap_read_lock(mm);
 			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_locked = true;
+			mmap_read_unlock(mm);
 			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 11/13 fix] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps(): fix
  2023-07-12  4:43 ` [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
@ 2023-07-23 22:35   ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-07-23 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Jann Horn,
	Vishal Moola, Vlastimil Babka, Zi Yan, linux-arm-kernel,
	sparclinux, linuxppc-dev, linux-s390, linux-kernel, linux-mm

Though not yet detected by syzbot, this commit was making the same
mistake with mmap_locked as the previous commit: fix that.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1c773db26e88..41913730db4c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2380,19 +2380,17 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				mmap_locked = false;
 				*result = hpage_collapse_scan_file(mm,
 					khugepaged_scan.address, file, pgoff, cc);
+				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
-					mmap_locked = true;
-					if (hpage_collapse_test_exit(mm)) {
-						fput(file);
+					if (hpage_collapse_test_exit(mm))
 						goto breakouterloop;
-					}
 					*result = collapse_pte_mapped_thp(mm,
 						khugepaged_scan.address, false);
 					if (*result == SCAN_PMD_MAPPED)
 						*result = SCAN_SUCCEED;
+					mmap_read_unlock(mm);
 				}
-				fput(file);
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
 					khugepaged_scan.address, &mmap_locked, cc);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
  2023-07-23 22:32   ` [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix Hugh Dickins
@ 2023-08-03  9:17   ` Qi Zheng
  2023-08-06  3:55     ` Hugh Dickins
  2023-08-06  3:59     ` [PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2 Hugh Dickins
  2023-08-14 20:36   ` [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Jann Horn
  2 siblings, 2 replies; 34+ messages in thread
From: Qi Zheng @ 2023-08-03  9:17 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, Pasha Tatashin
  Cc: Mike Kravetz, Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Yang Shi, Mel Gorman,
	Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

Hi,

On 2023/7/12 12:42, Hugh Dickins wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> Follow the pattern in retract_page_tables(); and using pte_free_defer()
> removes most of the need for tlb_remove_table_sync_one() here; but call
> pmdp_get_lockless_sync() to use it in the PAE case.
> 
> First check the VMA, in case page tables are being torn down: from JannH.
> Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
> acquired and the page looks suitable: from then on its state is stable.
> 
> However, collapse_pte_mapped_thp() was doing something others don't:
> freeing a page table still containing "valid" entries.  i_mmap lock did
> stop a racing truncate from double-freeing those pages, but we prefer
> collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
> flush can wait until the pmdp_collapse_flush() which follows, but the
> mmu_notifier_invalidate_range_start() has to be done earlier.
> 
> Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
> for khugepaged to keep on repeatedly invalidating a range which is then
> found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
> must then be more careful (after dropping ptl to do mmu_notifier), with
> abort prepared to correct the accounting like "step 3".  But with those
> entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
> safe by the huge page lock, which stops new PTEs from being faulted in.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
>   1 file changed, 77 insertions(+), 95 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3bb05147961b..46986eb4eebb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>   	return ret;
>   }
>   
> -/* hpage must be locked, and mmap_lock must be held in write */
> +/* hpage must be locked, and mmap_lock must be held */
>   static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   			pmd_t *pmdp, struct page *hpage)
>   {
> @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	};
>   
>   	VM_BUG_ON(!PageTransHuge(hpage));
> -	mmap_assert_write_locked(vma->vm_mm);
> +	mmap_assert_locked(vma->vm_mm);
>   
>   	if (do_set_pmd(&vmf, hpage))
>   		return SCAN_FAIL;
> @@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	return SCAN_SUCCEED;
>   }
>   
> -/*
> - * A note about locking:
> - * Trying to take the page table spinlocks would be useless here because those
> - * are only used to synchronize:
> - *
> - *  - modifying terminal entries (ones that point to a data page, not to another
> - *    page table)
> - *  - installing *new* non-terminal entries
> - *
> - * Instead, we need roughly the same kind of protection as free_pgtables() or
> - * mm_take_all_locks() (but only for a single VMA):
> - * The mmap lock together with this VMA's rmap locks covers all paths towards
> - * the page table entries we're messing with here, except for hardware page
> - * table walks and lockless_pages_from_mm().
> - */
> -static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> -				  unsigned long addr, pmd_t *pmdp)
> -{
> -	pmd_t pmd;
> -	struct mmu_notifier_range range;
> -
> -	mmap_assert_write_locked(mm);
> -	if (vma->vm_file)
> -		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
> -	/*
> -	 * All anon_vmas attached to the VMA have the same root and are
> -	 * therefore locked by the same lock.
> -	 */
> -	if (vma->anon_vma)
> -		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
> -
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> -				addr + HPAGE_PMD_SIZE);
> -	mmu_notifier_invalidate_range_start(&range);
> -	pmd = pmdp_collapse_flush(vma, addr, pmdp);
> -	tlb_remove_table_sync_one();
> -	mmu_notifier_invalidate_range_end(&range);
> -	mm_dec_nr_ptes(mm);
> -	page_table_check_pte_clear_range(mm, addr, pmd);
> -	pte_free(mm, pmd_pgtable(pmd));
> -}
> -
>   /**
>    * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
>    * address haddr.
> @@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
>   int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   			    bool install_pmd)
>   {
> +	struct mmu_notifier_range range;
> +	bool notified = false;
>   	unsigned long haddr = addr & HPAGE_PMD_MASK;
>   	struct vm_area_struct *vma = vma_lookup(mm, haddr);
>   	struct page *hpage;
>   	pte_t *start_pte, *pte;
> -	pmd_t *pmd;
> -	spinlock_t *ptl;
> -	int count = 0, result = SCAN_FAIL;
> +	pmd_t *pmd, pgt_pmd;
> +	spinlock_t *pml, *ptl;
> +	int nr_ptes = 0, result = SCAN_FAIL;
>   	int i;
>   
> -	mmap_assert_write_locked(mm);
> +	mmap_assert_locked(mm);
> +
> +	/* First check VMA found, in case page tables are being torn down */
> +	if (!vma || !vma->vm_file ||
> +	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +		return SCAN_VMA_CHECK;
>   
>   	/* Fast check before locking page if already PMD-mapped */
>   	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>   	if (result == SCAN_PMD_MAPPED)
>   		return result;
>   
> -	if (!vma || !vma->vm_file ||
> -	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -		return SCAN_VMA_CHECK;
> -
>   	/*
>   	 * If we are here, we've succeeded in replacing all the native pages
>   	 * in the page cache with a single hugepage. If a mm were to fault-in
> @@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		goto drop_hpage;
>   	}
>   
> +	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>   	switch (result) {
>   	case SCAN_SUCCEED:
>   		break;
> @@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		goto drop_hpage;
>   	}
>   
> -	/* Lock the vma before taking i_mmap and page table locks */
> -	vma_start_write(vma);
> -
> -	/*
> -	 * We need to lock the mapping so that from here on, only GUP-fast and
> -	 * hardware page walks can access the parts of the page tables that
> -	 * we're operating on.
> -	 * See collapse_and_free_pmd().
> -	 */
> -	i_mmap_lock_write(vma->vm_file->f_mapping);
> -
> -	/*
> -	 * This spinlock should be unnecessary: Nobody else should be accessing
> -	 * the page tables under spinlock protection here, only
> -	 * lockless_pages_from_mm() and the hardware page walker can access page
> -	 * tables while all the high-level locks are held in write mode.
> -	 */
>   	result = SCAN_FAIL;
>   	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)
> -		goto drop_immap;
> +	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> +		goto drop_hpage;
>   
>   	/* step 1: check all mapped PTEs are to the right huge page */
>   	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		 */
>   		if (hpage + i != page)
>   			goto abort;
> -		count++;
>   	}
>   
> -	/* step 2: adjust rmap */
> +	pte_unmap_unlock(start_pte, ptl);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> +				haddr, haddr + HPAGE_PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);
> +	notified = true;
> +	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> +	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> +		goto abort;
> +
> +	/* step 2: clear page table and adjust rmap */
>   	for (i = 0, addr = haddr, pte = start_pte;
>   	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>   		struct page *page;
> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   
>   		if (pte_none(ptent))
>   			continue;
> -		page = vm_normal_page(vma, addr, ptent);
> -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> +		/*
> +		 * We dropped ptl after the first scan, to do the mmu_notifier:
> +		 * page lock stops more PTEs of the hpage being faulted in, but
> +		 * does not stop write faults COWing anon copies from existing
> +		 * PTEs; and does not stop those being swapped out or migrated.
> +		 */
> +		if (!pte_present(ptent)) {
> +			result = SCAN_PTE_NON_PRESENT;
>   			goto abort;
> +		}
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (hpage + i != page)
> +			goto abort;
> +
> +		/*
> +		 * Must clear entry, or a racing truncate may re-remove it.
> +		 * TLB flush can be left until pmdp_collapse_flush() does it.
> +		 * PTE dirty? Shmem page is already dirty; file is read-only.
> +		 */
> +		pte_clear(mm, addr, pte);

This is not a non-present PTE entry, so we should call ptep_clear() to let
page_table_check track the PTE clearing operation, right? Otherwise it
may lead to false positives?

Thanks,
Qi

>   		page_remove_rmap(page, vma, false);
> +		nr_ptes++;
>   	}
>   
>   	pte_unmap_unlock(start_pte, ptl);
>   
>   	/* step 3: set proper refcount and mm_counters. */
> -	if (count) {
> -		page_ref_sub(hpage, count);
> -		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> +	if (nr_ptes) {
> +		page_ref_sub(hpage, nr_ptes);
> +		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
>   	}
>   
> -	/* step 4: remove pte entries */
> -	/* we make no change to anon, but protect concurrent anon page lookup */
> -	if (vma->anon_vma)
> -		anon_vma_lock_write(vma->anon_vma);
> +	/* step 4: remove page table */
>   
> -	collapse_and_free_pmd(mm, vma, haddr, pmd);
> +	/* Huge page lock is still held, so page table must remain empty */
> +	pml = pmd_lock(mm, pmd);
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
> +	pmdp_get_lockless_sync();
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
>   
> -	if (vma->anon_vma)
> -		anon_vma_unlock_write(vma->anon_vma);
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	mm_dec_nr_ptes(mm);
> +	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
> +	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>   
>   maybe_install_pmd:
>   	/* step 5: install pmd entry */
>   	result = install_pmd
>   			? set_huge_pmd(vma, haddr, pmd, hpage)
>   			: SCAN_SUCCEED;
> -
> +	goto drop_hpage;
> +abort:
> +	if (nr_ptes) {
> +		flush_tlb_mm(mm);
> +		page_ref_sub(hpage, nr_ptes);
> +		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +	}
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (notified)
> +		mmu_notifier_invalidate_range_end(&range);
>   drop_hpage:
>   	unlock_page(hpage);
>   	put_page(hpage);
>   	return result;
> -
> -abort:
> -	pte_unmap_unlock(start_pte, ptl);
> -drop_immap:
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> -	goto drop_hpage;
>   }
>   
>   static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> @@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   		case SCAN_PTE_MAPPED_HUGEPAGE:
>   			BUG_ON(mmap_locked);
>   			BUG_ON(*prev);
> -			mmap_write_lock(mm);
> +			mmap_read_lock(mm);
>   			result = collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_write_unlock(mm);
> +			mmap_locked = true;
>   			goto handle_result;
>   		/* Whitelisted set of results where continuing OK */
>   		case SCAN_PMD_NULL:

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-03  9:17   ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Qi Zheng
@ 2023-08-06  3:55     ` Hugh Dickins
  2023-08-07  2:21       ` Qi Zheng
  2023-08-06  3:59     ` [PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2 Hugh Dickins
  1 sibling, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-08-06  3:55 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Hugh Dickins, Andrew Morton, Pasha Tatashin, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Yang Shi, Mel Gorman,
	Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

On Thu, 3 Aug 2023, Qi Zheng wrote:
> On 2023/7/12 12:42, Hugh Dickins wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
...
> > @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >   
> >     if (pte_none(ptent))
> >   			continue;
> > -		page = vm_normal_page(vma, addr, ptent);
> > -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > +		/*
> > +		 * We dropped ptl after the first scan, to do the mmu_notifier:
> > +		 * page lock stops more PTEs of the hpage being faulted in, but
> > +		 * does not stop write faults COWing anon copies from existing
> > +		 * PTEs; and does not stop those being swapped out or migrated.
> > +		 */
> > +		if (!pte_present(ptent)) {
> > +			result = SCAN_PTE_NON_PRESENT;
> >   			goto abort;
> > +		}
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (hpage + i != page)
> > +			goto abort;
> > +
> > +		/*
> > +		 * Must clear entry, or a racing truncate may re-remove it.
> > +		 * TLB flush can be left until pmdp_collapse_flush() does it.
> > +		 * PTE dirty? Shmem page is already dirty; file is read-only.
> > +		 */
> > +		pte_clear(mm, addr, pte);
> 
> This is not a non-present PTE entry, so we should call ptep_clear() to let
> page_table_check track the PTE clearing operation, right? Otherwise it
> may lead to false positives?

You are right: thanks a lot for catching that: fix patch follows.
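
For anyone following along, the difference in a nutshell (a paraphrase of
the generic helper in include/linux/pgtable.h - the exact arguments of the
page_table_check hook vary between kernel versions):

static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep)
{
	/* record clearing of a present PTE for CONFIG_PAGE_TABLE_CHECK */
	page_table_check_pte_clear(mm, addr, ptep_get(ptep));
	pte_clear(mm, addr, ptep);
}

pte_clear() alone just zaps the entry, so page_table_check's counters would
be left thinking the page is still mapped.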

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2
  2023-08-03  9:17   ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Qi Zheng
  2023-08-06  3:55     ` Hugh Dickins
@ 2023-08-06  3:59     ` Hugh Dickins
  1 sibling, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-08-06  3:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Qi Zheng, Hugh Dickins, Pasha Tatashin, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Suren Baghdasaryan, Yang Shi, Mel Gorman,
	Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm

Use ptep_clear() instead of pte_clear(): when CONFIG_PAGE_TABLE_CHECK=y,
ptep_clear() adds some accounting, missing which would cause a BUG later.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Closes: https://lore.kernel.org/linux-mm/0df84f9f-e9b0-80b1-4c9e-95abc1a73a96@bytedance.com/
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bb76a5d454de..78fc1a24a1cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1603,7 +1603,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		 * TLB flush can be left until pmdp_collapse_flush() does it.
 		 * PTE dirty? Shmem page is already dirty; file is read-only.
 		 */
-		pte_clear(mm, addr, pte);
+		ptep_clear(mm, addr, pte);
 		page_remove_rmap(page, vma, false);
 		nr_ptes++;
 	}
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-06  3:55     ` Hugh Dickins
@ 2023-08-07  2:21       ` Qi Zheng
  0 siblings, 0 replies; 34+ messages in thread
From: Qi Zheng @ 2023-08-07  2:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Pasha Tatashin, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
	Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Jann Horn, Vishal Moola,
	Vlastimil Babka, Zi Yan, linux-arm-kernel, sparclinux,
	linuxppc-dev, linux-s390, linux-kernel, linux-mm



On 2023/8/6 11:55, Hugh Dickins wrote:
> On Thu, 3 Aug 2023, Qi Zheng wrote:
>> On 2023/7/12 12:42, Hugh Dickins wrote:
>>> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
>>> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
>>> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
>>> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> ...
>>> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>    
>>>      if (pte_none(ptent))
>>>    			continue;
>>> -		page = vm_normal_page(vma, addr, ptent);
>>> -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
>>> +		/*
>>> +		 * We dropped ptl after the first scan, to do the mmu_notifier:
>>> +		 * page lock stops more PTEs of the hpage being faulted in, but
>>> +		 * does not stop write faults COWing anon copies from existing
>>> +		 * PTEs; and does not stop those being swapped out or migrated.
>>> +		 */
>>> +		if (!pte_present(ptent)) {
>>> +			result = SCAN_PTE_NON_PRESENT;
>>>    			goto abort;
>>> +		}
>>> +		page = vm_normal_page(vma, addr, ptent);
>>> +		if (hpage + i != page)
>>> +			goto abort;
>>> +
>>> +		/*
>>> +		 * Must clear entry, or a racing truncate may re-remove it.
>>> +		 * TLB flush can be left until pmdp_collapse_flush() does it.
>>> +		 * PTE dirty? Shmem page is already dirty; file is read-only.
>>> +		 */
>>> +		pte_clear(mm, addr, pte);
>>
>> This is not a non-present PTE entry, so we should call ptep_clear() to let
>> page_table_check track the PTE clearing operation, right? Otherwise it
>> may lead to false positives?
> 
> You are right: thanks a lot for catching that: fix patch follows.

With fix patch:

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks.

> 
> Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
  2023-07-23 22:32   ` [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix Hugh Dickins
  2023-08-03  9:17   ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Qi Zheng
@ 2023-08-14 20:36   ` Jann Horn
  2023-08-15  6:34     ` Hugh Dickins
  2023-08-21 19:48     ` Hugh Dickins
  2 siblings, 2 replies; 34+ messages in thread
From: Jann Horn @ 2023-08-14 20:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
	Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon,
	Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Lorenzo Stoakes, Huang Ying,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, David S. Miller, Michael Ellerman,
	Aneesh Kumar K.V, Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Gerald Schaefer,
	Vasily Gorbik, Vishal Moola, Vlastimil Babka, Zi Yan, Linux ARM,
	sparclinux, linuxppc-dev, linux-s390, kernel list, Linux-MM

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

We can still have a racing userfaultfd operation at the "/* step 4:
remove page table */" point that installs a new PTE before the page
table is removed.
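
In terms of the reworked collapse_pte_mapped_thp() hunks quoted earlier in
this thread, the window is after steps 2/3 drop the pte lock and before
step 4 takes pmd_lock (schematic only, not a literal extract):

	pte_unmap_unlock(start_pte, ptl);
	/* step 3: set proper refcount and mm_counters */
				/* <-- a concurrent UFFDIO_COPY can take the
				 *     pte lock here and install a fresh anon
				 *     PTE into the "empty" page table */
	/* step 4: remove page table */
	pml = pmd_lock(mm, pmd);
	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));	/* new PTE goes with it */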

To reproduce, patch a delay into the kernel like this:


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9a6e0d507759..27cc8dfbf3a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -20,6 +20,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/ksm.h>
+#include <linux/delay.h>

 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
        }

        /* step 4: remove page table */
+       if (strcmp(current->comm, "DELAYME") == 0) {
+               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
+               mdelay(5000);
+               pr_warn("%s: END DELAY INJECTION\n", __func__);
+       }

        /* Huge page lock is still held, so page table must remain empty */
        pml = pmd_lock(mm, pmd);


And then run the attached reproducer against mm/mm-everything. You
should get this in dmesg:

[  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
type:MM_ANONPAGES val:1

[-- Attachment #2: khugepaged-vs-uffd.c --]
[-- Type: text/x-csrc, Size: 2942 bytes --]

// compile with "gcc -o khugepaged-vs-uffd khugepaged-vs-uffd.c -pthread"
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <sys/prctl.h>
#include <sys/mount.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void write_file(char *name, char *buf) {
  int fd = SYSCHK(open(name, O_WRONLY));
  if (write(fd, buf, strlen(buf)) != strlen(buf))
    err(1, "write %s", name);
  close(fd);
}

static void write_map(char *name, int outer_id) {
  char buf[100];
  sprintf(buf, "0 %d 1", outer_id);
  write_file(name, buf);
}

static void *thread_fn(void *dummy) {
  system("head -n50 /proc/$PPID/smaps;echo;echo");
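  // Renaming to "DELAYME" makes the mdelay() injected into
  // collapse_pte_mapped_thp() by the patch above fire during MADV_COLLAPSE,
  // between clearing the PTEs and freeing the page table.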
  SYSCHK(prctl(PR_SET_NAME, "DELAYME"));
  SYSCHK(madvise((void*)0x200000UL, 0x200000, MADV_COLLAPSE));
  SYSCHK(prctl(PR_SET_NAME, "thread"));
  system("head -n50 /proc/$PPID/smaps");
  return NULL;
}

int main(void) {
  int outer_uid = getuid();
  int outer_gid = getgid();
  SYSCHK(unshare(CLONE_NEWNS|CLONE_NEWUSER));
  SYSCHK(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
  write_file("/proc/self/setgroups", "deny");
  write_map("/proc/self/uid_map", outer_uid);
  write_map("/proc/self/gid_map", outer_gid);

  SYSCHK(mount("none", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, "huge=always"));
  int fd = SYSCHK(open("/tmp/a", O_RDWR|O_CREAT, 0600));
  SYSCHK(ftruncate(fd, 0x200000));
  void *ptr = SYSCHK(mmap((void*)0x200000UL, 0x100000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0));
  *(volatile char *)ptr;
  SYSCHK(mmap((void*)0x300000UL, 0x100000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0x100000));
  for (int i=0; i<512; i++)
    *(volatile char *)(0x200000UL + 0x1000 * i);
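  // At this point the 2MB huge=always tmpfs file is mapped at 0x200000 (in
  // two 1MB halves) and every page has been touched, so the whole range is
  // PTE-mapped and MADV_COLLAPSE ends up in collapse_pte_mapped_thp().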

  int uffd = SYSCHK(syscall(__NR_userfaultfd, UFFD_USER_MODE_ONLY));

  struct uffdio_api api = { .api = UFFD_API, .features = 0 };
  SYSCHK(ioctl(uffd, UFFDIO_API, &api));

  struct uffdio_register reg = {
    .range = { .start = 0x200000, .len = 0x200000 },
    .mode = UFFDIO_REGISTER_MODE_MISSING
  };
  SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg));

  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
    errx(1, "pthread_create");

  sleep(1);
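  // The collapse thread is now stalled in the injected delay (PTEs already
  // cleared, page table not yet freed): install a fresh anonymous PTE into
  // the range via UFFDIO_COPY; it then vanishes with the freed page table,
  // leaving the MM_ANONPAGES rss counter wrong.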

  unsigned char dummy_page[0x1000] = {1};
  struct uffdio_copy copy = {
    .dst = 0x201000,
    .src = (unsigned long)dummy_page,
    .len = 0x1000,
    .mode = 0
  };
  SYSCHK(ioctl(uffd, UFFDIO_COPY, &copy));

  if (pthread_join(thread, NULL))
    errx(1, "pthread_join");

  //system("cat /proc/$PPID/smaps");
}

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-14 20:36   ` [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Jann Horn
@ 2023-08-15  6:34     ` Hugh Dickins
  2023-08-15  7:11       ` David Hildenbrand
  2023-08-21 19:48     ` Hugh Dickins
  1 sibling, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2023-08-15  6:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Vishal Moola, Vlastimil Babka,
	Zi Yan, Linux ARM, sparclinux, linuxppc-dev, linux-s390,
	kernel list, Linux-MM

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.
> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/ksm.h>
> +#include <linux/delay.h>
> 
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>         }
> 
>         /* step 4: remove page table */
> +       if (strcmp(current->comm, "DELAYME") == 0) {
> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +               mdelay(5000);
> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> +       }
> 
>         /* Huge page lock is still held, so page table must remain empty */
>         pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> type:MM_ANONPAGES val:1

Thanks a lot, Jann. I haven't thought about it at all yet; and just
tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
just see "Invalid argument" on the UFFDIO_COPY ioctl.
Will investigate tomorrow.

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-15  6:34     ` Hugh Dickins
@ 2023-08-15  7:11       ` David Hildenbrand
  2023-08-15 15:41         ` Hugh Dickins
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2023-08-15  7:11 UTC (permalink / raw)
  To: Hugh Dickins, Jann Horn
  Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, Kirill A. Shutemov,
	Matthew Wilcox, Suren Baghdasaryan, Qi Zheng, Yang Shi,
	Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
	Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
	SeongJae Park, Lorenzo Stoakes, Huang Ying, Naoya Horiguchi,
	Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen,
	Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim,
	Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King,
	David S. Miller, Michael Ellerman, Aneesh Kumar K.V,
	Heiko Carstens, Christian Borntraeger, Claudio Imbrenda,
	Alexander Gordeev, Gerald Schaefer, Vasily Gorbik, Vishal Moola,
	Vlastimil Babka, Zi Yan, Linux ARM, sparclinux, linuxppc-dev,
	linux-s390, kernel list, Linux-MM

On 15.08.23 08:34, Hugh Dickins wrote:
> On Mon, 14 Aug 2023, Jann Horn wrote:
>> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
>>> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
>>> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
>>> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
>>> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
>>
>> We can still have a racing userfaultfd operation at the "/* step 4:
>> remove page table */" point that installs a new PTE before the page
>> table is removed.
>>
>> To reproduce, patch a delay into the kernel like this:
>>
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9a6e0d507759..27cc8dfbf3a7 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/swapops.h>
>>   #include <linux/shmem_fs.h>
>>   #include <linux/ksm.h>
>> +#include <linux/delay.h>
>>
>>   #include <asm/tlb.h>
>>   #include <asm/pgalloc.h>
>> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>          }
>>
>>          /* step 4: remove page table */
>> +       if (strcmp(current->comm, "DELAYME") == 0) {
>> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
>> +               mdelay(5000);
>> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
>> +       }
>>
>>          /* Huge page lock is still held, so page table must remain empty */
>>          pml = pmd_lock(mm, pmd);
>>
>>
>> And then run the attached reproducer against mm/mm-everything. You
>> should get this in dmesg:
>>
>> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
>> type:MM_ANONPAGES val:1
> 
> Thanks a lot, Jann. I haven't thought about it at all yet; and just
> tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
> just see "Invalid argument" on the UFFDIO_COPY ioctl.
> Will investigate tomorrow.

Maybe you're missing a fixup:

https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com

When the src address is not page aligned, UFFDIO_COPY in mm-unstable 
would erroneously fail.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-15  7:11       ` David Hildenbrand
@ 2023-08-15 15:41         ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-08-15 15:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Jann Horn, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Kirill A. Shutemov, Matthew Wilcox,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Vishal Moola, Vlastimil Babka,
	Zi Yan, Linux ARM, sparclinux, linuxppc-dev, linux-s390,
	kernel list, Linux-MM

On Tue, 15 Aug 2023, David Hildenbrand wrote:
> On 15.08.23 08:34, Hugh Dickins wrote:
> > On Mon, 14 Aug 2023, Jann Horn wrote:
> >>
> >>          /* step 4: remove page table */
> >> +       if (strcmp(current->comm, "DELAYME") == 0) {
> >> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> >> +               mdelay(5000);
> >> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> >> +       }
> >>
> >>          /* Huge page lock is still held, so page table must remain empty */
> >>          pml = pmd_lock(mm, pmd);
> >>
> >>
> >> And then run the attached reproducer against mm/mm-everything. You
> >> should get this in dmesg:
> >>
> >> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> >> type:MM_ANONPAGES val:1
> > 
> > Thanks a lot, Jann. I haven't thought about it at all yet; and just
> > tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
> > just see "Invalid argument" on the UFFDIO_COPY ioctl.
> > Will investigate tomorrow.
> 
> Maybe you're missing a fixup:
> 
> https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com
> 
> When the src address is not page aligned, UFFDIO_COPY in mm-unstable would
> erroneously fail.

You got it, many thanks David: I had assumed that my next-20230808 tree
would be up-to-date enough, but it wasn't.  Reproduced now.

Hugh 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  2023-08-14 20:36   ` [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Jann Horn
  2023-08-15  6:34     ` Hugh Dickins
@ 2023-08-21 19:48     ` Hugh Dickins
  1 sibling, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2023-08-21 19:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
	Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
	Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
	Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
	Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
	Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
	Song Liu, Thomas Hellstrom, Russell King, David S. Miller,
	Michael Ellerman, Aneesh Kumar K.V, Heiko Carstens,
	Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
	Gerald Schaefer, Vasily Gorbik, Vishal Moola, Vlastimil Babka,
	Zi Yan, Zach O'Keefe, Linux ARM, sparclinux, linuxppc-dev,
	linux-s390, kernel list, Linux-MM

[-- Attachment #1: Type: text/plain, Size: 2961 bytes --]

On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.

And you've been very polite not to remind me that this is exactly
what you warned me about, in connection with retract_page_tables(),
nearly three months ago:

https://lore.kernel.org/linux-mm/CAG48ez0aF1Rf1apSjn9YcnfyFQ4YqSd4GqB6f2wfhF7jMdi5Hg@mail.gmail.com/

> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/ksm.h>
> +#include <linux/delay.h>
> 
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>         }
> 
>         /* step 4: remove page table */
> +       if (strcmp(current->comm, "DELAYME") == 0) {
> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +               mdelay(5000);
> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> +       }
> 
>         /* Huge page lock is still held, so page table must remain empty */
>         pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> type:MM_ANONPAGES val:1

Very helpful, thank you Jann.

I got a bit distracted when I then found mm's recent addition of
UFFDIO_POISON: thought I needed to change both collapse_pte_mapped_thp()
and retract_page_tables() now to cope with mfill_atomic_pte_poison()
inserting into even a userfaultfd_armed shared VMA.

But eventually, on second thoughts, realized that's only inserting a pte
marker, invalid, so won't cause any actual trouble.  A little untidy,
to leave that behind in a supposedly empty page table about to be freed,
but not worth refactoring these functions to avoid a non-bug.
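
(For reference, what mfill_atomic_pte_poison() leaves behind is just a
marker entry, roughly like this - paraphrased, details may differ between
trees:

	pte_t entry = make_pte_marker(PTE_MARKER_POISONED);
	set_pte_at(dst_mm, dst_addr, dst_pte, entry);

a non-present swap-style entry rather than a mapped page, so nothing is
leaked when the page table holding it is freed.)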

And though syzbot and JH may find some fun with it, I don't think any
real application would be inserting a PTE_MARKER_POISONED where a huge
page collapse is almost complete.

So I scaled back to a more proportionate fix, following.  Sorry, I've
slightly messed up applying the "DELAY INJECTION" patch above: not
intentional, honest!  (mdelay while holding the locks is still good.)

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-08-21 19:48 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-12  4:27 [PATCH v3 00/13] mm: free retracted page table by RCU Hugh Dickins
2023-07-12  4:30 ` [PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-07-12  4:32 ` [PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-07-12  4:33 ` [PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-07-12  4:34 ` [PATCH v3 04/13] powerpc: assert_pte_locked() " Hugh Dickins
2023-07-18 10:41   ` Aneesh Kumar K.V
2023-07-19  5:04     ` Hugh Dickins
2023-07-19  5:24       ` Aneesh Kumar K V
2023-07-21 13:13         ` Jay Patel
2023-07-23 22:26           ` [PATCH v3 04/13 fix] powerpc: assert_pte_locked() use pte_offset_map_nolock(): fix Hugh Dickins
2023-07-12  4:35 ` [PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-07-12  4:37 ` [PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t Hugh Dickins
2023-07-12  4:38 ` [PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-07-13  4:47   ` Alexander Gordeev
2023-07-19 14:25   ` Claudio Imbrenda
2023-07-23 22:29     ` [PATCH v3 07/13 fix] s390: add pte_free_defer() for pgtables sharing page: fix Hugh Dickins
2023-07-12  4:39 ` [PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-07-12  4:41 ` [PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-07-12  4:42 ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-07-23 22:32   ` [PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix Hugh Dickins
2023-08-03  9:17   ` [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Qi Zheng
2023-08-06  3:55     ` Hugh Dickins
2023-08-07  2:21       ` Qi Zheng
2023-08-06  3:59     ` [PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2 Hugh Dickins
2023-08-14 20:36   ` [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Jann Horn
2023-08-15  6:34     ` Hugh Dickins
2023-08-15  7:11       ` David Hildenbrand
2023-08-15 15:41         ` Hugh Dickins
2023-08-21 19:48     ` Hugh Dickins
2023-07-12  4:43 ` [PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-07-23 22:35   ` [PATCH v3 11/13 fix] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps(): fix Hugh Dickins
2023-07-12  4:44 ` [PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins
2023-07-12  4:48   ` [PATCH mm " Hugh Dickins
2023-07-12  4:46 ` [PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]() Hugh Dickins
