All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/4] page table check fixes and cleanups
@ 2022-01-26 18:36 Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 1/4] mm/debug_vm_pgtable: remove pte entry from the page table Pasha Tatashin
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-26 18:36 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, rientjes, pjt,
	weixugc, gthelen, mingo, will, rppt, dave.hansen, hpa,
	aneesh.kumar, jirislaby, songmuchun, qydwhotmail, hughd, ziy,
	anshuman.khandual

Changelog:
v4:	- Addressed review comments from David Rientjes
	- Added Acks.
v3:	- Resolved a regression introduced in previous version, where
	  page collapse in khugepaged would cause crash on boot.
	- Addressed comments from Anshuman Khandual regarding commit
	  log.
v2:	- Addressed simplification comments from Wei Xu
	- Added Review-by/Tested-by's from Zi Yan and Wei Xu


Two fixes:

  mm/debug_vm_pgtable: remove pte entry from the page table
	- remove a pte entry from the page table at the end of
	  debug_vm_pgtable pte test

  mm/khugepaged: unify collapse pmd clear, flush and free
  mm/page_table_check: check entries at pmd levels
	- check pmd level in page_table_check for PTE regular entries
	  prior to freeing.
	  repro.c: https://gist.github.com/soleen/fdcd501d5df103976245fe84e9535087
	  config: https://gist.github.com/soleen/8a56f923c2fea9ce9c75b4e2517d4162
	  qemu_script: https://gist.github.com/soleen/f4be4795826b7ab1a51ae659582e179c
	  base image:
	  https://storage.googleapis.com/syzkaller/wheezy.img
	  https://storage.googleapis.com/syzkaller/wheezy.img.key

Small cleanup:
  mm/page_table_check: use unsigned long for page counters and cleanup

Previous versions:
v1: https://lore.kernel.org/all/20220120042513.1648831-1-pasha.tatashin@soleen.com
v2: https://lore.kernel.org/all/20220120191250.2671557-1-pasha.tatashin@soleen.com
v3: https://lore.kernel.org/all/20220126060514.1574935-1-pasha.tatashin@soleen.com

Pasha Tatashin (4):
  mm/debug_vm_pgtable: remove pte entry from the page table
  mm/page_table_check: use unsigned long for page counters and cleanup
  mm/khugepaged: unify collapse pmd clear, flush and free
  mm/page_table_check: check entries at pmd levels

 include/linux/page_table_check.h | 18 ++++++++++
 mm/debug_vm_pgtable.c            |  2 ++
 mm/khugepaged.c                  | 37 ++++++++++++---------
 mm/page_table_check.c            | 56 ++++++++++++++++----------------
 4 files changed, 69 insertions(+), 44 deletions(-)

-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4 1/4] mm/debug_vm_pgtable: remove pte entry from the page table
  2022-01-26 18:36 [PATCH v4 0/4] page table check fixes and cleanups Pasha Tatashin
@ 2022-01-26 18:36 ` Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 2/4] mm/page_table_check: use unsigned long for page counters and cleanup Pasha Tatashin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-26 18:36 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, rientjes, pjt,
	weixugc, gthelen, mingo, will, rppt, dave.hansen, hpa,
	aneesh.kumar, jirislaby, songmuchun, qydwhotmail, hughd, ziy,
	anshuman.khandual

The pte entry that is used in pte_advanced_tests() is never removed from
the page table at the end of the test.

The issue is detected by page_table_check, to repro compile kernel with
the following configs:

CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_PAGE_TABLE_CHECK=y
CONFIG_PAGE_TABLE_CHECK_ENFORCED=y

During the boot the following BUG is printed:

[    2.262821] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating
               architecture page table helpers
[    2.276826] ------------[ cut here ]------------
[    2.280426] kernel BUG at mm/page_table_check.c:162!
[    2.284118] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    2.287787] CPU: 0 PID: 1 Comm: swapper/0 Not tainted
               5.16.0-11413-g2c271fe77d52 #3
[    2.293226] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
               BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org
               04/01/2014
...

The entry should be properly removed from the page table before the page
is released to the free list.

Fixes: a5c3b9ffb0f4 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers")
Cc: stable@vger.kernel.org # 5.9+

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/debug_vm_pgtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index a7ac97c76762..db2abd9e415b 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -171,6 +171,8 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
 	ptep_test_and_clear_young(args->vma, args->vaddr, args->ptep);
 	pte = ptep_get(args->ptep);
 	WARN_ON(pte_young(pte));
+
+	ptep_get_and_clear_full(args->mm, args->vaddr, args->ptep, 1);
 }
 
 static void __init pte_savedwrite_tests(struct pgtable_debug_args *args)
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 2/4] mm/page_table_check: use unsigned long for page counters and cleanup
  2022-01-26 18:36 [PATCH v4 0/4] page table check fixes and cleanups Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 1/4] mm/debug_vm_pgtable: remove pte entry from the page table Pasha Tatashin
@ 2022-01-26 18:36 ` Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 3/4] mm/khugepaged: unify collapse pmd clear, flush and free Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels Pasha Tatashin
  3 siblings, 0 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-26 18:36 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, rientjes, pjt,
	weixugc, gthelen, mingo, will, rppt, dave.hansen, hpa,
	aneesh.kumar, jirislaby, songmuchun, qydwhotmail, hughd, ziy,
	anshuman.khandual

For the consistency, use "unsigned long" for all page counters.

Also, reduce code duplication by calling
__page_table_check_*_clear() from __page_table_check_*_set() functions.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/page_table_check.c | 35 +++++++----------------------------
 1 file changed, 7 insertions(+), 28 deletions(-)

diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 7504e7caa2a1..c61d7ebe13b1 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -86,8 +86,8 @@ static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
 {
 	struct page_ext *page_ext;
 	struct page *page;
+	unsigned long i;
 	bool anon;
-	int i;
 
 	if (!pfn_valid(pfn))
 		return;
@@ -121,8 +121,8 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 {
 	struct page_ext *page_ext;
 	struct page *page;
+	unsigned long i;
 	bool anon;
-	int i;
 
 	if (!pfn_valid(pfn))
 		return;
@@ -152,10 +152,10 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 void __page_table_check_zero(struct page *page, unsigned int order)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
-	int i;
+	unsigned long i;
 
 	BUG_ON(!page_ext);
-	for (i = 0; i < (1 << order); i++) {
+	for (i = 0; i < (1ul << order); i++) {
 		struct page_table_check *ptc = get_page_table_check(page_ext);
 
 		BUG_ON(atomic_read(&ptc->anon_map_count));
@@ -206,17 +206,10 @@ EXPORT_SYMBOL(__page_table_check_pud_clear);
 void __page_table_check_pte_set(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte)
 {
-	pte_t old_pte;
-
 	if (&init_mm == mm)
 		return;
 
-	old_pte = *ptep;
-	if (pte_user_accessible_page(old_pte)) {
-		page_table_check_clear(mm, addr, pte_pfn(old_pte),
-				       PAGE_SIZE >> PAGE_SHIFT);
-	}
-
+	__page_table_check_pte_clear(mm, addr, *ptep);
 	if (pte_user_accessible_page(pte)) {
 		page_table_check_set(mm, addr, pte_pfn(pte),
 				     PAGE_SIZE >> PAGE_SHIFT,
@@ -228,17 +221,10 @@ EXPORT_SYMBOL(__page_table_check_pte_set);
 void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr,
 				pmd_t *pmdp, pmd_t pmd)
 {
-	pmd_t old_pmd;
-
 	if (&init_mm == mm)
 		return;
 
-	old_pmd = *pmdp;
-	if (pmd_user_accessible_page(old_pmd)) {
-		page_table_check_clear(mm, addr, pmd_pfn(old_pmd),
-				       PMD_PAGE_SIZE >> PAGE_SHIFT);
-	}
-
+	__page_table_check_pmd_clear(mm, addr, *pmdp);
 	if (pmd_user_accessible_page(pmd)) {
 		page_table_check_set(mm, addr, pmd_pfn(pmd),
 				     PMD_PAGE_SIZE >> PAGE_SHIFT,
@@ -250,17 +236,10 @@ EXPORT_SYMBOL(__page_table_check_pmd_set);
 void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
 				pud_t *pudp, pud_t pud)
 {
-	pud_t old_pud;
-
 	if (&init_mm == mm)
 		return;
 
-	old_pud = *pudp;
-	if (pud_user_accessible_page(old_pud)) {
-		page_table_check_clear(mm, addr, pud_pfn(old_pud),
-				       PUD_PAGE_SIZE >> PAGE_SHIFT);
-	}
-
+	__page_table_check_pud_clear(mm, addr, *pudp);
 	if (pud_user_accessible_page(pud)) {
 		page_table_check_set(mm, addr, pud_pfn(pud),
 				     PUD_PAGE_SIZE >> PAGE_SHIFT,
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 3/4] mm/khugepaged: unify collapse pmd clear, flush and free
  2022-01-26 18:36 [PATCH v4 0/4] page table check fixes and cleanups Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 1/4] mm/debug_vm_pgtable: remove pte entry from the page table Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 2/4] mm/page_table_check: use unsigned long for page counters and cleanup Pasha Tatashin
@ 2022-01-26 18:36 ` Pasha Tatashin
  2022-01-26 18:36 ` [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels Pasha Tatashin
  3 siblings, 0 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-26 18:36 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, rientjes, pjt,
	weixugc, gthelen, mingo, will, rppt, dave.hansen, hpa,
	aneesh.kumar, jirislaby, songmuchun, qydwhotmail, hughd, ziy,
	anshuman.khandual

Unify the code that flushes, clears pmd entry, and frees the PTE table
level into a new function collapse_and_free_pmd().

This clean-up is useful as in the next patch we will add another call to
this function to iterate through PTE prior to freeing the level for page
table check.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/khugepaged.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..30e59e4af272 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1416,6 +1416,19 @@ static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return 0;
 }
 
+static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long addr, pmd_t *pmdp)
+{
+	spinlock_t *ptl;
+	pmd_t pmd;
+
+	ptl = pmd_lock(vma->vm_mm, pmdp);
+	pmd = pmdp_collapse_flush(vma, addr, pmdp);
+	spin_unlock(ptl);
+	mm_dec_nr_ptes(mm);
+	pte_free(mm, pmd_pgtable(pmd));
+}
+
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1433,7 +1446,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	struct vm_area_struct *vma = find_vma(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd, _pmd;
+	pmd_t *pmd;
 	spinlock_t *ptl;
 	int count = 0;
 	int i;
@@ -1509,12 +1522,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	}
 
 	/* step 4: collapse pmd */
-	ptl = pmd_lock(vma->vm_mm, pmd);
-	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
-	spin_unlock(ptl);
-	mm_dec_nr_ptes(mm);
-	pte_free(mm, pmd_pgtable(_pmd));
-
+	collapse_and_free_pmd(mm, vma, haddr, pmd);
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
@@ -1552,7 +1560,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long addr;
-	pmd_t *pmd, _pmd;
+	pmd_t *pmd;
 
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1591,14 +1599,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (mmap_write_trylock(mm)) {
-			if (!khugepaged_test_exit(mm)) {
-				spinlock_t *ptl = pmd_lock(mm, pmd);
-				/* assume page table is clear */
-				_pmd = pmdp_collapse_flush(vma, addr, pmd);
-				spin_unlock(ptl);
-				mm_dec_nr_ptes(mm);
-				pte_free(mm, pmd_pgtable(_pmd));
-			}
+			if (!khugepaged_test_exit(mm))
+				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
 			/* Try again later */
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels
  2022-01-26 18:36 [PATCH v4 0/4] page table check fixes and cleanups Pasha Tatashin
                   ` (2 preceding siblings ...)
  2022-01-26 18:36 ` [PATCH v4 3/4] mm/khugepaged: unify collapse pmd clear, flush and free Pasha Tatashin
@ 2022-01-26 18:36 ` Pasha Tatashin
  2022-01-26 19:26   ` David Rientjes
  2022-01-28 23:39   ` Wei Xu
  3 siblings, 2 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-26 18:36 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, rientjes, pjt,
	weixugc, gthelen, mingo, will, rppt, dave.hansen, hpa,
	aneesh.kumar, jirislaby, songmuchun, qydwhotmail, hughd, ziy,
	anshuman.khandual

syzbot detected a case where the page table counters were not properly
updated.

syzkaller login:  ------------[ cut here ]------------
kernel BUG at mm/page_table_check.c:162!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4
RIP: 0010:__page_table_check_zero+0x159/0x1a0
Code: 7d 3a b2 ff 45 39 f5 74 2a e8 43 38 b2 ff 4d 85 e4 01
RSP: 0018:ffff888010667418 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000
RDX: ffff88800cea8680 RSI: ffffffff81becaf9 RDI: 0000000003
RBP: ffff888010667450 R08: 0000000000000001 R09: 0000000000
R10: ffffffff81becaab R11: 0000000000000001 R12: ffff888008
R13: 0000000000000001 R14: 0000000000000200 R15: dffffc0000
FS:  0000000000000000(0000) GS:ffff888035e00000(0000) knlG0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd875cad00 CR3: 00000000094ce000 CR4: 0000000000
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000
Call Trace:
 <TASK>
 free_pcp_prepare+0x3be/0xaa0
 free_unref_page+0x1c/0x650
 ? trace_hardirqs_on+0x6a/0x1d0
 free_compound_page+0xec/0x130
 free_transhuge_page+0x1be/0x260
 __put_compound_page+0x90/0xd0
 release_pages+0x54c/0x1060
 ? filemap_remove_folio+0x161/0x210
 ? lock_downgrade+0x720/0x720
 ? __put_page+0x150/0x150
 ? filemap_free_folio+0x164/0x350
 __pagevec_release+0x7c/0x110
 shmem_undo_range+0x85e/0x1250
...

The repro involved having a huge page that is split due to uprobe event
temporarily replacing one of the pages in the huge page. Later the huge
page was combined again, but the counters were off, as the PTE level
was not properly updated.

Make sure that when PMD is cleared and prior to freeing the level the
PTEs are updated.

Fixes: df4e817b7108 ("mm: page table check")

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/page_table_check.h | 18 ++++++++++++++++++
 mm/khugepaged.c                  |  3 +++
 mm/page_table_check.c            | 21 +++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
index 38cace1da7b6..e88bbe37727b 100644
--- a/include/linux/page_table_check.h
+++ b/include/linux/page_table_check.h
@@ -26,6 +26,8 @@ void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr,
 				pmd_t *pmdp, pmd_t pmd);
 void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
 				pud_t *pudp, pud_t pud);
+void __page_table_check_pmd_clear_full(struct mm_struct *mm, unsigned long addr,
+				       pmd_t pmd);
 
 static inline void page_table_check_alloc(struct page *page, unsigned int order)
 {
@@ -100,6 +102,16 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
 	__page_table_check_pud_set(mm, addr, pudp, pud);
 }
 
+static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
+						   unsigned long addr,
+						   pmd_t pmd)
+{
+	if (static_branch_likely(&page_table_check_disabled))
+		return;
+
+	__page_table_check_pmd_clear_full(mm, addr, pmd);
+}
+
 #else
 
 static inline void page_table_check_alloc(struct page *page, unsigned int order)
@@ -143,5 +155,11 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
 {
 }
 
+static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
+						   unsigned long addr,
+						   pmd_t pmd)
+{
+}
+
 #endif /* CONFIG_PAGE_TABLE_CHECK */
 #endif /* __LINUX_PAGE_TABLE_CHECK_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 30e59e4af272..d84977c6dc0d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -16,6 +16,7 @@
 #include <linux/hashtable.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
+#include <linux/page_table_check.h>
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 
@@ -1422,10 +1423,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 	spinlock_t *ptl;
 	pmd_t pmd;
 
+	mmap_assert_write_locked(mm);
 	ptl = pmd_lock(vma->vm_mm, pmdp);
 	pmd = pmdp_collapse_flush(vma, addr, pmdp);
 	spin_unlock(ptl);
 	mm_dec_nr_ptes(mm);
+	page_table_check_pmd_clear_full(mm, addr, pmd);
 	pte_free(mm, pmd_pgtable(pmd));
 }
 
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index c61d7ebe13b1..251f95a808b4 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -247,3 +247,24 @@ void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
 	}
 }
 EXPORT_SYMBOL(__page_table_check_pud_set);
+
+void __page_table_check_pmd_clear_full(struct mm_struct *mm, unsigned long addr,
+				       pmd_t pmd)
+{
+	if (&init_mm == mm)
+		return;
+
+	if (!pmd_bad(pmd) && !pmd_leaf(pmd)) {
+		pte_t *ptep = pte_offset_map(&pmd, addr);
+		unsigned long i;
+
+		pte_unmap(ptep);
+		for (i = 0; i < PTRS_PER_PTE; i++) {
+			__page_table_check_pte_clear(mm, addr, *ptep);
+			addr += PAGE_SIZE;
+			ptep++;
+		}
+	} else {
+		__page_table_check_pmd_clear(mm, addr, pmd);
+	}
+}
-- 
2.35.0.rc0.227.g00780c9af4-goog


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels
  2022-01-26 18:36 ` [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels Pasha Tatashin
@ 2022-01-26 19:26   ` David Rientjes
  2022-01-28 23:39   ` Wei Xu
  1 sibling, 0 replies; 8+ messages in thread
From: David Rientjes @ 2022-01-26 19:26 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, linux-mm, akpm, pjt, weixugc, gthelen, mingo, will,
	rppt, dave.hansen, hpa, aneesh.kumar, jirislaby, songmuchun,
	qydwhotmail, hughd, ziy, anshuman.khandual

On Wed, 26 Jan 2022, Pasha Tatashin wrote:

> syzbot detected a case where the page table counters were not properly
> updated.
> 
> syzkaller login:  ------------[ cut here ]------------
> kernel BUG at mm/page_table_check.c:162!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4
> RIP: 0010:__page_table_check_zero+0x159/0x1a0
> Code: 7d 3a b2 ff 45 39 f5 74 2a e8 43 38 b2 ff 4d 85 e4 01
> RSP: 0018:ffff888010667418 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000
> RDX: ffff88800cea8680 RSI: ffffffff81becaf9 RDI: 0000000003
> RBP: ffff888010667450 R08: 0000000000000001 R09: 0000000000
> R10: ffffffff81becaab R11: 0000000000000001 R12: ffff888008
> R13: 0000000000000001 R14: 0000000000000200 R15: dffffc0000
> FS:  0000000000000000(0000) GS:ffff888035e00000(0000) knlG0
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffd875cad00 CR3: 00000000094ce000 CR4: 0000000000
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000
> Call Trace:
>  <TASK>
>  free_pcp_prepare+0x3be/0xaa0
>  free_unref_page+0x1c/0x650
>  ? trace_hardirqs_on+0x6a/0x1d0
>  free_compound_page+0xec/0x130
>  free_transhuge_page+0x1be/0x260
>  __put_compound_page+0x90/0xd0
>  release_pages+0x54c/0x1060
>  ? filemap_remove_folio+0x161/0x210
>  ? lock_downgrade+0x720/0x720
>  ? __put_page+0x150/0x150
>  ? filemap_free_folio+0x164/0x350
>  __pagevec_release+0x7c/0x110
>  shmem_undo_range+0x85e/0x1250
> ...
> 
> The repro involved having a huge page that is split due to uprobe event
> temporarily replacing one of the pages in the huge page. Later the huge
> page was combined again, but the counters were off, as the PTE level
> was not properly updated.
> 
> Make sure that when PMD is cleared and prior to freeing the level the
> PTEs are updated.
> 
> Fixes: df4e817b7108 ("mm: page table check")
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels
  2022-01-26 18:36 ` [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels Pasha Tatashin
  2022-01-26 19:26   ` David Rientjes
@ 2022-01-28 23:39   ` Wei Xu
  2022-01-31 20:16     ` Pasha Tatashin
  1 sibling, 1 reply; 8+ messages in thread
From: Wei Xu @ 2022-01-28 23:39 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	David Rientjes, Paul Turner, Greg Thelen, mingo, will, rppt,
	Dave Hansen, hpa, aneesh.kumar, jirislaby, songmuchun,
	qydwhotmail, Hugh Dickins, Zi Yan, anshuman.khandual

On Wed, Jan 26, 2022 at 10:36 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> syzbot detected a case where the page table counters were not properly
> updated.
>
> syzkaller login:  ------------[ cut here ]------------
> kernel BUG at mm/page_table_check.c:162!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4
> RIP: 0010:__page_table_check_zero+0x159/0x1a0
> Code: 7d 3a b2 ff 45 39 f5 74 2a e8 43 38 b2 ff 4d 85 e4 01
> RSP: 0018:ffff888010667418 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000
> RDX: ffff88800cea8680 RSI: ffffffff81becaf9 RDI: 0000000003
> RBP: ffff888010667450 R08: 0000000000000001 R09: 0000000000
> R10: ffffffff81becaab R11: 0000000000000001 R12: ffff888008
> R13: 0000000000000001 R14: 0000000000000200 R15: dffffc0000
> FS:  0000000000000000(0000) GS:ffff888035e00000(0000) knlG0
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffd875cad00 CR3: 00000000094ce000 CR4: 0000000000
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000
> Call Trace:
>  <TASK>
>  free_pcp_prepare+0x3be/0xaa0
>  free_unref_page+0x1c/0x650
>  ? trace_hardirqs_on+0x6a/0x1d0
>  free_compound_page+0xec/0x130
>  free_transhuge_page+0x1be/0x260
>  __put_compound_page+0x90/0xd0
>  release_pages+0x54c/0x1060
>  ? filemap_remove_folio+0x161/0x210
>  ? lock_downgrade+0x720/0x720
>  ? __put_page+0x150/0x150
>  ? filemap_free_folio+0x164/0x350
>  __pagevec_release+0x7c/0x110
>  shmem_undo_range+0x85e/0x1250
> ...
>
> The repro involved having a huge page that is split due to uprobe event
> temporarily replacing one of the pages in the huge page. Later the huge
> page was combined again, but the counters were off, as the PTE level
> was not properly updated.
>
> Make sure that when PMD is cleared and prior to freeing the level the
> PTEs are updated.
>
> Fixes: df4e817b7108 ("mm: page table check")
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/page_table_check.h | 18 ++++++++++++++++++
>  mm/khugepaged.c                  |  3 +++
>  mm/page_table_check.c            | 21 +++++++++++++++++++++
>  3 files changed, 42 insertions(+)
>
> diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
> index 38cace1da7b6..e88bbe37727b 100644
> --- a/include/linux/page_table_check.h
> +++ b/include/linux/page_table_check.h
> @@ -26,6 +26,8 @@ void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr,
>                                 pmd_t *pmdp, pmd_t pmd);
>  void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
>                                 pud_t *pudp, pud_t pud);
> +void __page_table_check_pmd_clear_full(struct mm_struct *mm, unsigned long addr,
> +                                      pmd_t pmd);
>
>  static inline void page_table_check_alloc(struct page *page, unsigned int order)
>  {
> @@ -100,6 +102,16 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
>         __page_table_check_pud_set(mm, addr, pudp, pud);
>  }
>
> +static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
> +                                                  unsigned long addr,
> +                                                  pmd_t pmd)
> +{
> +       if (static_branch_likely(&page_table_check_disabled))
> +               return;
> +
> +       __page_table_check_pmd_clear_full(mm, addr, pmd);
> +}
> +
>  #else
>
>  static inline void page_table_check_alloc(struct page *page, unsigned int order)
> @@ -143,5 +155,11 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
>  {
>  }
>
> +static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
> +                                                  unsigned long addr,
> +                                                  pmd_t pmd)
> +{
> +}
> +
>  #endif /* CONFIG_PAGE_TABLE_CHECK */
>  #endif /* __LINUX_PAGE_TABLE_CHECK_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 30e59e4af272..d84977c6dc0d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -16,6 +16,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/page_idle.h>
> +#include <linux/page_table_check.h>
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>
> @@ -1422,10 +1423,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
>         spinlock_t *ptl;
>         pmd_t pmd;
>
> +       mmap_assert_write_locked(mm);
>         ptl = pmd_lock(vma->vm_mm, pmdp);
>         pmd = pmdp_collapse_flush(vma, addr, pmdp);
>         spin_unlock(ptl);
>         mm_dec_nr_ptes(mm);
> +       page_table_check_pmd_clear_full(mm, addr, pmd);

pmdp_collapse_flush() already calls page_table_check_pmd_clear() via
pmdp_huge_get_and_clean().  Both pmdp_table_check_pmd_clear() and
page_table_check_pmd_clear_full() can call
__page_table_check_pmd_clear(). If that happens, then the page table
check counters can be messed up.  Certainly, there is no bug here
because the pmd is not huge and __page_table_check_pmd_clear() should
be skipped in both calls. But it would be better to avoid this
unnecessary subtlety by renaming page_table_check_pmd_clear_full() to
page_table_check_clear_pte_range() and not calling
__page_table_check_pmd_clear() there.  To make the code even more
clear, __page_table_check_pmd_clear() can also be renamed as
__page_table_check_huge_pmd_clear() (similar for its callers).

>         pte_free(mm, pmd_pgtable(pmd));
>  }
>
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index c61d7ebe13b1..251f95a808b4 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -247,3 +247,24 @@ void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
>         }
>  }
>  EXPORT_SYMBOL(__page_table_check_pud_set);
> +
> +void __page_table_check_pmd_clear_full(struct mm_struct *mm, unsigned long addr,
> +                                      pmd_t pmd)
> +{
> +       if (&init_mm == mm)
> +               return;
> +
> +       if (!pmd_bad(pmd) && !pmd_leaf(pmd)) {
> +               pte_t *ptep = pte_offset_map(&pmd, addr);
> +               unsigned long i;
> +
> +               pte_unmap(ptep);
> +               for (i = 0; i < PTRS_PER_PTE; i++) {
> +                       __page_table_check_pte_clear(mm, addr, *ptep);
> +                       addr += PAGE_SIZE;
> +                       ptep++;
> +               }
> +       } else {
> +               __page_table_check_pmd_clear(mm, addr, pmd);
> +       }
> +}
> --
> 2.35.0.rc0.227.g00780c9af4-goog
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels
  2022-01-28 23:39   ` Wei Xu
@ 2022-01-31 20:16     ` Pasha Tatashin
  0 siblings, 0 replies; 8+ messages in thread
From: Pasha Tatashin @ 2022-01-31 20:16 UTC (permalink / raw)
  To: Wei Xu
  Cc: Linux Kernel Mailing List, Linux MM, Andrew Morton,
	David Rientjes, Paul Turner, Greg Thelen, Ingo Molnar,
	Will Deacon, Mike Rapoport, Dave Hansen, H. Peter Anvin,
	Aneesh Kumar K.V, Jiri Slaby, Muchun Song, Fusion Future,
	Hugh Dickins, Zi Yan, Anshuman Khandual

On Fri, Jan 28, 2022 at 6:39 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Jan 26, 2022 at 10:36 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > syzbot detected a case where the page table counters were not properly
> > updated.
> >
> > syzkaller login:  ------------[ cut here ]------------
> > kernel BUG at mm/page_table_check.c:162!
> > invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 3099 Comm: pasha Not tainted 5.16.0+ #48
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO4
> > RIP: 0010:__page_table_check_zero+0x159/0x1a0
> > Code: 7d 3a b2 ff 45 39 f5 74 2a e8 43 38 b2 ff 4d 85 e4 01
> > RSP: 0018:ffff888010667418 EFLAGS: 00010293
> > RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000
> > RDX: ffff88800cea8680 RSI: ffffffff81becaf9 RDI: 0000000003
> > RBP: ffff888010667450 R08: 0000000000000001 R09: 0000000000
> > R10: ffffffff81becaab R11: 0000000000000001 R12: ffff888008
> > R13: 0000000000000001 R14: 0000000000000200 R15: dffffc0000
> > FS:  0000000000000000(0000) GS:ffff888035e00000(0000) knlG0
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007ffd875cad00 CR3: 00000000094ce000 CR4: 0000000000
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000
> > Call Trace:
> >  <TASK>
> >  free_pcp_prepare+0x3be/0xaa0
> >  free_unref_page+0x1c/0x650
> >  ? trace_hardirqs_on+0x6a/0x1d0
> >  free_compound_page+0xec/0x130
> >  free_transhuge_page+0x1be/0x260
> >  __put_compound_page+0x90/0xd0
> >  release_pages+0x54c/0x1060
> >  ? filemap_remove_folio+0x161/0x210
> >  ? lock_downgrade+0x720/0x720
> >  ? __put_page+0x150/0x150
> >  ? filemap_free_folio+0x164/0x350
> >  __pagevec_release+0x7c/0x110
> >  shmem_undo_range+0x85e/0x1250
> > ...
> >
> > The repro involved having a huge page that is split due to uprobe event
> > temporarily replacing one of the pages in the huge page. Later the huge
> > page was combined again, but the counters were off, as the PTE level
> > was not properly updated.
> >
> > Make sure that when PMD is cleared and prior to freeing the level the
> > PTEs are updated.
> >
> > Fixes: df4e817b7108 ("mm: page table check")
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  include/linux/page_table_check.h | 18 ++++++++++++++++++
> >  mm/khugepaged.c                  |  3 +++
> >  mm/page_table_check.c            | 21 +++++++++++++++++++++
> >  3 files changed, 42 insertions(+)
> >
> > diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
> > index 38cace1da7b6..e88bbe37727b 100644
> > --- a/include/linux/page_table_check.h
> > +++ b/include/linux/page_table_check.h
> > @@ -26,6 +26,8 @@ void __page_table_check_pmd_set(struct mm_struct *mm, unsigned long addr,
> >                                 pmd_t *pmdp, pmd_t pmd);
> >  void __page_table_check_pud_set(struct mm_struct *mm, unsigned long addr,
> >                                 pud_t *pudp, pud_t pud);
> > +void __page_table_check_pmd_clear_full(struct mm_struct *mm, unsigned long addr,
> > +                                      pmd_t pmd);
> >
> >  static inline void page_table_check_alloc(struct page *page, unsigned int order)
> >  {
> > @@ -100,6 +102,16 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
> >         __page_table_check_pud_set(mm, addr, pudp, pud);
> >  }
> >
> > +static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
> > +                                                  unsigned long addr,
> > +                                                  pmd_t pmd)
> > +{
> > +       if (static_branch_likely(&page_table_check_disabled))
> > +               return;
> > +
> > +       __page_table_check_pmd_clear_full(mm, addr, pmd);
> > +}
> > +
> >  #else
> >
> >  static inline void page_table_check_alloc(struct page *page, unsigned int order)
> > @@ -143,5 +155,11 @@ static inline void page_table_check_pud_set(struct mm_struct *mm,
> >  {
> >  }
> >
> > +static inline void page_table_check_pmd_clear_full(struct mm_struct *mm,
> > +                                                  unsigned long addr,
> > +                                                  pmd_t pmd)
> > +{
> > +}
> > +
> >  #endif /* CONFIG_PAGE_TABLE_CHECK */
> >  #endif /* __LINUX_PAGE_TABLE_CHECK_H */
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 30e59e4af272..d84977c6dc0d 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/hashtable.h>
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/page_table_check.h>
> >  #include <linux/swapops.h>
> >  #include <linux/shmem_fs.h>
> >
> > @@ -1422,10 +1423,12 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> >         spinlock_t *ptl;
> >         pmd_t pmd;
> >
> > +       mmap_assert_write_locked(mm);
> >         ptl = pmd_lock(vma->vm_mm, pmdp);
> >         pmd = pmdp_collapse_flush(vma, addr, pmdp);
> >         spin_unlock(ptl);
> >         mm_dec_nr_ptes(mm);
> > +       page_table_check_pmd_clear_full(mm, addr, pmd);
>

Hi Wei,

Thank you for your feedback,

> pmdp_collapse_flush() already calls page_table_check_pmd_clear() via
> pmdp_huge_get_and_clean().  Both pmdp_table_check_pmd_clear() and
> page_table_check_pmd_clear_full() can call
> __page_table_check_pmd_clear(). If that happens, then the page table
> check counters can be messed up.  Certainly, there is no bug here
> because the pmd is not huge and __page_table_check_pmd_clear() should
> be skipped in both calls. But it would be better to avoid this
> unnecessary subtlety by renaming page_table_check_pmd_clear_full() to
> page_table_check_clear_pte_range() and not calling
> __page_table_check_pmd_clear() there.  To make the code even more

Makes sense, I will rename page_table_check_pmd_clear_full() to
page_table_check_clear_pte_range()
and remove the call to __page_table_check_pmd_clear().

> clear, __page_table_check_pmd_clear() can also be renamed as
> __page_table_check_huge_pmd_clear() (similar for its callers).

Let's keep the current names for now as this does not affect the bug
fix. Perhaps, I can rename it later when working on ARM64 support.

Pasha

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-01-31 20:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26 18:36 [PATCH v4 0/4] page table check fixes and cleanups Pasha Tatashin
2022-01-26 18:36 ` [PATCH v4 1/4] mm/debug_vm_pgtable: remove pte entry from the page table Pasha Tatashin
2022-01-26 18:36 ` [PATCH v4 2/4] mm/page_table_check: use unsigned long for page counters and cleanup Pasha Tatashin
2022-01-26 18:36 ` [PATCH v4 3/4] mm/khugepaged: unify collapse pmd clear, flush and free Pasha Tatashin
2022-01-26 18:36 ` [PATCH v4 4/4] mm/page_table_check: check entries at pmd levels Pasha Tatashin
2022-01-26 19:26   ` David Rientjes
2022-01-28 23:39   ` Wei Xu
2022-01-31 20:16     ` Pasha Tatashin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.