[PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-01-13  9:44 ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

HVO was previously disabled on arm64 [1] due to the lack of necessary
BBM(break-before-make) logic when changing page tables.
This set of patches fix this by adding necessary BBM sequence when
changing page table, and supporting vmemmap page fault handling to
fixup kernel address translation fault if vmemmap is concurrently accessed.

I have tested this patch set with concurrently accessing the vmemmap
address when do BBM and can recover by vmemmap fault handler. Also
tested under the config of 2/3/4 pgtable levels with 4K/64K page size
and all works well.

V3:
Cleanup:
Move the declarations of helper function in proper head file.

V2:
This version mainly changes some naming, and uses more appropriate helper
functions to make the code more clean, according to review comments from
Muchun Song and Kefeng Wang.

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Nanyong Sun (3):
  mm: HVO: introduce helper function to update and flush pgtable
  arm64: mm: HVO: support BBM of vmemmap pgtable safely
  arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP

 arch/arm64/Kconfig                |  1 +
 arch/arm64/include/asm/esr.h      |  4 ++
 arch/arm64/include/asm/pgtable.h  |  8 ++++
 arch/arm64/include/asm/tlbflush.h | 16 +++++++
 arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
 arch/arm64/mm/mmu.c               | 28 +++++++++++
 mm/hugetlb_vmemmap.c              | 55 +++++++++++++++++-----
 7 files changed, 175 insertions(+), 15 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-01-13  9:44 ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

HVO was previously disabled on arm64 [1] due to the lack of necessary
BBM(break-before-make) logic when changing page tables.
This set of patches fix this by adding necessary BBM sequence when
changing page table, and supporting vmemmap page fault handling to
fixup kernel address translation fault if vmemmap is concurrently accessed.

I have tested this patch set with concurrently accessing the vmemmap
address when do BBM and can recover by vmemmap fault handler. Also
tested under the config of 2/3/4 pgtable levels with 4K/64K page size
and all works well.

V3:
Cleanup:
Move the declarations of helper function in proper head file.

V2:
This version mainly changes some naming, and uses more appropriate helper
functions to make the code more clean, according to review comments from
Muchun Song and Kefeng Wang.

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Nanyong Sun (3):
  mm: HVO: introduce helper function to update and flush pgtable
  arm64: mm: HVO: support BBM of vmemmap pgtable safely
  arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP

 arch/arm64/Kconfig                |  1 +
 arch/arm64/include/asm/esr.h      |  4 ++
 arch/arm64/include/asm/pgtable.h  |  8 ++++
 arch/arm64/include/asm/tlbflush.h | 16 +++++++
 arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
 arch/arm64/mm/mmu.c               | 28 +++++++++++
 mm/hugetlb_vmemmap.c              | 55 +++++++++++++++++-----
 7 files changed, 175 insertions(+), 15 deletions(-)

-- 
2.25.1

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v3 1/3] mm: HVO: introduce helper function to update and flush pgtable
  2024-01-13  9:44 ` Nanyong Sun
@ 2024-01-13  9:44   ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Add pmd/pte update and tlb flush helper function to update page
table. This refactoring patch is designed to facilitate each
architecture to implement its own special logic in preparation
for the arm64 architecture to follow the necessary break-before-make
sequence when updating page tables.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 55 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..f1f5702bce4f 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -46,6 +46,37 @@ struct vmemmap_remap_walk {
 	unsigned long		flags;
 };
 
+#ifndef vmemmap_update_pmd
+static inline void vmemmap_update_pmd(unsigned long addr,
+				      pmd_t *pmdp, pte_t *ptep)
+{
+	pmd_populate_kernel(&init_mm, pmdp, ptep);
+}
+#endif
+
+#ifndef vmemmap_update_pte
+static inline void vmemmap_update_pte(unsigned long addr,
+				      pte_t *ptep, pte_t pte)
+{
+	set_pte_at(&init_mm, addr, ptep, pte);
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_all
+static inline void vmemmap_flush_tlb_all(void)
+{
+	flush_tlb_all();
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_range
+static inline void vmemmap_flush_tlb_range(unsigned long start,
+					   unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#endif
+
 static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 			     struct vmemmap_remap_walk *walk)
 {
@@ -81,9 +112,9 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
-		pmd_populate_kernel(&init_mm, pmd, pgtable);
+		vmemmap_update_pmd(start, pmd, pgtable);
 		if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
-			flush_tlb_kernel_range(start, start + PMD_SIZE);
+			vmemmap_flush_tlb_range(start, start + PMD_SIZE);
 	} else {
 		pte_free_kernel(&init_mm, pgtable);
 	}
@@ -171,7 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 		return ret;
 
 	if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
-		flush_tlb_kernel_range(start, end);
+		vmemmap_flush_tlb_range(start, end);
 
 	return 0;
 }
@@ -217,15 +248,15 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 
 		/*
 		 * Makes sure that preceding stores to the page contents from
-		 * vmemmap_remap_free() become visible before the set_pte_at()
-		 * write.
+		 * vmemmap_remap_free() become visible before the
+		 * vmemmap_update_pte() write.
 		 */
 		smp_wmb();
 	}
 
 	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
+	vmemmap_update_pte(addr, pte, entry);
 }
 
 /*
@@ -264,10 +295,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 
 	/*
 	 * Makes sure that preceding stores to the page contents become visible
-	 * before the set_pte_at() write.
+	 * before the vmemmap_update_pte() write.
 	 */
 	smp_wmb();
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	vmemmap_update_pte(addr, pte, mk_pte(page, pgprot));
 }
 
 /**
@@ -519,7 +550,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	}
 
 	if (restored)
-		flush_tlb_all();
+		vmemmap_flush_tlb_all();
 	if (!ret)
 		ret = restored;
 	return ret;
@@ -642,7 +673,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 			break;
 	}
 
-	flush_tlb_all();
+	vmemmap_flush_tlb_all();
 
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
@@ -659,7 +690,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 		 * allowing more vmemmap remaps to occur.
 		 */
 		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
-			flush_tlb_all();
+			vmemmap_flush_tlb_all();
 			free_vmemmap_page_list(&vmemmap_pages);
 			INIT_LIST_HEAD(&vmemmap_pages);
 			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
@@ -667,7 +698,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 		}
 	}
 
-	flush_tlb_all();
+	vmemmap_flush_tlb_all();
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 1/3] mm: HVO: introduce helper function to update and flush pgtable
@ 2024-01-13  9:44   ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Add pmd/pte update and tlb flush helper function to update page
table. This refactoring patch is designed to facilitate each
architecture to implement its own special logic in preparation
for the arm64 architecture to follow the necessary break-before-make
sequence when updating page tables.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 55 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..f1f5702bce4f 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -46,6 +46,37 @@ struct vmemmap_remap_walk {
 	unsigned long		flags;
 };
 
+#ifndef vmemmap_update_pmd
+static inline void vmemmap_update_pmd(unsigned long addr,
+				      pmd_t *pmdp, pte_t *ptep)
+{
+	pmd_populate_kernel(&init_mm, pmdp, ptep);
+}
+#endif
+
+#ifndef vmemmap_update_pte
+static inline void vmemmap_update_pte(unsigned long addr,
+				      pte_t *ptep, pte_t pte)
+{
+	set_pte_at(&init_mm, addr, ptep, pte);
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_all
+static inline void vmemmap_flush_tlb_all(void)
+{
+	flush_tlb_all();
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_range
+static inline void vmemmap_flush_tlb_range(unsigned long start,
+					   unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#endif
+
 static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 			     struct vmemmap_remap_walk *walk)
 {
@@ -81,9 +112,9 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
-		pmd_populate_kernel(&init_mm, pmd, pgtable);
+		vmemmap_update_pmd(start, pmd, pgtable);
 		if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
-			flush_tlb_kernel_range(start, start + PMD_SIZE);
+			vmemmap_flush_tlb_range(start, start + PMD_SIZE);
 	} else {
 		pte_free_kernel(&init_mm, pgtable);
 	}
@@ -171,7 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 		return ret;
 
 	if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
-		flush_tlb_kernel_range(start, end);
+		vmemmap_flush_tlb_range(start, end);
 
 	return 0;
 }
@@ -217,15 +248,15 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 
 		/*
 		 * Makes sure that preceding stores to the page contents from
-		 * vmemmap_remap_free() become visible before the set_pte_at()
-		 * write.
+		 * vmemmap_remap_free() become visible before the
+		 * vmemmap_update_pte() write.
 		 */
 		smp_wmb();
 	}
 
 	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
+	vmemmap_update_pte(addr, pte, entry);
 }
 
 /*
@@ -264,10 +295,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 
 	/*
 	 * Makes sure that preceding stores to the page contents become visible
-	 * before the set_pte_at() write.
+	 * before the vmemmap_update_pte() write.
 	 */
 	smp_wmb();
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	vmemmap_update_pte(addr, pte, mk_pte(page, pgprot));
 }
 
 /**
@@ -519,7 +550,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	}
 
 	if (restored)
-		flush_tlb_all();
+		vmemmap_flush_tlb_all();
 	if (!ret)
 		ret = restored;
 	return ret;
@@ -642,7 +673,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 			break;
 	}
 
-	flush_tlb_all();
+	vmemmap_flush_tlb_all();
 
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
@@ -659,7 +690,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 		 * allowing more vmemmap remaps to occur.
 		 */
 		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
-			flush_tlb_all();
+			vmemmap_flush_tlb_all();
 			free_vmemmap_page_list(&vmemmap_pages);
 			INIT_LIST_HEAD(&vmemmap_pages);
 			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
@@ -667,7 +698,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 		}
 	}
 
-	flush_tlb_all();
+	vmemmap_flush_tlb_all();
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
  2024-01-13  9:44 ` Nanyong Sun
@ 2024-01-13  9:44   ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
BBM(break-before-make) logic when change the page table of vmemmap
address, they will under the init_mm.page_table_lock.
If a translation fault of vmemmap address concurrently happened after
pte/pmd cleared, vmemmap page fault handler will acquire the
init_mm.page_table_lock to wait for vmemmap update to complete,
by then the virtual address is valid again, so PF can return and
access can continue.
In other case, do the traditional kernel fault.

Implement vmemmap_flush_tlb_all/range on arm64 with nothing
to do because tlb already flushed in every single BBM.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 arch/arm64/include/asm/esr.h      |  4 ++
 arch/arm64/include/asm/pgtable.h  |  8 ++++
 arch/arm64/include/asm/tlbflush.h | 16 +++++++
 arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
 arch/arm64/mm/mmu.c               | 28 +++++++++++
 5 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index ae35939f395b..1c63256efd25 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -116,6 +116,10 @@
 #define ESR_ELx_FSC_SERROR	(0x11)
 #define ESR_ELx_FSC_ACCESS	(0x08)
 #define ESR_ELx_FSC_FAULT	(0x04)
+#define ESR_ELx_FSC_FAULT_L0    (0x04)
+#define ESR_ELx_FSC_FAULT_L1    (0x05)
+#define ESR_ELx_FSC_FAULT_L2    (0x06)
+#define ESR_ELx_FSC_FAULT_L3    (0x07)
 #define ESR_ELx_FSC_PERM	(0x0C)
 #define ESR_ELx_FSC_SEA_TTW0	(0x14)
 #define ESR_ELx_FSC_SEA_TTW1	(0x15)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..b50270107e2f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1124,6 +1124,14 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep);
+#define vmemmap_update_pmd vmemmap_update_pmd
+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
+#define vmemmap_update_pte vmemmap_update_pte
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 1deb5d789c2e..79e932a1bdf8 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -504,6 +504,22 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 	dsb(ish);
 	isb();
 }
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+static inline void vmemmap_flush_tlb_all(void)
+{
+	/* do nothing, already flushed tlb in every single BBM */
+}
+#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
+
+static inline void vmemmap_flush_tlb_range(unsigned long start,
+					   unsigned long end)
+{
+	/* do nothing, already flushed tlb in every single BBM */
+}
+#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
+#endif
+
 #endif
 
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..13189322a38f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -368,6 +368,75 @@ static bool is_el1_mte_sync_tag_check_fault(unsigned long esr)
 	return false;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+static inline bool vmemmap_fault_may_fixup(unsigned long addr,
+					   unsigned long esr)
+{
+	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
+		return false;
+
+	/*
+	 * Only try to handle translation fault level 2 or level 3,
+	 * because hugetlb vmemmap optimize only clear pmd or pte.
+	 */
+	switch (esr & ESR_ELx_FSC) {
+	case ESR_ELx_FSC_FAULT_L2:
+	case ESR_ELx_FSC_FAULT_L3:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * PMD mapped vmemmap should has been split as PTE mapped
+ * by HVO now, here we only check this case, other cases
+ * should fail.
+ * Also should check the addr is healthy enough that will not cause
+ * a level2 or level3 translation fault again after page fault
+ * handled with success, so we need check both bits[1:0] of PMD and
+ * PTE as ARM Spec mentioned below:
+ * A Translation fault is generated if bits[1:0] of a translation
+ * table descriptor identify the descriptor as either a Fault
+ * encoding or a reserved encoding.
+ */
+static inline bool vmemmap_addr_healthy(unsigned long addr)
+{
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	pmdp = pmd_off_k(addr);
+	pmd = pmdp_get(pmdp);
+	if (!pmd_table(pmd))
+		return false;
+
+	ptep = pte_offset_kernel(pmdp, addr);
+	pte = ptep_get(ptep);
+	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
+}
+
+static bool vmemmap_handle_page_fault(unsigned long addr,
+				      unsigned long esr)
+{
+	bool ret;
+
+	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
+		return false;
+
+	spin_lock(&init_mm.page_table_lock);
+	ret = vmemmap_addr_healthy(addr);
+	spin_unlock(&init_mm.page_table_lock);
+
+	return ret;
+}
+#else
+static inline bool vmemmap_handle_page_fault(unsigned long addr,
+					     unsigned long esr)
+{
+	return false;
+}
+#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
+
 static bool is_translation_fault(unsigned long esr)
 {
 	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
@@ -405,9 +474,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
 	} else if (addr < PAGE_SIZE) {
 		msg = "NULL pointer dereference";
 	} else {
-		if (is_translation_fault(esr) &&
-		    kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
-			return;
+		if (is_translation_fault(esr)) {
+			if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
+				return;
+			if (vmemmap_handle_page_fault(addr, esr))
+				return;
+		}
 
 		msg = "paging request";
 	}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1ac7467d34c9..d794b2f4b5a3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1146,6 +1146,34 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 	return 1;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+/*
+ * In the window between the page table entry is cleared and filled
+ * with a new value, other threads have the opportunity to concurrently
+ * access the vmemmap area then page translation fault occur.
+ * Therefore, we need to ensure that the init_mm.page_table_lock is held
+ * to synchronize the vmemmap page fault handling which will wait for
+ * this lock to be released to ensure that the page table entry has been
+ * refreshed with a new valid value.
+ */
+void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
+{
+	lockdep_assert_held(&init_mm.page_table_lock);
+	pmd_clear(pmdp);
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+	pmd_populate_kernel(&init_mm, pmdp, ptep);
+}
+
+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+	spin_unlock(&init_mm.page_table_lock);
+}
+#endif
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
@ 2024-01-13  9:44   ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
BBM(break-before-make) logic when change the page table of vmemmap
address, they will under the init_mm.page_table_lock.
If a translation fault of vmemmap address concurrently happened after
pte/pmd cleared, vmemmap page fault handler will acquire the
init_mm.page_table_lock to wait for vmemmap update to complete,
by then the virtual address is valid again, so PF can return and
access can continue.
In other case, do the traditional kernel fault.

Implement vmemmap_flush_tlb_all/range on arm64 with nothing
to do because tlb already flushed in every single BBM.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
---
 arch/arm64/include/asm/esr.h      |  4 ++
 arch/arm64/include/asm/pgtable.h  |  8 ++++
 arch/arm64/include/asm/tlbflush.h | 16 +++++++
 arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
 arch/arm64/mm/mmu.c               | 28 +++++++++++
 5 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index ae35939f395b..1c63256efd25 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -116,6 +116,10 @@
 #define ESR_ELx_FSC_SERROR	(0x11)
 #define ESR_ELx_FSC_ACCESS	(0x08)
 #define ESR_ELx_FSC_FAULT	(0x04)
+#define ESR_ELx_FSC_FAULT_L0    (0x04)
+#define ESR_ELx_FSC_FAULT_L1    (0x05)
+#define ESR_ELx_FSC_FAULT_L2    (0x06)
+#define ESR_ELx_FSC_FAULT_L3    (0x07)
 #define ESR_ELx_FSC_PERM	(0x0C)
 #define ESR_ELx_FSC_SEA_TTW0	(0x14)
 #define ESR_ELx_FSC_SEA_TTW1	(0x15)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..b50270107e2f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1124,6 +1124,14 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep);
+#define vmemmap_update_pmd vmemmap_update_pmd
+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
+#define vmemmap_update_pte vmemmap_update_pte
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 1deb5d789c2e..79e932a1bdf8 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -504,6 +504,22 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 	dsb(ish);
 	isb();
 }
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+static inline void vmemmap_flush_tlb_all(void)
+{
+	/* do nothing, already flushed tlb in every single BBM */
+}
+#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
+
+static inline void vmemmap_flush_tlb_range(unsigned long start,
+					   unsigned long end)
+{
+	/* do nothing, already flushed tlb in every single BBM */
+}
+#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
+#endif
+
 #endif
 
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..13189322a38f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -368,6 +368,75 @@ static bool is_el1_mte_sync_tag_check_fault(unsigned long esr)
 	return false;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+static inline bool vmemmap_fault_may_fixup(unsigned long addr,
+					   unsigned long esr)
+{
+	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
+		return false;
+
+	/*
+	 * Only try to handle translation fault level 2 or level 3,
+	 * because hugetlb vmemmap optimize only clear pmd or pte.
+	 */
+	switch (esr & ESR_ELx_FSC) {
+	case ESR_ELx_FSC_FAULT_L2:
+	case ESR_ELx_FSC_FAULT_L3:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * PMD mapped vmemmap should has been split as PTE mapped
+ * by HVO now, here we only check this case, other cases
+ * should fail.
+ * Also should check the addr is healthy enough that will not cause
+ * a level2 or level3 translation fault again after page fault
+ * handled with success, so we need check both bits[1:0] of PMD and
+ * PTE as ARM Spec mentioned below:
+ * A Translation fault is generated if bits[1:0] of a translation
+ * table descriptor identify the descriptor as either a Fault
+ * encoding or a reserved encoding.
+ */
+static inline bool vmemmap_addr_healthy(unsigned long addr)
+{
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	pmdp = pmd_off_k(addr);
+	pmd = pmdp_get(pmdp);
+	if (!pmd_table(pmd))
+		return false;
+
+	ptep = pte_offset_kernel(pmdp, addr);
+	pte = ptep_get(ptep);
+	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
+}
+
+static bool vmemmap_handle_page_fault(unsigned long addr,
+				      unsigned long esr)
+{
+	bool ret;
+
+	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
+		return false;
+
+	spin_lock(&init_mm.page_table_lock);
+	ret = vmemmap_addr_healthy(addr);
+	spin_unlock(&init_mm.page_table_lock);
+
+	return ret;
+}
+#else
+static inline bool vmemmap_handle_page_fault(unsigned long addr,
+					     unsigned long esr)
+{
+	return false;
+}
+#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
+
 static bool is_translation_fault(unsigned long esr)
 {
 	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
@@ -405,9 +474,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
 	} else if (addr < PAGE_SIZE) {
 		msg = "NULL pointer dereference";
 	} else {
-		if (is_translation_fault(esr) &&
-		    kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
-			return;
+		if (is_translation_fault(esr)) {
+			if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
+				return;
+			if (vmemmap_handle_page_fault(addr, esr))
+				return;
+		}
 
 		msg = "paging request";
 	}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1ac7467d34c9..d794b2f4b5a3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1146,6 +1146,34 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 	return 1;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+/*
+ * In the window between the page table entry is cleared and filled
+ * with a new value, other threads have the opportunity to concurrently
+ * access the vmemmap area then page translation fault occur.
+ * Therefore, we need to ensure that the init_mm.page_table_lock is held
+ * to synchronize the vmemmap page fault handling which will wait for
+ * this lock to be released to ensure that the page table entry has been
+ * refreshed with a new valid value.
+ */
+void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
+{
+	lockdep_assert_held(&init_mm.page_table_lock);
+	pmd_clear(pmdp);
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+	pmd_populate_kernel(&init_mm, pmdp, ptep);
+}
+
+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+	spin_unlock(&init_mm.page_table_lock);
+}
+#endif
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 3/3] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP
  2024-01-13  9:44 ` Nanyong Sun
@ 2024-01-13  9:44   ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Now update of vmemmap page table can follow the rule of
break-before-make safely for arm64 architecture, re-enable
HVO on arm64.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8f6cf1221b6a..bd37a7ed32f2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -104,6 +104,7 @@ config ARM64
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 3/3] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP
@ 2024-01-13  9:44   ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-13  9:44 UTC (permalink / raw)
  To: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual
  Cc: willy, wangkefeng.wang, sunnanyong, linux-arm-kernel,
	linux-kernel, linux-mm

Now update of vmemmap page table can follow the rule of
break-before-make safely for arm64 architecture, re-enable
HVO on arm64.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8f6cf1221b6a..bd37a7ed32f2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -104,6 +104,7 @@ config ARM64
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
  2024-01-13  9:44   ` Nanyong Sun
@ 2024-01-15  2:38     ` Muchun Song
  -1 siblings, 0 replies; 50+ messages in thread
From: Muchun Song @ 2024-01-15  2:38 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, Will Deacon, Mike Kravetz, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm



> On Jan 13, 2024, at 17:44, Nanyong Sun <sunnanyong@huawei.com> wrote:
> 
> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
> BBM(break-before-make) logic when change the page table of vmemmap
> address, they will under the init_mm.page_table_lock.
> If a translation fault of vmemmap address concurrently happened after
> pte/pmd cleared, vmemmap page fault handler will acquire the
> init_mm.page_table_lock to wait for vmemmap update to complete,
> by then the virtual address is valid again, so PF can return and
> access can continue.
> In other case, do the traditional kernel fault.
> 
> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
> to do because tlb already flushed in every single BBM.
> 
> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
@ 2024-01-15  2:38     ` Muchun Song
  0 siblings, 0 replies; 50+ messages in thread
From: Muchun Song @ 2024-01-15  2:38 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, Will Deacon, Mike Kravetz, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm



> On Jan 13, 2024, at 17:44, Nanyong Sun <sunnanyong@huawei.com> wrote:
> 
> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
> BBM(break-before-make) logic when change the page table of vmemmap
> address, they will under the init_mm.page_table_lock.
> If a translation fault of vmemmap address concurrently happened after
> pte/pmd cleared, vmemmap page fault handler will acquire the
> init_mm.page_table_lock to wait for vmemmap update to complete,
> by then the virtual address is valid again, so PF can return and
> access can continue.
> In other case, do the traditional kernel fault.
> 
> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
> to do because tlb already flushed in every single BBM.
> 
> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-01-13  9:44 ` Nanyong Sun
@ 2024-01-25 18:06   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-01-25 18:06 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm

On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> HVO was previously disabled on arm64 [1] due to the lack of necessary
> BBM(break-before-make) logic when changing page tables.
> This set of patches fix this by adding necessary BBM sequence when
> changing page table, and supporting vmemmap page fault handling to
> fixup kernel address translation fault if vmemmap is concurrently accessed.

I'm not keen on this approach. I'm not even sure it's safe. In the
second patch, you take the init_mm.page_table_lock on the fault path but
are we sure this is unlocked when the fault was taken? Basically you can
get a fault anywhere something accesses a struct page.

How often is this code path called? I wonder whether a stop_machine()
approach would be simpler.

Andrew, I'd suggest we drop these patches from the mm tree for the time
being. They haven't received much review from the arm64 folk. Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-01-25 18:06   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-01-25 18:06 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm

On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> HVO was previously disabled on arm64 [1] due to the lack of necessary
> BBM(break-before-make) logic when changing page tables.
> This set of patches fix this by adding necessary BBM sequence when
> changing page table, and supporting vmemmap page fault handling to
> fixup kernel address translation fault if vmemmap is concurrently accessed.

I'm not keen on this approach. I'm not even sure it's safe. In the
second patch, you take the init_mm.page_table_lock on the fault path but
are we sure this is unlocked when the fault was taken? Basically you can
get a fault anywhere something accesses a struct page.

How often is this code path called? I wonder whether a stop_machine()
approach would be simpler.

Andrew, I'd suggest we drop these patches from the mm tree for the time
being. They haven't received much review from the arm64 folk. Thanks.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-01-25 18:06   ` Catalin Marinas
@ 2024-01-27  5:04     ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-27  5:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm


On 2024/1/26 2:06, Catalin Marinas wrote:
> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
>> HVO was previously disabled on arm64 [1] due to the lack of necessary
>> BBM(break-before-make) logic when changing page tables.
>> This set of patches fix this by adding necessary BBM sequence when
>> changing page table, and supporting vmemmap page fault handling to
>> fixup kernel address translation fault if vmemmap is concurrently accessed.
> I'm not keen on this approach. I'm not even sure it's safe. In the
> second patch, you take the init_mm.page_table_lock on the fault path but
> are we sure this is unlocked when the fault was taken?
I think this situation is impossible. In the implementation of the 
second patch, when the page table is being corrupted
(the time window when a page fault may occur), vmemmap_update_pte() 
already holds the init_mm.page_table_lock,
and unlock it until page table update is done.Another thread could not 
hold the init_mm.page_table_lock and
also trigger a page fault at the same time.
If I have missed any points in my thinking, please correct me. Thank you.
> Basically you can
> get a fault anywhere something accesses a struct page.
>
> How often is this code path called? I wonder whether a stop_machine()
> approach would be simpler.
As long as allocating or releasing hugetlb is called.  We cannot limit 
users to only allocate or release hugetlb
when booting or not running any workload on all other cpus, so if use 
stop_machine(), it will be triggered
8 times every 2M and 4096 times every 1G, which is probably too expensive.
I saw that on the X86, in order to improve performance, optimizations 
such as batch tlb flushing have been done,
means that some users are concerned about the performance of hugetlb 
allocation:
https://lwn.net/ml/linux-kernel/20230905214412.89152-1-mike.kravetz@oracle.com/ 

> Andrew, I'd suggest we drop these patches from the mm tree for the time
> being. They haven't received much review from the arm64 folk. Thanks.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-01-27  5:04     ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-01-27  5:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm


On 2024/1/26 2:06, Catalin Marinas wrote:
> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
>> HVO was previously disabled on arm64 [1] due to the lack of necessary
>> BBM(break-before-make) logic when changing page tables.
>> This set of patches fix this by adding necessary BBM sequence when
>> changing page table, and supporting vmemmap page fault handling to
>> fixup kernel address translation fault if vmemmap is concurrently accessed.
> I'm not keen on this approach. I'm not even sure it's safe. In the
> second patch, you take the init_mm.page_table_lock on the fault path but
> are we sure this is unlocked when the fault was taken?
I think this situation is impossible. In the implementation of the 
second patch, when the page table is being corrupted
(the time window when a page fault may occur), vmemmap_update_pte() 
already holds the init_mm.page_table_lock,
and unlock it until page table update is done.Another thread could not 
hold the init_mm.page_table_lock and
also trigger a page fault at the same time.
If I have missed any points in my thinking, please correct me. Thank you.
> Basically you can
> get a fault anywhere something accesses a struct page.
>
> How often is this code path called? I wonder whether a stop_machine()
> approach would be simpler.
As long as allocating or releasing hugetlb is called.  We cannot limit 
users to only allocate or release hugetlb
when booting or not running any workload on all other cpus, so if use 
stop_machine(), it will be triggered
8 times every 2M and 4096 times every 1G, which is probably too expensive.
I saw that on the X86, in order to improve performance, optimizations 
such as batch tlb flushing have been done,
means that some users are concerned about the performance of hugetlb 
allocation:
https://lwn.net/ml/linux-kernel/20230905214412.89152-1-mike.kravetz@oracle.com/ 

> Andrew, I'd suggest we drop these patches from the mm tree for the time
> being. They haven't received much review from the arm64 folk. Thanks.
>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-01-27  5:04     ` Nanyong Sun
@ 2024-02-07 11:12       ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-07 11:12 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> 
> On 2024/1/26 2:06, Catalin Marinas wrote:
> > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > BBM(break-before-make) logic when changing page tables.
> > > This set of patches fix this by adding necessary BBM sequence when
> > > changing page table, and supporting vmemmap page fault handling to
> > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > I'm not keen on this approach. I'm not even sure it's safe. In the
> > second patch, you take the init_mm.page_table_lock on the fault path but
> > are we sure this is unlocked when the fault was taken?
> I think this situation is impossible. In the implementation of the second
> patch, when the page table is being corrupted
> (the time window when a page fault may occur), vmemmap_update_pte() already
> holds the init_mm.page_table_lock,
> and unlock it until page table update is done.Another thread could not hold
> the init_mm.page_table_lock and
> also trigger a page fault at the same time.
> If I have missed any points in my thinking, please correct me. Thank you.

It still strikes me as incredibly fragile to handle the fault and trying
to reason about all the users of 'struct page' is impossible. For example,
can the fault happen from irq context?

If we want to optimise the vmemmap mapping for arm64, I think we need to
consider approaches which avoid the possibility of the fault altogether.
It's more complicated to implement, but I think it would be a lot more
robust.

Andrew -- please can you drop these from -next?

Thanks,

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 11:12       ` Will Deacon
  0 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-07 11:12 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> 
> On 2024/1/26 2:06, Catalin Marinas wrote:
> > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > BBM(break-before-make) logic when changing page tables.
> > > This set of patches fix this by adding necessary BBM sequence when
> > > changing page table, and supporting vmemmap page fault handling to
> > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > I'm not keen on this approach. I'm not even sure it's safe. In the
> > second patch, you take the init_mm.page_table_lock on the fault path but
> > are we sure this is unlocked when the fault was taken?
> I think this situation is impossible. In the implementation of the second
> patch, when the page table is being corrupted
> (the time window when a page fault may occur), vmemmap_update_pte() already
> holds the init_mm.page_table_lock,
> and unlock it until page table update is done.Another thread could not hold
> the init_mm.page_table_lock and
> also trigger a page fault at the same time.
> If I have missed any points in my thinking, please correct me. Thank you.

It still strikes me as incredibly fragile to handle the fault and trying
to reason about all the users of 'struct page' is impossible. For example,
can the fault happen from irq context?

If we want to optimise the vmemmap mapping for arm64, I think we need to
consider approaches which avoid the possibility of the fault altogether.
It's more complicated to implement, but I think it would be a lot more
robust.

Andrew -- please can you drop these from -next?

Thanks,

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 11:12       ` Will Deacon
@ 2024-02-07 11:21         ` Matthew Wilcox
  -1 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-07 11:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > 
> > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > BBM(break-before-make) logic when changing page tables.
> > > > This set of patches fix this by adding necessary BBM sequence when
> > > > changing page table, and supporting vmemmap page fault handling to
> > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > are we sure this is unlocked when the fault was taken?
> > I think this situation is impossible. In the implementation of the second
> > patch, when the page table is being corrupted
> > (the time window when a page fault may occur), vmemmap_update_pte() already
> > holds the init_mm.page_table_lock,
> > and unlock it until page table update is done.Another thread could not hold
> > the init_mm.page_table_lock and
> > also trigger a page fault at the same time.
> > If I have missed any points in my thinking, please correct me. Thank you.
> 
> It still strikes me as incredibly fragile to handle the fault and trying
> to reason about all the users of 'struct page' is impossible. For example,
> can the fault happen from irq context?

The pte lock cannot be taken in irq context (which I think is what
you're asking?)  While it is not possible to reason about all users of
struct page, we are somewhat relieved of that work by noting that this is
only for hugetlbfs, so we don't need to reason about slab, page tables,
netmem or zsmalloc.

> If we want to optimise the vmemmap mapping for arm64, I think we need to
> consider approaches which avoid the possibility of the fault altogether.
> It's more complicated to implement, but I think it would be a lot more
> robust.
> 
> Andrew -- please can you drop these from -next?
> 
> Thanks,
> 
> Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 11:21         ` Matthew Wilcox
  0 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-07 11:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > 
> > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > BBM(break-before-make) logic when changing page tables.
> > > > This set of patches fix this by adding necessary BBM sequence when
> > > > changing page table, and supporting vmemmap page fault handling to
> > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > are we sure this is unlocked when the fault was taken?
> > I think this situation is impossible. In the implementation of the second
> > patch, when the page table is being corrupted
> > (the time window when a page fault may occur), vmemmap_update_pte() already
> > holds the init_mm.page_table_lock,
> > and unlock it until page table update is done.Another thread could not hold
> > the init_mm.page_table_lock and
> > also trigger a page fault at the same time.
> > If I have missed any points in my thinking, please correct me. Thank you.
> 
> It still strikes me as incredibly fragile to handle the fault and trying
> to reason about all the users of 'struct page' is impossible. For example,
> can the fault happen from irq context?

The pte lock cannot be taken in irq context (which I think is what
you're asking?)  While it is not possible to reason about all users of
struct page, we are somewhat relieved of that work by noting that this is
only for hugetlbfs, so we don't need to reason about slab, page tables,
netmem or zsmalloc.

> If we want to optimise the vmemmap mapping for arm64, I think we need to
> consider approaches which avoid the possibility of the fault altogether.
> It's more complicated to implement, but I think it would be a lot more
> robust.
> 
> Andrew -- please can you drop these from -next?
> 
> Thanks,
> 
> Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 11:21         ` Matthew Wilcox
@ 2024-02-07 12:11           ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-07 12:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nanyong Sun, Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > 
> > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > BBM(break-before-make) logic when changing page tables.
> > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > are we sure this is unlocked when the fault was taken?
> > > I think this situation is impossible. In the implementation of the second
> > > patch, when the page table is being corrupted
> > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > holds the init_mm.page_table_lock,
> > > and unlock it until page table update is done.Another thread could not hold
> > > the init_mm.page_table_lock and
> > > also trigger a page fault at the same time.
> > > If I have missed any points in my thinking, please correct me. Thank you.
> > 
> > It still strikes me as incredibly fragile to handle the fault and trying
> > to reason about all the users of 'struct page' is impossible. For example,
> > can the fault happen from irq context?
> 
> The pte lock cannot be taken in irq context (which I think is what
> you're asking?)  While it is not possible to reason about all users of
> struct page, we are somewhat relieved of that work by noting that this is
> only for hugetlbfs, so we don't need to reason about slab, page tables,
> netmem or zsmalloc.

My concern is that an interrupt handler tries to access a 'struct page'
which faults due to another core splitting a pmd mapping for the vmemmap.
In this case, I think we'll end up trying to resolve the fault from irq
context, which will try to take the spinlock.

Avoiding the fault would make this considerably more robust and the
architecture has introduced features to avoid break-before-make in some
circumstances (see FEAT_BBM and its levels), so having this optimisation
conditional on that would seem to be a better approach in my opinion.

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 12:11           ` Will Deacon
  0 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-07 12:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nanyong Sun, Catalin Marinas, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > 
> > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > BBM(break-before-make) logic when changing page tables.
> > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > are we sure this is unlocked when the fault was taken?
> > > I think this situation is impossible. In the implementation of the second
> > > patch, when the page table is being corrupted
> > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > holds the init_mm.page_table_lock,
> > > and unlock it until page table update is done.Another thread could not hold
> > > the init_mm.page_table_lock and
> > > also trigger a page fault at the same time.
> > > If I have missed any points in my thinking, please correct me. Thank you.
> > 
> > It still strikes me as incredibly fragile to handle the fault and trying
> > to reason about all the users of 'struct page' is impossible. For example,
> > can the fault happen from irq context?
> 
> The pte lock cannot be taken in irq context (which I think is what
> you're asking?)  While it is not possible to reason about all users of
> struct page, we are somewhat relieved of that work by noting that this is
> only for hugetlbfs, so we don't need to reason about slab, page tables,
> netmem or zsmalloc.

My concern is that an interrupt handler tries to access a 'struct page'
which faults due to another core splitting a pmd mapping for the vmemmap.
In this case, I think we'll end up trying to resolve the fault from irq
context, which will try to take the spinlock.

Avoiding the fault would make this considerably more robust and the
architecture has introduced features to avoid break-before-make in some
circumstances (see FEAT_BBM and its levels), so having this optimisation
conditional on that would seem to be a better approach in my opinion.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 11:21         ` Matthew Wilcox
@ 2024-02-07 12:20           ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-02-07 12:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Will Deacon, Nanyong Sun, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > BBM(break-before-make) logic when changing page tables.
> > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > are we sure this is unlocked when the fault was taken?
> > > I think this situation is impossible. In the implementation of the second
> > > patch, when the page table is being corrupted
> > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > holds the init_mm.page_table_lock,
> > > and unlock it until page table update is done.Another thread could not hold
> > > the init_mm.page_table_lock and
> > > also trigger a page fault at the same time.
> > > If I have missed any points in my thinking, please correct me. Thank you.
> > 
> > It still strikes me as incredibly fragile to handle the fault and trying
> > to reason about all the users of 'struct page' is impossible. For example,
> > can the fault happen from irq context?
> 
> The pte lock cannot be taken in irq context (which I think is what
> you're asking?)

With this patchset, I think it can: IRQ -> interrupt handler accesses
vmemmap -> faults -> fault handler in patch 2 takes the
init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
such fault in any kernel context looks fragile.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 12:20           ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-02-07 12:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Will Deacon, Nanyong Sun, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > BBM(break-before-make) logic when changing page tables.
> > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > are we sure this is unlocked when the fault was taken?
> > > I think this situation is impossible. In the implementation of the second
> > > patch, when the page table is being corrupted
> > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > holds the init_mm.page_table_lock,
> > > and unlock it until page table update is done.Another thread could not hold
> > > the init_mm.page_table_lock and
> > > also trigger a page fault at the same time.
> > > If I have missed any points in my thinking, please correct me. Thank you.
> > 
> > It still strikes me as incredibly fragile to handle the fault and trying
> > to reason about all the users of 'struct page' is impossible. For example,
> > can the fault happen from irq context?
> 
> The pte lock cannot be taken in irq context (which I think is what
> you're asking?)

With this patchset, I think it can: IRQ -> interrupt handler accesses
vmemmap -> faults -> fault handler in patch 2 takes the
init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
such fault in any kernel context looks fragile.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
  2024-01-13  9:44   ` Nanyong Sun
@ 2024-02-07 12:21     ` Mark Rutland
  -1 siblings, 0 replies; 50+ messages in thread
From: Mark Rutland @ 2024-02-07 12:21 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Sat, Jan 13, 2024 at 05:44:35PM +0800, Nanyong Sun wrote:
> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
> BBM(break-before-make) logic when change the page table of vmemmap
> address, they will under the init_mm.page_table_lock.
> If a translation fault of vmemmap address concurrently happened after
> pte/pmd cleared, vmemmap page fault handler will acquire the
> init_mm.page_table_lock to wait for vmemmap update to complete,
> by then the virtual address is valid again, so PF can return and
> access can continue.

Is that wait bounded? ... and is it event guaranteed to make progress?

Under a hypervisor, the vCPU doing the BBM could be preempted between the break
and the make, so the thread waiting might be waiting a long time for that to
come back and finish the make step.

Further, under PREEMPT_RT regular spinlocks don't inhibit preemption, and I suspect
that means this can deadlock on RT -- the thread doing the BBM could be
preempted, the newly-scheduled thread could try to access the vmemmap, and then
get stuck in the fault handler (e.g. on a single CPU system). There's nothing
below describing how that's prevented.

I've concerned this may be subtly broken, and it feels like this is going to be
*very* painful to maintain and test. IMO this is trying to be overly clever and
I'd much rather that we avoided the transient broken step. On CPUs with
FEAT_BBM level 2 we can avoid that broken step, can we make this depend
dynamically on whether the CPU has FEAT_BBM?

> In other case, do the traditional kernel fault.
> 
> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
> to do because tlb already flushed in every single BBM.
> 
> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
> ---
>  arch/arm64/include/asm/esr.h      |  4 ++
>  arch/arm64/include/asm/pgtable.h  |  8 ++++
>  arch/arm64/include/asm/tlbflush.h | 16 +++++++
>  arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
>  arch/arm64/mm/mmu.c               | 28 +++++++++++
>  5 files changed, 131 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index ae35939f395b..1c63256efd25 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -116,6 +116,10 @@
>  #define ESR_ELx_FSC_SERROR	(0x11)
>  #define ESR_ELx_FSC_ACCESS	(0x08)
>  #define ESR_ELx_FSC_FAULT	(0x04)
> +#define ESR_ELx_FSC_FAULT_L0    (0x04)
> +#define ESR_ELx_FSC_FAULT_L1    (0x05)
> +#define ESR_ELx_FSC_FAULT_L2    (0x06)
> +#define ESR_ELx_FSC_FAULT_L3    (0x07)
>  #define ESR_ELx_FSC_PERM	(0x0C)
>  #define ESR_ELx_FSC_SEA_TTW0	(0x14)
>  #define ESR_ELx_FSC_SEA_TTW1	(0x15)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..b50270107e2f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1124,6 +1124,14 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
> +
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep);
> +#define vmemmap_update_pmd vmemmap_update_pmd
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
> +#define vmemmap_update_pte vmemmap_update_pte
> +#endif
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 1deb5d789c2e..79e932a1bdf8 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -504,6 +504,22 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>  	dsb(ish);
>  	isb();
>  }
> +
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +static inline void vmemmap_flush_tlb_all(void)
> +{
> +	/* do nothing, already flushed tlb in every single BBM */
> +}
> +#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
> +
> +static inline void vmemmap_flush_tlb_range(unsigned long start,
> +					   unsigned long end)
> +{
> +	/* do nothing, already flushed tlb in every single BBM */
> +}
> +#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
> +#endif
> +
>  #endif
>  
>  #endif
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..13189322a38f 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -368,6 +368,75 @@ static bool is_el1_mte_sync_tag_check_fault(unsigned long esr)
>  	return false;
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +static inline bool vmemmap_fault_may_fixup(unsigned long addr,
> +					   unsigned long esr)
> +{
> +	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
> +		return false;
> +
> +	/*
> +	 * Only try to handle translation fault level 2 or level 3,
> +	 * because hugetlb vmemmap optimize only clear pmd or pte.
> +	 */
> +	switch (esr & ESR_ELx_FSC) {
> +	case ESR_ELx_FSC_FAULT_L2:
> +	case ESR_ELx_FSC_FAULT_L3:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +/*
> + * PMD mapped vmemmap should has been split as PTE mapped
> + * by HVO now, here we only check this case, other cases
> + * should fail.

Sorry, I can't parse what this is trying to say.

> + * Also should check the addr is healthy enough that will not cause
> + * a level2 or level3 translation fault again after page fault
> + * handled with success, so we need check both bits[1:0] of PMD and
> + * PTE as ARM Spec mentioned below:

Which spec? Iassume you mean the ARM ARM? Are you referring to a specific part
within that?

> + * A Translation fault is generated if bits[1:0] of a translation
> + * table descriptor identify the descriptor as either a Fault
> + * encoding or a reserved encoding.
> + */
> +static inline bool vmemmap_addr_healthy(unsigned long addr)
> +{
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
> +
> +	pmdp = pmd_off_k(addr);
> +	pmd = pmdp_get(pmdp);
> +	if (!pmd_table(pmd))
> +		return false;

Is a block (i.e. hugetlb) entry not considered healthy? I thought the whole
point of this optimization was that you'd use a block PMD entry?

> +
> +	ptep = pte_offset_kernel(pmdp, addr);
> +	pte = ptep_get(ptep);
> +	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
> +}
> +
> +static bool vmemmap_handle_page_fault(unsigned long addr,
> +				      unsigned long esr)
> +{
> +	bool ret;
> +
> +	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
> +		return false;
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	ret = vmemmap_addr_healthy(addr);
> +	spin_unlock(&init_mm.page_table_lock);

As above, I'm pretty sure this is only safe is the code doing the BBM has IRQs
disabled, otherwise the thread can be preempted, and we can get stuck in here
while the entry is broken.

So at minimum this needs some explanation of why that doesn't happen in a
comment.

> +
> +	return ret;
> +}
> +#else
> +static inline bool vmemmap_handle_page_fault(unsigned long addr,
> +					     unsigned long esr)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
> +
>  static bool is_translation_fault(unsigned long esr)
>  {
>  	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
> @@ -405,9 +474,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
>  	} else if (addr < PAGE_SIZE) {
>  		msg = "NULL pointer dereference";
>  	} else {
> -		if (is_translation_fault(esr) &&
> -		    kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> -			return;
> +		if (is_translation_fault(esr)) {
> +			if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> +				return;
> +			if (vmemmap_handle_page_fault(addr, esr))
> +				return;
> +		}
>  
>  		msg = "paging request";
>  	}
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 1ac7467d34c9..d794b2f4b5a3 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1146,6 +1146,34 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  	return 1;
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +/*
> + * In the window between the page table entry is cleared and filled
> + * with a new value, other threads have the opportunity to concurrently
> + * access the vmemmap area then page translation fault occur.
> + * Therefore, we need to ensure that the init_mm.page_table_lock is held
> + * to synchronize the vmemmap page fault handling which will wait for
> + * this lock to be released to ensure that the page table entry has been
> + * refreshed with a new valid value.
> + */
> +void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
> +{
> +	lockdep_assert_held(&init_mm.page_table_lock);
> +	pmd_clear(pmdp);
> +	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
> +	pmd_populate_kernel(&init_mm, pmdp, ptep);
> +}
> +
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> +	spin_lock(&init_mm.page_table_lock);
> +	pte_clear(&init_mm, addr, ptep);
> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +	set_pte_at(&init_mm, addr, ptep, pte);
> +	spin_unlock(&init_mm.page_table_lock);
> +}

As above, if this happens with IRQs unmasked, the thread can potentially be
preempted and we can get stuck in the fault handler (at least on RT).

I can't tell whether this is safe, and I think that at minimum this needs
comments and/or lockdep assertions, but I'd much rather we didn't try to play
this sort of game.

Mark.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
@ 2024-02-07 12:21     ` Mark Rutland
  0 siblings, 0 replies; 50+ messages in thread
From: Mark Rutland @ 2024-02-07 12:21 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Sat, Jan 13, 2024 at 05:44:35PM +0800, Nanyong Sun wrote:
> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
> BBM(break-before-make) logic when change the page table of vmemmap
> address, they will under the init_mm.page_table_lock.
> If a translation fault of vmemmap address concurrently happened after
> pte/pmd cleared, vmemmap page fault handler will acquire the
> init_mm.page_table_lock to wait for vmemmap update to complete,
> by then the virtual address is valid again, so PF can return and
> access can continue.

Is that wait bounded? ... and is it event guaranteed to make progress?

Under a hypervisor, the vCPU doing the BBM could be preempted between the break
and the make, so the thread waiting might be waiting a long time for that to
come back and finish the make step.

Further, under PREEMPT_RT regular spinlocks don't inhibit preemption, and I suspect
that means this can deadlock on RT -- the thread doing the BBM could be
preempted, the newly-scheduled thread could try to access the vmemmap, and then
get stuck in the fault handler (e.g. on a single CPU system). There's nothing
below describing how that's prevented.

I've concerned this may be subtly broken, and it feels like this is going to be
*very* painful to maintain and test. IMO this is trying to be overly clever and
I'd much rather that we avoided the transient broken step. On CPUs with
FEAT_BBM level 2 we can avoid that broken step, can we make this depend
dynamically on whether the CPU has FEAT_BBM?

> In other case, do the traditional kernel fault.
> 
> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
> to do because tlb already flushed in every single BBM.
> 
> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
> ---
>  arch/arm64/include/asm/esr.h      |  4 ++
>  arch/arm64/include/asm/pgtable.h  |  8 ++++
>  arch/arm64/include/asm/tlbflush.h | 16 +++++++
>  arch/arm64/mm/fault.c             | 78 +++++++++++++++++++++++++++++--
>  arch/arm64/mm/mmu.c               | 28 +++++++++++
>  5 files changed, 131 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
> index ae35939f395b..1c63256efd25 100644
> --- a/arch/arm64/include/asm/esr.h
> +++ b/arch/arm64/include/asm/esr.h
> @@ -116,6 +116,10 @@
>  #define ESR_ELx_FSC_SERROR	(0x11)
>  #define ESR_ELx_FSC_ACCESS	(0x08)
>  #define ESR_ELx_FSC_FAULT	(0x04)
> +#define ESR_ELx_FSC_FAULT_L0    (0x04)
> +#define ESR_ELx_FSC_FAULT_L1    (0x05)
> +#define ESR_ELx_FSC_FAULT_L2    (0x06)
> +#define ESR_ELx_FSC_FAULT_L3    (0x07)
>  #define ESR_ELx_FSC_PERM	(0x0C)
>  #define ESR_ELx_FSC_SEA_TTW0	(0x14)
>  #define ESR_ELx_FSC_SEA_TTW1	(0x15)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..b50270107e2f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1124,6 +1124,14 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
> +
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep);
> +#define vmemmap_update_pmd vmemmap_update_pmd
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
> +#define vmemmap_update_pte vmemmap_update_pte
> +#endif
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 1deb5d789c2e..79e932a1bdf8 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -504,6 +504,22 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
>  	dsb(ish);
>  	isb();
>  }
> +
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +static inline void vmemmap_flush_tlb_all(void)
> +{
> +	/* do nothing, already flushed tlb in every single BBM */
> +}
> +#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
> +
> +static inline void vmemmap_flush_tlb_range(unsigned long start,
> +					   unsigned long end)
> +{
> +	/* do nothing, already flushed tlb in every single BBM */
> +}
> +#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
> +#endif
> +
>  #endif
>  
>  #endif
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..13189322a38f 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -368,6 +368,75 @@ static bool is_el1_mte_sync_tag_check_fault(unsigned long esr)
>  	return false;
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +static inline bool vmemmap_fault_may_fixup(unsigned long addr,
> +					   unsigned long esr)
> +{
> +	if (addr < VMEMMAP_START || addr >= VMEMMAP_END)
> +		return false;
> +
> +	/*
> +	 * Only try to handle translation fault level 2 or level 3,
> +	 * because hugetlb vmemmap optimize only clear pmd or pte.
> +	 */
> +	switch (esr & ESR_ELx_FSC) {
> +	case ESR_ELx_FSC_FAULT_L2:
> +	case ESR_ELx_FSC_FAULT_L3:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +/*
> + * PMD mapped vmemmap should has been split as PTE mapped
> + * by HVO now, here we only check this case, other cases
> + * should fail.

Sorry, I can't parse what this is trying to say.

> + * Also should check the addr is healthy enough that will not cause
> + * a level2 or level3 translation fault again after page fault
> + * handled with success, so we need check both bits[1:0] of PMD and
> + * PTE as ARM Spec mentioned below:

Which spec? Iassume you mean the ARM ARM? Are you referring to a specific part
within that?

> + * A Translation fault is generated if bits[1:0] of a translation
> + * table descriptor identify the descriptor as either a Fault
> + * encoding or a reserved encoding.
> + */
> +static inline bool vmemmap_addr_healthy(unsigned long addr)
> +{
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
> +
> +	pmdp = pmd_off_k(addr);
> +	pmd = pmdp_get(pmdp);
> +	if (!pmd_table(pmd))
> +		return false;

Is a block (i.e. hugetlb) entry not considered healthy? I thought the whole
point of this optimization was that you'd use a block PMD entry?

> +
> +	ptep = pte_offset_kernel(pmdp, addr);
> +	pte = ptep_get(ptep);
> +	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
> +}
> +
> +static bool vmemmap_handle_page_fault(unsigned long addr,
> +				      unsigned long esr)
> +{
> +	bool ret;
> +
> +	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
> +		return false;
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	ret = vmemmap_addr_healthy(addr);
> +	spin_unlock(&init_mm.page_table_lock);

As above, I'm pretty sure this is only safe is the code doing the BBM has IRQs
disabled, otherwise the thread can be preempted, and we can get stuck in here
while the entry is broken.

So at minimum this needs some explanation of why that doesn't happen in a
comment.

> +
> +	return ret;
> +}
> +#else
> +static inline bool vmemmap_handle_page_fault(unsigned long addr,
> +					     unsigned long esr)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
> +
>  static bool is_translation_fault(unsigned long esr)
>  {
>  	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
> @@ -405,9 +474,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
>  	} else if (addr < PAGE_SIZE) {
>  		msg = "NULL pointer dereference";
>  	} else {
> -		if (is_translation_fault(esr) &&
> -		    kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> -			return;
> +		if (is_translation_fault(esr)) {
> +			if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> +				return;
> +			if (vmemmap_handle_page_fault(addr, esr))
> +				return;
> +		}
>  
>  		msg = "paging request";
>  	}
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 1ac7467d34c9..d794b2f4b5a3 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1146,6 +1146,34 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  	return 1;
>  }
>  
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +/*
> + * In the window between the page table entry is cleared and filled
> + * with a new value, other threads have the opportunity to concurrently
> + * access the vmemmap area then page translation fault occur.
> + * Therefore, we need to ensure that the init_mm.page_table_lock is held
> + * to synchronize the vmemmap page fault handling which will wait for
> + * this lock to be released to ensure that the page table entry has been
> + * refreshed with a new valid value.
> + */
> +void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
> +{
> +	lockdep_assert_held(&init_mm.page_table_lock);
> +	pmd_clear(pmdp);
> +	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
> +	pmd_populate_kernel(&init_mm, pmdp, ptep);
> +}
> +
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> +	spin_lock(&init_mm.page_table_lock);
> +	pte_clear(&init_mm, addr, ptep);
> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +	set_pte_at(&init_mm, addr, ptep, pte);
> +	spin_unlock(&init_mm.page_table_lock);
> +}

As above, if this happens with IRQs unmasked, the thread can potentially be
preempted and we can get stuck in the fault handler (at least on RT).

I can't tell whether this is safe, and I think that at minimum this needs
comments and/or lockdep assertions, but I'd much rather we didn't try to play
this sort of game.

Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 12:11           ` Will Deacon
@ 2024-02-07 12:24             ` Mark Rutland
  -1 siblings, 0 replies; 50+ messages in thread
From: Mark Rutland @ 2024-02-07 12:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Matthew Wilcox, Nanyong Sun, Catalin Marinas, mike.kravetz,
	muchun.song, akpm, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > > 
> > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > > BBM(break-before-make) logic when changing page tables.
> > > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > are we sure this is unlocked when the fault was taken?
> > > > I think this situation is impossible. In the implementation of the second
> > > > patch, when the page table is being corrupted
> > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > holds the init_mm.page_table_lock,
> > > > and unlock it until page table update is done.Another thread could not hold
> > > > the init_mm.page_table_lock and
> > > > also trigger a page fault at the same time.
> > > > If I have missed any points in my thinking, please correct me. Thank you.
> > > 
> > > It still strikes me as incredibly fragile to handle the fault and trying
> > > to reason about all the users of 'struct page' is impossible. For example,
> > > can the fault happen from irq context?
> > 
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?)  While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
> 
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

I think that (as per my comments on patch 2), a similar deadlock can happen on
RT even if the vmemmap is only accessed in regular process context, and at
minimum this needs better comentary and/or lockdep assertions.

I'd also prefer that we dropped this for now.

> Avoiding the fault would make this considerably more robust and the
> architecture has introduced features to avoid break-before-make in some
> circumstances (see FEAT_BBM and its levels), so having this optimisation
> conditional on that would seem to be a better approach in my opinion.

FWIW, that's my position too.

Mark.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 12:24             ` Mark Rutland
  0 siblings, 0 replies; 50+ messages in thread
From: Mark Rutland @ 2024-02-07 12:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Matthew Wilcox, Nanyong Sun, Catalin Marinas, mike.kravetz,
	muchun.song, akpm, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > > 
> > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > > BBM(break-before-make) logic when changing page tables.
> > > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > are we sure this is unlocked when the fault was taken?
> > > > I think this situation is impossible. In the implementation of the second
> > > > patch, when the page table is being corrupted
> > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > holds the init_mm.page_table_lock,
> > > > and unlock it until page table update is done.Another thread could not hold
> > > > the init_mm.page_table_lock and
> > > > also trigger a page fault at the same time.
> > > > If I have missed any points in my thinking, please correct me. Thank you.
> > > 
> > > It still strikes me as incredibly fragile to handle the fault and trying
> > > to reason about all the users of 'struct page' is impossible. For example,
> > > can the fault happen from irq context?
> > 
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?)  While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
> 
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

I think that (as per my comments on patch 2), a similar deadlock can happen on
RT even if the vmemmap is only accessed in regular process context, and at
minimum this needs better comentary and/or lockdep assertions.

I'd also prefer that we dropped this for now.

> Avoiding the fault would make this considerably more robust and the
> architecture has introduced features to avoid break-before-make in some
> circumstances (see FEAT_BBM and its levels), so having this optimisation
> conditional on that would seem to be a better approach in my opinion.

FWIW, that's my position too.

Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-01-27  5:04     ` Nanyong Sun
@ 2024-02-07 12:44       ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-02-07 12:44 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm

On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> On 2024/1/26 2:06, Catalin Marinas wrote:
> > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > BBM(break-before-make) logic when changing page tables.
> > > This set of patches fix this by adding necessary BBM sequence when
> > > changing page table, and supporting vmemmap page fault handling to
> > > fixup kernel address translation fault if vmemmap is concurrently accessed.
[...]
> > How often is this code path called? I wonder whether a stop_machine()
> > approach would be simpler.
> As long as allocating or releasing hugetlb is called.  We cannot limit users
> to only allocate or release hugetlb
> when booting or not running any workload on all other cpus, so if use
> stop_machine(), it will be triggered
> 8 times every 2M and 4096 times every 1G, which is probably too expensive.

I'm hoping this can be batched somehow and not do a stop_machine() (or
8) for every 2MB huge page.

Just to make sure I understand - is the goal to be able to free struct
pages corresponding to hugetlbfs pages? Can we not leave the vmemmap in
place and just release that memory to the page allocator? The physical
RAM for those struct pages isn't going anywhere, we just have a vmemmap
alias to it (cacheable).

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 12:44       ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2024-02-07 12:44 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: will, mike.kravetz, muchun.song, akpm, anshuman.khandual, willy,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm

On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> On 2024/1/26 2:06, Catalin Marinas wrote:
> > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > BBM(break-before-make) logic when changing page tables.
> > > This set of patches fix this by adding necessary BBM sequence when
> > > changing page table, and supporting vmemmap page fault handling to
> > > fixup kernel address translation fault if vmemmap is concurrently accessed.
[...]
> > How often is this code path called? I wonder whether a stop_machine()
> > approach would be simpler.
> As long as allocating or releasing hugetlb is called.  We cannot limit users
> to only allocate or release hugetlb
> when booting or not running any workload on all other cpus, so if use
> stop_machine(), it will be triggered
> 8 times every 2M and 4096 times every 1G, which is probably too expensive.

I'm hoping this can be batched somehow and not do a stop_machine() (or
8) for every 2MB huge page.

Just to make sure I understand - is the goal to be able to free struct
pages corresponding to hugetlbfs pages? Can we not leave the vmemmap in
place and just release that memory to the page allocator? The physical
RAM for those struct pages isn't going anywhere, we just have a vmemmap
alias to it (cacheable).

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 12:11           ` Will Deacon
@ 2024-02-07 14:17             ` Matthew Wilcox
  -1 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-07 14:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?)  While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
> 
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

Yes, this absolutely can happen (with this patch), and this patch should
be dropped for now.

While this array of ~512 pages have been allocated to hugetlbfs, and one
would think that there would be no way that there could still be
references to them, another CPU can have a pointer to this struct page
(eg attempting a speculative page cache reference or
get_user_pages_fast()).  That means it will try to call
atomic_add_unless(&page->_refcount, 1, 0);

Actually, I wonder if this isn't a problem on x86 too?  Do we need to
explicitly go through an RCU grace period before freeing the pages
for use by somebody else?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-07 14:17             ` Matthew Wilcox
  0 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-07 14:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?)  While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
> 
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

Yes, this absolutely can happen (with this patch), and this patch should
be dropped for now.

While this array of ~512 pages have been allocated to hugetlbfs, and one
would think that there would be no way that there could still be
references to them, another CPU can have a pointer to this struct page
(eg attempting a speculative page cache reference or
get_user_pages_fast()).  That means it will try to call
atomic_add_unless(&page->_refcount, 1, 0);

Actually, I wonder if this isn't a problem on x86 too?  Do we need to
explicitly go through an RCU grace period before freeing the pages
for use by somebody else?

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 14:17             ` Matthew Wilcox
@ 2024-02-08  2:24               ` Jane Chu
  -1 siblings, 0 replies; 50+ messages in thread
From: Jane Chu @ 2024-02-08  2:24 UTC (permalink / raw)
  To: Matthew Wilcox, Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2/7/2024 6:17 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
>> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
>>> The pte lock cannot be taken in irq context (which I think is what
>>> you're asking?)  While it is not possible to reason about all users of
>>> struct page, we are somewhat relieved of that work by noting that this is
>>> only for hugetlbfs, so we don't need to reason about slab, page tables,
>>> netmem or zsmalloc.
>> My concern is that an interrupt handler tries to access a 'struct page'
>> which faults due to another core splitting a pmd mapping for the vmemmap.
>> In this case, I think we'll end up trying to resolve the fault from irq
>> context, which will try to take the spinlock.
> Yes, this absolutely can happen (with this patch), and this patch should
> be dropped for now.
>
> While this array of ~512 pages have been allocated to hugetlbfs, and one
> would think that there would be no way that there could still be
> references to them, another CPU can have a pointer to this struct page
> (eg attempting a speculative page cache reference or
> get_user_pages_fast()).  That means it will try to call
> atomic_add_unless(&page->_refcount, 1, 0);
>
> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
> explicitly go through an RCU grace period before freeing the pages
> for use by somebody else?
>
Sorry, not sure what I'm missing, please help.

 From hugetlb allocation perspective,  one of the scenarios is run time 
hugetlb page allocation (say 2M pages), starting from the buddy 
allocator returns compound pages, then the head page is set to frozen, 
then the folio(compound pages) is put thru the HVO process, one of which 
is vmemmap_split_pmd() in case a vmemmap page is a PMD page.

Until the HVO process completes, none of the vmemmap represented pages 
are available to any threads, so what are the causes for IRQ threads to 
access their vmemmap pages?

thanks!

-jane



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-08  2:24               ` Jane Chu
  0 siblings, 0 replies; 50+ messages in thread
From: Jane Chu @ 2024-02-08  2:24 UTC (permalink / raw)
  To: Matthew Wilcox, Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2/7/2024 6:17 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
>> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
>>> The pte lock cannot be taken in irq context (which I think is what
>>> you're asking?)  While it is not possible to reason about all users of
>>> struct page, we are somewhat relieved of that work by noting that this is
>>> only for hugetlbfs, so we don't need to reason about slab, page tables,
>>> netmem or zsmalloc.
>> My concern is that an interrupt handler tries to access a 'struct page'
>> which faults due to another core splitting a pmd mapping for the vmemmap.
>> In this case, I think we'll end up trying to resolve the fault from irq
>> context, which will try to take the spinlock.
> Yes, this absolutely can happen (with this patch), and this patch should
> be dropped for now.
>
> While this array of ~512 pages have been allocated to hugetlbfs, and one
> would think that there would be no way that there could still be
> references to them, another CPU can have a pointer to this struct page
> (eg attempting a speculative page cache reference or
> get_user_pages_fast()).  That means it will try to call
> atomic_add_unless(&page->_refcount, 1, 0);
>
> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
> explicitly go through an RCU grace period before freeing the pages
> for use by somebody else?
>
Sorry, not sure what I'm missing, please help.

 From hugetlb allocation perspective,  one of the scenarios is run time 
hugetlb page allocation (say 2M pages), starting from the buddy 
allocator returns compound pages, then the head page is set to frozen, 
then the folio(compound pages) is put thru the HVO process, one of which 
is vmemmap_split_pmd() in case a vmemmap page is a PMD page.

Until the HVO process completes, none of the vmemmap represented pages 
are available to any threads, so what are the causes for IRQ threads to 
access their vmemmap pages?

thanks!

-jane



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
  2024-02-07 12:21     ` Mark Rutland
@ 2024-02-08  9:30       ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-02-08  9:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2024/2/7 20:21, Mark Rutland wrote:

> On Sat, Jan 13, 2024 at 05:44:35PM +0800, Nanyong Sun wrote:
>> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
>> BBM(break-before-make) logic when change the page table of vmemmap
>> address, they will under the init_mm.page_table_lock.
>> If a translation fault of vmemmap address concurrently happened after
>> pte/pmd cleared, vmemmap page fault handler will acquire the
>> init_mm.page_table_lock to wait for vmemmap update to complete,
>> by then the virtual address is valid again, so PF can return and
>> access can continue.
> Is that wait bounded? ... and is it event guaranteed to make progress?
>
> Under a hypervisor, the vCPU doing the BBM could be preempted between the break
> and the make, so the thread waiting might be waiting a long time for that to
> come back and finish the make step.
>
> Further, under PREEMPT_RT regular spinlocks don't inhibit preemption, and I suspect
> that means this can deadlock on RT -- the thread doing the BBM could be
> preempted, the newly-scheduled thread could try to access the vmemmap, and then
> get stuck in the fault handler (e.g. on a single CPU system). There's nothing
> below describing how that's prevented.
>
> I've concerned this may be subtly broken, and it feels like this is going to be
> *very* painful to maintain and test. IMO this is trying to be overly clever and
> I'd much rather that we avoided the transient broken step. On CPUs with
> FEAT_BBM level 2 we can avoid that broken step, can we make this depend
> dynamically on whether the CPU has FEAT_BBM?
If I understand correctly, FEAT_BBM is only used for changing block 
size. But in HVO
we not only need changing block size in vmemmap_split_pmd(), but also 
need changing
output address of PTE in vmemmap_remap_pte().
So I would like to ask if FEAT_BBM can cover these scenarios?
>
>> In other case, do the traditional kernel fault.
>>
>> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
>> to do because tlb already flushed in every single BBM.
>>
>> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
>> ---
>>
>> +
>> +/*
>> + * PMD mapped vmemmap should has been split as PTE mapped
>> + * by HVO now, here we only check this case, other cases
>> + * should fail.
> Sorry, I can't parse what this is trying to say.
>
>> + * Also should check the addr is healthy enough that will not cause
>> + * a level2 or level3 translation fault again after page fault
>> + * handled with success, so we need check both bits[1:0] of PMD and
>> + * PTE as ARM Spec mentioned below:
> Which spec? Iassume you mean the ARM ARM? Are you referring to a specific part
> within that?

Yes, I referenced section D5.8.1 "Types of MMU faults" from Arm ARM.

>
>> + * A Translation fault is generated if bits[1:0] of a translation
>> + * table descriptor identify the descriptor as either a Fault
>> + * encoding or a reserved encoding.
>> + */
>> +static inline bool vmemmap_addr_healthy(unsigned long addr)
>> +{
>> +	pmd_t *pmdp, pmd;
>> +	pte_t *ptep, pte;
>> +
>> +	pmdp = pmd_off_k(addr);
>> +	pmd = pmdp_get(pmdp);
>> +	if (!pmd_table(pmd))
>> +		return false;
> Is a block (i.e. hugetlb) entry not considered healthy? I thought the whole
> point of this optimization was that you'd use a block PMD entry?
Yes, this patch only condiser the user is HVO and then recheck here, and 
in HVO, only PMD split and
PTE remap can happen. Any other scenarios should be treated as regular 
kernel fault and then panic.
>
>> +
>> +	ptep = pte_offset_kernel(pmdp, addr);
>> +	pte = ptep_get(ptep);
>> +	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
>> +}
>> +
>> +static bool vmemmap_handle_page_fault(unsigned long addr,
>> +				      unsigned long esr)
>> +{
>> +	bool ret;
>> +
>> +	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
>> +		return false;
>> +
>> +	spin_lock(&init_mm.page_table_lock);
>> +	ret = vmemmap_addr_healthy(addr);
>> +	spin_unlock(&init_mm.page_table_lock);
> As above, I'm pretty sure this is only safe is the code doing the BBM has IRQs
> disabled, otherwise the thread can be preempted, and we can get stuck in here
> while the entry is broken.
>
> So at minimum this needs some explanation of why that doesn't happen in a
> comment.
>
>
> As above, if this happens with IRQs unmasked, the thread can potentially be
> preempted and we can get stuck in the fault handler (at least on RT).
>
> I can't tell whether this is safe, and I think that at minimum this needs
> comments and/or lockdep assertions, but I'd much rather we didn't try to play
> this sort of game.
>
> Mark.
>
> .

For IRQ context problem, I wonder if take a new spin lock with irq 
disabled can solve it?

eg.

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	spin_lock_irq(NEW_LOCK);
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+	spin_unlock_irq(NEW_LOCK);
+}


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely
@ 2024-02-08  9:30       ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-02-08  9:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, will, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, willy, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2024/2/7 20:21, Mark Rutland wrote:

> On Sat, Jan 13, 2024 at 05:44:35PM +0800, Nanyong Sun wrote:
>> Implement vmemmap_update_pmd and vmemmap_update_pte on arm64 to do
>> BBM(break-before-make) logic when change the page table of vmemmap
>> address, they will under the init_mm.page_table_lock.
>> If a translation fault of vmemmap address concurrently happened after
>> pte/pmd cleared, vmemmap page fault handler will acquire the
>> init_mm.page_table_lock to wait for vmemmap update to complete,
>> by then the virtual address is valid again, so PF can return and
>> access can continue.
> Is that wait bounded? ... and is it event guaranteed to make progress?
>
> Under a hypervisor, the vCPU doing the BBM could be preempted between the break
> and the make, so the thread waiting might be waiting a long time for that to
> come back and finish the make step.
>
> Further, under PREEMPT_RT regular spinlocks don't inhibit preemption, and I suspect
> that means this can deadlock on RT -- the thread doing the BBM could be
> preempted, the newly-scheduled thread could try to access the vmemmap, and then
> get stuck in the fault handler (e.g. on a single CPU system). There's nothing
> below describing how that's prevented.
>
> I've concerned this may be subtly broken, and it feels like this is going to be
> *very* painful to maintain and test. IMO this is trying to be overly clever and
> I'd much rather that we avoided the transient broken step. On CPUs with
> FEAT_BBM level 2 we can avoid that broken step, can we make this depend
> dynamically on whether the CPU has FEAT_BBM?
If I understand correctly, FEAT_BBM is only used for changing block 
size. But in HVO
we not only need changing block size in vmemmap_split_pmd(), but also 
need changing
output address of PTE in vmemmap_remap_pte().
So I would like to ask if FEAT_BBM can cover these scenarios?
>
>> In other case, do the traditional kernel fault.
>>
>> Implement vmemmap_flush_tlb_all/range on arm64 with nothing
>> to do because tlb already flushed in every single BBM.
>>
>> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
>> ---
>>
>> +
>> +/*
>> + * PMD mapped vmemmap should has been split as PTE mapped
>> + * by HVO now, here we only check this case, other cases
>> + * should fail.
> Sorry, I can't parse what this is trying to say.
>
>> + * Also should check the addr is healthy enough that will not cause
>> + * a level2 or level3 translation fault again after page fault
>> + * handled with success, so we need check both bits[1:0] of PMD and
>> + * PTE as ARM Spec mentioned below:
> Which spec? Iassume you mean the ARM ARM? Are you referring to a specific part
> within that?

Yes, I referenced section D5.8.1 "Types of MMU faults" from Arm ARM.

>
>> + * A Translation fault is generated if bits[1:0] of a translation
>> + * table descriptor identify the descriptor as either a Fault
>> + * encoding or a reserved encoding.
>> + */
>> +static inline bool vmemmap_addr_healthy(unsigned long addr)
>> +{
>> +	pmd_t *pmdp, pmd;
>> +	pte_t *ptep, pte;
>> +
>> +	pmdp = pmd_off_k(addr);
>> +	pmd = pmdp_get(pmdp);
>> +	if (!pmd_table(pmd))
>> +		return false;
> Is a block (i.e. hugetlb) entry not considered healthy? I thought the whole
> point of this optimization was that you'd use a block PMD entry?
Yes, this patch only condiser the user is HVO and then recheck here, and 
in HVO, only PMD split and
PTE remap can happen. Any other scenarios should be treated as regular 
kernel fault and then panic.
>
>> +
>> +	ptep = pte_offset_kernel(pmdp, addr);
>> +	pte = ptep_get(ptep);
>> +	return (pte_val(pte) & PTE_TYPE_MASK) == PTE_TYPE_PAGE;
>> +}
>> +
>> +static bool vmemmap_handle_page_fault(unsigned long addr,
>> +				      unsigned long esr)
>> +{
>> +	bool ret;
>> +
>> +	if (likely(!vmemmap_fault_may_fixup(addr, esr)))
>> +		return false;
>> +
>> +	spin_lock(&init_mm.page_table_lock);
>> +	ret = vmemmap_addr_healthy(addr);
>> +	spin_unlock(&init_mm.page_table_lock);
> As above, I'm pretty sure this is only safe is the code doing the BBM has IRQs
> disabled, otherwise the thread can be preempted, and we can get stuck in here
> while the entry is broken.
>
> So at minimum this needs some explanation of why that doesn't happen in a
> comment.
>
>
> As above, if this happens with IRQs unmasked, the thread can potentially be
> preempted and we can get stuck in the fault handler (at least on RT).
>
> I can't tell whether this is safe, and I think that at minimum this needs
> comments and/or lockdep assertions, but I'd much rather we didn't try to play
> this sort of game.
>
> Mark.
>
> .

For IRQ context problem, I wonder if take a new spin lock with irq 
disabled can solve it?

eg.

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	spin_lock_irq(NEW_LOCK);
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+	spin_unlock_irq(NEW_LOCK);
+}


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-07 12:20           ` Catalin Marinas
@ 2024-02-08  9:44             ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-02-08  9:44 UTC (permalink / raw)
  To: Catalin Marinas, Matthew Wilcox
  Cc: Will Deacon, mike.kravetz, muchun.song, akpm, anshuman.khandual,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm


在 2024/2/7 20:20, Catalin Marinas 写道:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
>> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
>>> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
>>>> On 2024/1/26 2:06, Catalin Marinas wrote:
>>>>> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
>>>>>> HVO was previously disabled on arm64 [1] due to the lack of necessary
>>>>>> BBM(break-before-make) logic when changing page tables.
>>>>>> This set of patches fix this by adding necessary BBM sequence when
>>>>>> changing page table, and supporting vmemmap page fault handling to
>>>>>> fixup kernel address translation fault if vmemmap is concurrently accessed.
>>>>> I'm not keen on this approach. I'm not even sure it's safe. In the
>>>>> second patch, you take the init_mm.page_table_lock on the fault path but
>>>>> are we sure this is unlocked when the fault was taken?
>>>> I think this situation is impossible. In the implementation of the second
>>>> patch, when the page table is being corrupted
>>>> (the time window when a page fault may occur), vmemmap_update_pte() already
>>>> holds the init_mm.page_table_lock,
>>>> and unlock it until page table update is done.Another thread could not hold
>>>> the init_mm.page_table_lock and
>>>> also trigger a page fault at the same time.
>>>> If I have missed any points in my thinking, please correct me. Thank you.
>>> It still strikes me as incredibly fragile to handle the fault and trying
>>> to reason about all the users of 'struct page' is impossible. For example,
>>> can the fault happen from irq context?
>> The pte lock cannot be taken in irq context (which I think is what
>> you're asking?)
> With this patchset, I think it can: IRQ -> interrupt handler accesses
> vmemmap -> faults -> fault handler in patch 2 takes the
> init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
> Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
> such fault in any kernel context looks fragile.
How about take a new lock with irq disabled during BBM, like:

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+    spin_lock_irq(NEW_LOCK);
+    pte_clear(&init_mm, addr, ptep);
+    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+    set_pte_at(&init_mm, addr, ptep, pte);
+    spin_unlock_irq(NEW_LOCK);
+}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-08  9:44             ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-02-08  9:44 UTC (permalink / raw)
  To: Catalin Marinas, Matthew Wilcox
  Cc: Will Deacon, mike.kravetz, muchun.song, akpm, anshuman.khandual,
	wangkefeng.wang, linux-arm-kernel, linux-kernel, linux-mm


在 2024/2/7 20:20, Catalin Marinas 写道:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
>> On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
>>> On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
>>>> On 2024/1/26 2:06, Catalin Marinas wrote:
>>>>> On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
>>>>>> HVO was previously disabled on arm64 [1] due to the lack of necessary
>>>>>> BBM(break-before-make) logic when changing page tables.
>>>>>> This set of patches fix this by adding necessary BBM sequence when
>>>>>> changing page table, and supporting vmemmap page fault handling to
>>>>>> fixup kernel address translation fault if vmemmap is concurrently accessed.
>>>>> I'm not keen on this approach. I'm not even sure it's safe. In the
>>>>> second patch, you take the init_mm.page_table_lock on the fault path but
>>>>> are we sure this is unlocked when the fault was taken?
>>>> I think this situation is impossible. In the implementation of the second
>>>> patch, when the page table is being corrupted
>>>> (the time window when a page fault may occur), vmemmap_update_pte() already
>>>> holds the init_mm.page_table_lock,
>>>> and unlock it until page table update is done.Another thread could not hold
>>>> the init_mm.page_table_lock and
>>>> also trigger a page fault at the same time.
>>>> If I have missed any points in my thinking, please correct me. Thank you.
>>> It still strikes me as incredibly fragile to handle the fault and trying
>>> to reason about all the users of 'struct page' is impossible. For example,
>>> can the fault happen from irq context?
>> The pte lock cannot be taken in irq context (which I think is what
>> you're asking?)
> With this patchset, I think it can: IRQ -> interrupt handler accesses
> vmemmap -> faults -> fault handler in patch 2 takes the
> init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
> Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
> such fault in any kernel context looks fragile.
How about take a new lock with irq disabled during BBM, like:

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+    spin_lock_irq(NEW_LOCK);
+    pte_clear(&init_mm, addr, ptep);
+    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+    set_pte_at(&init_mm, addr, ptep, pte);
+    spin_unlock_irq(NEW_LOCK);
+}

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-08  9:44             ` Nanyong Sun
@ 2024-02-08 13:17               ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-08 13:17 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, Matthew Wilcox, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Thu, Feb 08, 2024 at 05:44:48PM +0800, Nanyong Sun wrote:
> 
> 在 2024/2/7 20:20, Catalin Marinas 写道:
> > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > > > BBM(break-before-make) logic when changing page tables.
> > > > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > > are we sure this is unlocked when the fault was taken?
> > > > > I think this situation is impossible. In the implementation of the second
> > > > > patch, when the page table is being corrupted
> > > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > > holds the init_mm.page_table_lock,
> > > > > and unlock it until page table update is done.Another thread could not hold
> > > > > the init_mm.page_table_lock and
> > > > > also trigger a page fault at the same time.
> > > > > If I have missed any points in my thinking, please correct me. Thank you.
> > > > It still strikes me as incredibly fragile to handle the fault and trying
> > > > to reason about all the users of 'struct page' is impossible. For example,
> > > > can the fault happen from irq context?
> > > The pte lock cannot be taken in irq context (which I think is what
> > > you're asking?)
> > With this patchset, I think it can: IRQ -> interrupt handler accesses
> > vmemmap -> faults -> fault handler in patch 2 takes the
> > init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
> > Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
> > such fault in any kernel context looks fragile.
> How about take a new lock with irq disabled during BBM, like:
> 
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> +    spin_lock_irq(NEW_LOCK);
> +    pte_clear(&init_mm, addr, ptep);
> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +    set_pte_at(&init_mm, addr, ptep, pte);
> +    spin_unlock_irq(NEW_LOCK);
> +}

I really think the only maintainable way to achieve this is to avoid the
possibility of a fault altogether.

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-08 13:17               ` Will Deacon
  0 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-02-08 13:17 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: Catalin Marinas, Matthew Wilcox, mike.kravetz, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Thu, Feb 08, 2024 at 05:44:48PM +0800, Nanyong Sun wrote:
> 
> 在 2024/2/7 20:20, Catalin Marinas 写道:
> > On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > > > BBM(break-before-make) logic when changing page tables.
> > > > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > > are we sure this is unlocked when the fault was taken?
> > > > > I think this situation is impossible. In the implementation of the second
> > > > > patch, when the page table is being corrupted
> > > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > > holds the init_mm.page_table_lock,
> > > > > and unlock it until page table update is done.Another thread could not hold
> > > > > the init_mm.page_table_lock and
> > > > > also trigger a page fault at the same time.
> > > > > If I have missed any points in my thinking, please correct me. Thank you.
> > > > It still strikes me as incredibly fragile to handle the fault and trying
> > > > to reason about all the users of 'struct page' is impossible. For example,
> > > > can the fault happen from irq context?
> > > The pte lock cannot be taken in irq context (which I think is what
> > > you're asking?)
> > With this patchset, I think it can: IRQ -> interrupt handler accesses
> > vmemmap -> faults -> fault handler in patch 2 takes the
> > init_mm.page_table_lock to wait for the vmemmap rewriting to complete.
> > Maybe it works if the hugetlb code disabled the IRQs but, as Will said,
> > such fault in any kernel context looks fragile.
> How about take a new lock with irq disabled during BBM, like:
> 
> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> +    spin_lock_irq(NEW_LOCK);
> +    pte_clear(&init_mm, addr, ptep);
> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +    set_pte_at(&init_mm, addr, ptep, pte);
> +    spin_unlock_irq(NEW_LOCK);
> +}

I really think the only maintainable way to achieve this is to avoid the
possibility of a fault altogether.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-08  2:24               ` Jane Chu
@ 2024-02-08 15:49                 ` Matthew Wilcox
  -1 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-08 15:49 UTC (permalink / raw)
  To: Jane Chu
  Cc: Will Deacon, Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
> > While this array of ~512 pages have been allocated to hugetlbfs, and one
> > would think that there would be no way that there could still be
> > references to them, another CPU can have a pointer to this struct page
> > (eg attempting a speculative page cache reference or
> > get_user_pages_fast()).  That means it will try to call
> > atomic_add_unless(&page->_refcount, 1, 0);
> > 
> > Actually, I wonder if this isn't a problem on x86 too?  Do we need to
> > explicitly go through an RCU grace period before freeing the pages
> > for use by somebody else?
> > 
> Sorry, not sure what I'm missing, please help.

Having written out the analysis, I now think it can't happen on x86,
but let's walk through it because it's non-obvious (and I think it
illustrates what people are afraid of on Arm).

CPU A calls either get_user_pages_fast() or __filemap_get_folio().
Let's do the latter this time.

        folio = filemap_get_entry(mapping, index);
filemap_get_entry:
	rcu_read_lock();
        folio = xas_load(&xas);
        if (!folio_try_get_rcu(folio))
                goto repeat;
        if (unlikely(folio != xas_reload(&xas))) {
                folio_put(folio);
                goto repeat;
        }
folio_try_get_rcu:
	folio_ref_try_add_rcu(folio, 1);
folio_ref_try_add_rcu:
        if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
                /* Either the folio has been freed, or will be freed. */
                return false;
folio_ref_add_unless:
        return page_ref_add_unless(&folio->page, nr, u);
page_ref_add_unless:
	atomic_add_unless(&page->_refcount, nr, u);

A rather deep callchain there, but for our purposes the important part
is: we take the RCU read lock, we look up a folio, we increment its
refcount if it's not zero, then check that looking up this index gets
the same folio; if it doesn't, we decrement the refcount again and retry
the lookup.

For this analysis, we can be preempted at any point after we've got the
folio pointer from xa_load().

> From hugetlb allocation perspective,  one of the scenarios is run time
> hugetlb page allocation (say 2M pages), starting from the buddy allocator
> returns compound pages, then the head page is set to frozen, then the
> folio(compound pages) is put thru the HVO process, one of which is
> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
> 
> Until the HVO process completes, none of the vmemmap represented pages are
> available to any threads, so what are the causes for IRQ threads to access
> their vmemmap pages?

Yup, this sounds like enough, but it's not.  The problem is the person
who's looking up the folio in the pagecache under RCU.  They've got
the folio pointer and have been preempted.  So now what happens to our
victim folio?

Something happens to remove it from the page cache.  Maybe the file is
truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
removed from the xarray and gets its refcount set to 0.  If the lookup
were to continue at this time, everything would be fine because it would
see a refcount of 0 and not increment it (in page_ref_add_unless()).
And this is where my analysis of RCU tends to go wrong, because I only
think of interleaving event A and B.  I don't think about B and then C
happening before A resumes.  But it can!  Let's follow the journey of
this struct page.

Now that it's been removed from the page cache, it's allocated by hugetlb,
as you describe.  And it's one of the tail pages towards the end of
the 512 contiguous struct pages.  That means that we alter vmemmap so
that the pointer to struct page now points to a different struct page
(one of the earlier ones).  Then the original page of vmemmap containing
our lucky struct page is returned to the page allocator.  At this point,
it no longer contains struct pages; it can contain literally anything.

Where my analysis went wrong was that CPU A _no longer has a pointer
to it_.  CPU A has a pointer into vmemmap.  So it will access the
replacement struct page (which definitely has a refcount 0) instead of
the one which has been freed.  I had thought that CPU A would access the
original memory which has now been allocated to someone else.  But no,
it can't because its pointer is virtual, not physical.

---

Now I'm thinking more about this and there's another scenario which I
thought might go wrong, and doesn't.  For 7 of the 512 pages which are
freed, the struct page pointer gathered by CPU A will not point to a
page with a refcount of 0.  Instead it will point to an alias of the
head page with a positive refcount.  For those pages, CPU A will see
folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
the folio isn't there any more, so it will call folio_put() on something
which used to be a folio, and isn't any more.

But folio_put() calls folio_put_testzero() which calls put_page_testzero()
without asserting that the pointer is actually to a folio.
So everything's fine, but really only by coincidence; I don't think
anybody's thought about this scenario before (maybe Muchun has, but I
don't remember it being discussed).

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-08 15:49                 ` Matthew Wilcox
  0 siblings, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2024-02-08 15:49 UTC (permalink / raw)
  To: Jane Chu
  Cc: Will Deacon, Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
> > While this array of ~512 pages have been allocated to hugetlbfs, and one
> > would think that there would be no way that there could still be
> > references to them, another CPU can have a pointer to this struct page
> > (eg attempting a speculative page cache reference or
> > get_user_pages_fast()).  That means it will try to call
> > atomic_add_unless(&page->_refcount, 1, 0);
> > 
> > Actually, I wonder if this isn't a problem on x86 too?  Do we need to
> > explicitly go through an RCU grace period before freeing the pages
> > for use by somebody else?
> > 
> Sorry, not sure what I'm missing, please help.

Having written out the analysis, I now think it can't happen on x86,
but let's walk through it because it's non-obvious (and I think it
illustrates what people are afraid of on Arm).

CPU A calls either get_user_pages_fast() or __filemap_get_folio().
Let's do the latter this time.

        folio = filemap_get_entry(mapping, index);
filemap_get_entry:
	rcu_read_lock();
        folio = xas_load(&xas);
        if (!folio_try_get_rcu(folio))
                goto repeat;
        if (unlikely(folio != xas_reload(&xas))) {
                folio_put(folio);
                goto repeat;
        }
folio_try_get_rcu:
	folio_ref_try_add_rcu(folio, 1);
folio_ref_try_add_rcu:
        if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
                /* Either the folio has been freed, or will be freed. */
                return false;
folio_ref_add_unless:
        return page_ref_add_unless(&folio->page, nr, u);
page_ref_add_unless:
	atomic_add_unless(&page->_refcount, nr, u);

A rather deep callchain there, but for our purposes the important part
is: we take the RCU read lock, we look up a folio, we increment its
refcount if it's not zero, then check that looking up this index gets
the same folio; if it doesn't, we decrement the refcount again and retry
the lookup.

For this analysis, we can be preempted at any point after we've got the
folio pointer from xa_load().

> From hugetlb allocation perspective,  one of the scenarios is run time
> hugetlb page allocation (say 2M pages), starting from the buddy allocator
> returns compound pages, then the head page is set to frozen, then the
> folio(compound pages) is put thru the HVO process, one of which is
> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
> 
> Until the HVO process completes, none of the vmemmap represented pages are
> available to any threads, so what are the causes for IRQ threads to access
> their vmemmap pages?

Yup, this sounds like enough, but it's not.  The problem is the person
who's looking up the folio in the pagecache under RCU.  They've got
the folio pointer and have been preempted.  So now what happens to our
victim folio?

Something happens to remove it from the page cache.  Maybe the file is
truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
removed from the xarray and gets its refcount set to 0.  If the lookup
were to continue at this time, everything would be fine because it would
see a refcount of 0 and not increment it (in page_ref_add_unless()).
And this is where my analysis of RCU tends to go wrong, because I only
think of interleaving event A and B.  I don't think about B and then C
happening before A resumes.  But it can!  Let's follow the journey of
this struct page.

Now that it's been removed from the page cache, it's allocated by hugetlb,
as you describe.  And it's one of the tail pages towards the end of
the 512 contiguous struct pages.  That means that we alter vmemmap so
that the pointer to struct page now points to a different struct page
(one of the earlier ones).  Then the original page of vmemmap containing
our lucky struct page is returned to the page allocator.  At this point,
it no longer contains struct pages; it can contain literally anything.

Where my analysis went wrong was that CPU A _no longer has a pointer
to it_.  CPU A has a pointer into vmemmap.  So it will access the
replacement struct page (which definitely has a refcount 0) instead of
the one which has been freed.  I had thought that CPU A would access the
original memory which has now been allocated to someone else.  But no,
it can't because its pointer is virtual, not physical.

---

Now I'm thinking more about this and there's another scenario which I
thought might go wrong, and doesn't.  For 7 of the 512 pages which are
freed, the struct page pointer gathered by CPU A will not point to a
page with a refcount of 0.  Instead it will point to an alias of the
head page with a positive refcount.  For those pages, CPU A will see
folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
the folio isn't there any more, so it will call folio_put() on something
which used to be a folio, and isn't any more.

But folio_put() calls folio_put_testzero() which calls put_page_testzero()
without asserting that the pointer is actually to a folio.
So everything's fine, but really only by coincidence; I don't think
anybody's thought about this scenario before (maybe Muchun has, but I
don't remember it being discussed).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-08 15:49                 ` Matthew Wilcox
@ 2024-02-08 19:21                   ` Jane Chu
  -1 siblings, 0 replies; 50+ messages in thread
From: Jane Chu @ 2024-02-08 19:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Will Deacon, Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2/8/2024 7:49 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages have been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>>
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>>
>> Sorry, not sure what I'm missing, please help.
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
>
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
>
>          folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
> 	rcu_read_lock();
>          folio = xas_load(&xas);
>          if (!folio_try_get_rcu(folio))
>                  goto repeat;
>          if (unlikely(folio != xas_reload(&xas))) {
>                  folio_put(folio);
>                  goto repeat;
>          }
> folio_try_get_rcu:
> 	folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>          if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                  /* Either the folio has been freed, or will be freed. */
>                  return false;
> folio_ref_add_unless:
>          return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
> 	atomic_add_unless(&page->_refcount, nr, u);
>
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
>
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
>
>>  From hugetlb allocation perspective,  one of the scenarios is run time
>> hugetlb page allocation (say 2M pages), starting from the buddy allocator
>> returns compound pages, then the head page is set to frozen, then the
>> folio(compound pages) is put thru the HVO process, one of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>>
>> Until the HVO process completes, none of the vmemmap represented pages are
>> available to any threads, so what are the causes for IRQ threads to access
>> their vmemmap pages?
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
>
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving event A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
>
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
>
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
>
>
> ---
>
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
>
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

Wow!  Marvelous analysis, thank you!

So is the solution simple as making folio_put_testzero() to check 
whether the folio pointer actually points to a folio?

or there is more to consider?

Thanks a lot!

-jane


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-08 19:21                   ` Jane Chu
  0 siblings, 0 replies; 50+ messages in thread
From: Jane Chu @ 2024-02-08 19:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Will Deacon, Nanyong Sun, Catalin Marinas, muchun.song, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm

On 2/8/2024 7:49 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages have been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>>
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>>
>> Sorry, not sure what I'm missing, please help.
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
>
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
>
>          folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
> 	rcu_read_lock();
>          folio = xas_load(&xas);
>          if (!folio_try_get_rcu(folio))
>                  goto repeat;
>          if (unlikely(folio != xas_reload(&xas))) {
>                  folio_put(folio);
>                  goto repeat;
>          }
> folio_try_get_rcu:
> 	folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>          if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                  /* Either the folio has been freed, or will be freed. */
>                  return false;
> folio_ref_add_unless:
>          return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
> 	atomic_add_unless(&page->_refcount, nr, u);
>
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
>
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
>
>>  From hugetlb allocation perspective,  one of the scenarios is run time
>> hugetlb page allocation (say 2M pages), starting from the buddy allocator
>> returns compound pages, then the head page is set to frozen, then the
>> folio(compound pages) is put thru the HVO process, one of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>>
>> Until the HVO process completes, none of the vmemmap represented pages are
>> available to any threads, so what are the causes for IRQ threads to access
>> their vmemmap pages?
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
>
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving event A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
>
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
>
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
>
>
> ---
>
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
>
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

Wow!  Marvelous analysis, thank you!

So is the solution simple as making folio_put_testzero() to check 
whether the folio pointer actually points to a folio?

or there is more to consider?

Thanks a lot!

-jane


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-08 15:49                 ` Matthew Wilcox
@ 2024-02-11 11:59                   ` Muchun Song
  -1 siblings, 0 replies; 50+ messages in thread
From: Muchun Song @ 2024-02-11 11:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jane Chu, Will Deacon, Nanyong Sun, Catalin Marinas, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm



> On Feb 8, 2024, at 23:49, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages have been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>> 
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>> 
>> Sorry, not sure what I'm missing, please help.
> 
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
> 
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
> 
>        folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
>        rcu_read_lock();
>        folio = xas_load(&xas);
>        if (!folio_try_get_rcu(folio))
>                goto repeat;
>        if (unlikely(folio != xas_reload(&xas))) {
>                folio_put(folio);
>                goto repeat;
>        }
> folio_try_get_rcu:
>        folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>        if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                /* Either the folio has been freed, or will be freed. */
>                return false;
> folio_ref_add_unless:
>        return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
>        atomic_add_unless(&page->_refcount, nr, u);
> 
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
> 
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
> 
>> From hugetlb allocation perspective,  one of the scenarios is run time
>> hugetlb page allocation (say 2M pages), starting from the buddy allocator
>> returns compound pages, then the head page is set to frozen, then the
>> folio(compound pages) is put thru the HVO process, one of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>> 
>> Until the HVO process completes, none of the vmemmap represented pages are
>> available to any threads, so what are the causes for IRQ threads to access
>> their vmemmap pages?
> 
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
> 
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving event A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
> 
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
> 
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
> 
> 
> ---
> 
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
> 
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

I have to say it is a really great analysis, I haven't thought about the
case of get_page_unless_zero() so deeply.

To avoid increasing a refcount to a tail page struct, I have made
all the 7 tail pages read-only when I first write those code. But it
is a really problem, because it will panic (due to RO permission)
when encountering the above scenario to increase its refcount.

In order to fix the race with __filemap_get_folio(), my first
thought of fixing this issue is to add a rcu_synchronize() after
the processing of HVO optimization and before being allocated to
users. Note that HugePage pages are frozen before going through
the precessing of HVO optimization meaning all the refcount of all
the struct pages are 0. Therefore, folio_try_get_rcu() in
__filemap_get_folio() will fail unless the HugeTLB page has been
allocated to the user.

But I realized there are some users who may pass a arbitrary
page struct (which may be those 7 special tail page structs,
alias of the head page struct, of a HugeTLB page) to the following
helpers, which also could get a refcount of a tail page struct.
Those helpers also need to be fixed.

  1) get_page_unless_zero
  2) folio_try_get
  3) folio_try_get_rcu

I have checked all the users of 1), If I am not wrong, all the users
already handle the HugeTLB pages before calling to get_page_unless_zero().
Although there is no problem with 1) now, it will be fragile to let users
guarantee that it will not pass any tail pages of a HugeTLB page to
1). So I want to change 1) to the following to fix this.

	static inline bool get_page_unless_zero(struct page *page)
	{
		if (page_ref_add_unless(page, 1, 0)) {
			/* @page must be a genuine head or alias head page here. */
			struct page *head = page_fixed_fake_head(page);

			if (likely(head == page))
				return true;
			put_page(head);
		}

		return false;
	}

2) and 3) should adopt the similar approach to make sure we cannot increase
tail pages' refcount. 2) and 3) will be like the following (only demonstrate
the key logic):

	static inline bool folio_try_get(struct folio *folio)/folio_ref_try_add_rcu
	{
		if (folio_ref_add_unless(folio, 1, 0)) {
			struct folio *genuine = page_folio(&folio->page);

			if (likely(genuine == folio))
				return true;
			folio_put(genuine);
		}

		return false;
	}

Additionally, we also should alter RO permission of those 7 tail pages
to RW to avoid panic().

There is no problem in the following helpers since all of them already
handle HVO case through _compound_head(), they will get the __genuine__
head page struct and increase its refcount.

  1) try_get_page
  2) folio_get
  3) get_page

Just some thoughts from mine, maybe you guys have more simple and graceful
approaches. Comments are welcome.

Muchun,
Thanks.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-02-11 11:59                   ` Muchun Song
  0 siblings, 0 replies; 50+ messages in thread
From: Muchun Song @ 2024-02-11 11:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jane Chu, Will Deacon, Nanyong Sun, Catalin Marinas, akpm,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm



> On Feb 8, 2024, at 23:49, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages have been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>> 
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>> 
>> Sorry, not sure what I'm missing, please help.
> 
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
> 
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
> 
>        folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
>        rcu_read_lock();
>        folio = xas_load(&xas);
>        if (!folio_try_get_rcu(folio))
>                goto repeat;
>        if (unlikely(folio != xas_reload(&xas))) {
>                folio_put(folio);
>                goto repeat;
>        }
> folio_try_get_rcu:
>        folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>        if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                /* Either the folio has been freed, or will be freed. */
>                return false;
> folio_ref_add_unless:
>        return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
>        atomic_add_unless(&page->_refcount, nr, u);
> 
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
> 
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
> 
>> From hugetlb allocation perspective,  one of the scenarios is run time
>> hugetlb page allocation (say 2M pages), starting from the buddy allocator
>> returns compound pages, then the head page is set to frozen, then the
>> folio(compound pages) is put thru the HVO process, one of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>> 
>> Until the HVO process completes, none of the vmemmap represented pages are
>> available to any threads, so what are the causes for IRQ threads to access
>> their vmemmap pages?
> 
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
> 
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving event A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
> 
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
> 
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
> 
> 
> ---
> 
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
> 
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

I have to say it is a really great analysis, I haven't thought about the
case of get_page_unless_zero() so deeply.

To avoid increasing a refcount to a tail page struct, I have made
all the 7 tail pages read-only when I first write those code. But it
is a really problem, because it will panic (due to RO permission)
when encountering the above scenario to increase its refcount.

In order to fix the race with __filemap_get_folio(), my first
thought of fixing this issue is to add a rcu_synchronize() after
the processing of HVO optimization and before being allocated to
users. Note that HugePage pages are frozen before going through
the precessing of HVO optimization meaning all the refcount of all
the struct pages are 0. Therefore, folio_try_get_rcu() in
__filemap_get_folio() will fail unless the HugeTLB page has been
allocated to the user.

But I realized there are some users who may pass a arbitrary
page struct (which may be those 7 special tail page structs,
alias of the head page struct, of a HugeTLB page) to the following
helpers, which also could get a refcount of a tail page struct.
Those helpers also need to be fixed.

  1) get_page_unless_zero
  2) folio_try_get
  3) folio_try_get_rcu

I have checked all the users of 1), If I am not wrong, all the users
already handle the HugeTLB pages before calling to get_page_unless_zero().
Although there is no problem with 1) now, it will be fragile to let users
guarantee that it will not pass any tail pages of a HugeTLB page to
1). So I want to change 1) to the following to fix this.

	static inline bool get_page_unless_zero(struct page *page)
	{
		if (page_ref_add_unless(page, 1, 0)) {
			/* @page must be a genuine head or alias head page here. */
			struct page *head = page_fixed_fake_head(page);

			if (likely(head == page))
				return true;
			put_page(head);
		}

		return false;
	}

2) and 3) should adopt the similar approach to make sure we cannot increase
tail pages' refcount. 2) and 3) will be like the following (only demonstrate
the key logic):

	static inline bool folio_try_get(struct folio *folio)/folio_ref_try_add_rcu
	{
		if (folio_ref_add_unless(folio, 1, 0)) {
			struct folio *genuine = page_folio(&folio->page);

			if (likely(genuine == folio))
				return true;
			folio_put(genuine);
		}

		return false;
	}

Additionally, we also should alter RO permission of those 7 tail pages
to RW to avoid panic().

There is no problem in the following helpers since all of them already
handle HVO case through _compound_head(), they will get the __genuine__
head page struct and increase its refcount.

  1) try_get_page
  2) folio_get
  3) get_page

Just some thoughts from mine, maybe you guys have more simple and graceful
approaches. Comments are welcome.

Muchun,
Thanks.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-02-08 13:17               ` Will Deacon
@ 2024-03-13 23:32                 ` David Rientjes
  -1 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2024-03-13 23:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, Matthew Wilcox, muchun.song,
	Andrew Morton, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed,
	Sourav Panda

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

On Thu, 8 Feb 2024, Will Deacon wrote:

> > How about take a new lock with irq disabled during BBM, like:
> > 
> > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > +{
> > +    spin_lock_irq(NEW_LOCK);
> > +    pte_clear(&init_mm, addr, ptep);
> > +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +    set_pte_at(&init_mm, addr, ptep, pte);
> > +    spin_unlock_irq(NEW_LOCK);
> > +}
> 
> I really think the only maintainable way to achieve this is to avoid the
> possibility of a fault altogether.
> 
> Will
> 
> 

Nanyong, are you still actively working on making HVO possible on arm64?

This would yield a substantial memory savings on hosts that are largely 
configured with hugetlbfs.  In our case, the size of this hugetlbfs pool 
is actually never changed after boot, but it sounds from the thread that 
there was an idea to make HVO conditional on FEAT_BBM.  Is this being 
pursued?

If so, any testing help needed?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-03-13 23:32                 ` David Rientjes
  0 siblings, 0 replies; 50+ messages in thread
From: David Rientjes @ 2024-03-13 23:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nanyong Sun, Catalin Marinas, Matthew Wilcox, muchun.song,
	Andrew Morton, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed,
	Sourav Panda

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

On Thu, 8 Feb 2024, Will Deacon wrote:

> > How about take a new lock with irq disabled during BBM, like:
> > 
> > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > +{
> > +    spin_lock_irq(NEW_LOCK);
> > +    pte_clear(&init_mm, addr, ptep);
> > +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +    set_pte_at(&init_mm, addr, ptep, pte);
> > +    spin_unlock_irq(NEW_LOCK);
> > +}
> 
> I really think the only maintainable way to achieve this is to avoid the
> possibility of a fault altogether.
> 
> Will
> 
> 

Nanyong, are you still actively working on making HVO possible on arm64?

This would yield a substantial memory savings on hosts that are largely 
configured with hugetlbfs.  In our case, the size of this hugetlbfs pool 
is actually never changed after boot, but it sounds from the thread that 
there was an idea to make HVO conditional on FEAT_BBM.  Is this being 
pursued?

If so, any testing help needed?

[-- Attachment #2: Type: text/plain, Size: 176 bytes --]

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-03-13 23:32                 ` David Rientjes
@ 2024-03-25 15:24                   ` Nanyong Sun
  -1 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-03-25 15:24 UTC (permalink / raw)
  To: David Rientjes, Will Deacon
  Cc: Catalin Marinas, Matthew Wilcox, muchun.song, Andrew Morton,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed, Sourav Panda

On 2024/3/14 7:32, David Rientjes wrote:

> On Thu, 8 Feb 2024, Will Deacon wrote:
>
>>> How about take a new lock with irq disabled during BBM, like:
>>>
>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
>>> +{
>>> +     (NEW_LOCK);
>>> +    pte_clear(&init_mm, addr, ptep);
>>> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>> +    set_pte_at(&init_mm, addr, ptep, pte);
>>> +    spin_unlock_irq(NEW_LOCK);
>>> +}
>> I really think the only maintainable way to achieve this is to avoid the
>> possibility of a fault altogether.
>>
>> Will
>>
>>
> Nanyong, are you still actively working on making HVO possible on arm64?
>
> This would yield a substantial memory savings on hosts that are largely
> configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> is actually never changed after boot, but it sounds from the thread that
> there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> pursued?
>
> If so, any testing help needed?
I'm afraid that FEAT_BBM may not solve the problem here, because from 
Arm ARM,
I see that FEAT_BBM is only used for changing block size. Therefore, in 
this HVO feature,
it can work in the split PMD stage, that is, BBM can be avoided in 
vmemmap_split_pmd,
but in the subsequent vmemmap_remap_pte, the Output address of PTE still 
needs to be
changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my 
understanding
of ARM FEAT_BBM is wrong, and I hope someone can correct me.
Actually, the solution I first considered was to use the stop_machine 
method, but we have
products that rely on /proc/sys/vm/nr_overcommit_hugepages to 
dynamically use hugepages,
so I have to consider performance issues. If your product does not 
change the amount of huge
pages after booting, using stop_machine() may be a feasible way.
So far, I still haven't come up with a good solution.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-03-25 15:24                   ` Nanyong Sun
  0 siblings, 0 replies; 50+ messages in thread
From: Nanyong Sun @ 2024-03-25 15:24 UTC (permalink / raw)
  To: David Rientjes, Will Deacon
  Cc: Catalin Marinas, Matthew Wilcox, muchun.song, Andrew Morton,
	anshuman.khandual, wangkefeng.wang, linux-arm-kernel,
	linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed, Sourav Panda

On 2024/3/14 7:32, David Rientjes wrote:

> On Thu, 8 Feb 2024, Will Deacon wrote:
>
>>> How about take a new lock with irq disabled during BBM, like:
>>>
>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
>>> +{
>>> +     (NEW_LOCK);
>>> +    pte_clear(&init_mm, addr, ptep);
>>> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>> +    set_pte_at(&init_mm, addr, ptep, pte);
>>> +    spin_unlock_irq(NEW_LOCK);
>>> +}
>> I really think the only maintainable way to achieve this is to avoid the
>> possibility of a fault altogether.
>>
>> Will
>>
>>
> Nanyong, are you still actively working on making HVO possible on arm64?
>
> This would yield a substantial memory savings on hosts that are largely
> configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> is actually never changed after boot, but it sounds from the thread that
> there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> pursued?
>
> If so, any testing help needed?
I'm afraid that FEAT_BBM may not solve the problem here, because from 
Arm ARM,
I see that FEAT_BBM is only used for changing block size. Therefore, in 
this HVO feature,
it can work in the split PMD stage, that is, BBM can be avoided in 
vmemmap_split_pmd,
but in the subsequent vmemmap_remap_pte, the Output address of PTE still 
needs to be
changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my 
understanding
of ARM FEAT_BBM is wrong, and I hope someone can correct me.
Actually, the solution I first considered was to use the stop_machine 
method, but we have
products that rely on /proc/sys/vm/nr_overcommit_hugepages to 
dynamically use hugepages,
so I have to consider performance issues. If your product does not 
change the amount of huge
pages after booting, using stop_machine() may be a feasible way.
So far, I still haven't come up with a good solution.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
  2024-03-25 15:24                   ` Nanyong Sun
@ 2024-03-26 12:54                     ` Will Deacon
  -1 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-03-26 12:54 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: David Rientjes, Catalin Marinas, Matthew Wilcox, muchun.song,
	Andrew Morton, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed,
	Sourav Panda

On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> On 2024/3/14 7:32, David Rientjes wrote:
> > On Thu, 8 Feb 2024, Will Deacon wrote:
> > > > How about take a new lock with irq disabled during BBM, like:
> > > > 
> > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > > > +{
> > > > +     (NEW_LOCK);
> > > > +    pte_clear(&init_mm, addr, ptep);
> > > > +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > > > +    set_pte_at(&init_mm, addr, ptep, pte);
> > > > +    spin_unlock_irq(NEW_LOCK);
> > > > +}
> > > I really think the only maintainable way to achieve this is to avoid the
> > > possibility of a fault altogether.
> > > 
> > Nanyong, are you still actively working on making HVO possible on arm64?
> > 
> > This would yield a substantial memory savings on hosts that are largely
> > configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> > is actually never changed after boot, but it sounds from the thread that
> > there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> > pursued?
> > 
> > If so, any testing help needed?
> I'm afraid that FEAT_BBM may not solve the problem here, because from Arm
> ARM,
> I see that FEAT_BBM is only used for changing block size. Therefore, in this
> HVO feature,
> it can work in the split PMD stage, that is, BBM can be avoided in
> vmemmap_split_pmd,
> but in the subsequent vmemmap_remap_pte, the Output address of PTE still
> needs to be
> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my
> understanding
> of ARM FEAT_BBM is wrong, and I hope someone can correct me.
> Actually, the solution I first considered was to use the stop_machine
> method, but we have
> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically
> use hugepages,
> so I have to consider performance issues. If your product does not change
> the amount of huge
> pages after booting, using stop_machine() may be a feasible way.
> So far, I still haven't come up with a good solution.

Oh, I hadn't appreciated that you needed to remap the memmap live. How
do you synchronise the two copies in that case? I think we (i.e. the arch
folks) probably need some more explanation on exactly who can race with
what here, otherwise I don't grok how this can work.

Thanks,

Will

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
@ 2024-03-26 12:54                     ` Will Deacon
  0 siblings, 0 replies; 50+ messages in thread
From: Will Deacon @ 2024-03-26 12:54 UTC (permalink / raw)
  To: Nanyong Sun
  Cc: David Rientjes, Catalin Marinas, Matthew Wilcox, muchun.song,
	Andrew Morton, anshuman.khandual, wangkefeng.wang,
	linux-arm-kernel, linux-kernel, linux-mm, Yu Zhao, Yosry Ahmed,
	Sourav Panda

On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> On 2024/3/14 7:32, David Rientjes wrote:
> > On Thu, 8 Feb 2024, Will Deacon wrote:
> > > > How about take a new lock with irq disabled during BBM, like:
> > > > 
> > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > > > +{
> > > > +     (NEW_LOCK);
> > > > +    pte_clear(&init_mm, addr, ptep);
> > > > +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > > > +    set_pte_at(&init_mm, addr, ptep, pte);
> > > > +    spin_unlock_irq(NEW_LOCK);
> > > > +}
> > > I really think the only maintainable way to achieve this is to avoid the
> > > possibility of a fault altogether.
> > > 
> > Nanyong, are you still actively working on making HVO possible on arm64?
> > 
> > This would yield a substantial memory savings on hosts that are largely
> > configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> > is actually never changed after boot, but it sounds from the thread that
> > there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> > pursued?
> > 
> > If so, any testing help needed?
> I'm afraid that FEAT_BBM may not solve the problem here, because from Arm
> ARM,
> I see that FEAT_BBM is only used for changing block size. Therefore, in this
> HVO feature,
> it can work in the split PMD stage, that is, BBM can be avoided in
> vmemmap_split_pmd,
> but in the subsequent vmemmap_remap_pte, the Output address of PTE still
> needs to be
> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my
> understanding
> of ARM FEAT_BBM is wrong, and I hope someone can correct me.
> Actually, the solution I first considered was to use the stop_machine
> method, but we have
> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically
> use hugepages,
> so I have to consider performance issues. If your product does not change
> the amount of huge
> pages after booting, using stop_machine() may be a feasible way.
> So far, I still haven't come up with a good solution.

Oh, I hadn't appreciated that you needed to remap the memmap live. How
do you synchronise the two copies in that case? I think we (i.e. the arch
folks) probably need some more explanation on exactly who can race with
what here, otherwise I don't grok how this can work.

Thanks,

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-03-26 12:54 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-13  9:44 [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize Nanyong Sun
2024-01-13  9:44 ` Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 1/3] mm: HVO: introduce helper function to update and flush pgtable Nanyong Sun
2024-01-13  9:44   ` Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely Nanyong Sun
2024-01-13  9:44   ` Nanyong Sun
2024-01-15  2:38   ` Muchun Song
2024-01-15  2:38     ` Muchun Song
2024-02-07 12:21   ` Mark Rutland
2024-02-07 12:21     ` Mark Rutland
2024-02-08  9:30     ` Nanyong Sun
2024-02-08  9:30       ` Nanyong Sun
2024-01-13  9:44 ` [PATCH v3 3/3] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP Nanyong Sun
2024-01-13  9:44   ` Nanyong Sun
2024-01-25 18:06 ` [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize Catalin Marinas
2024-01-25 18:06   ` Catalin Marinas
2024-01-27  5:04   ` Nanyong Sun
2024-01-27  5:04     ` Nanyong Sun
2024-02-07 11:12     ` Will Deacon
2024-02-07 11:12       ` Will Deacon
2024-02-07 11:21       ` Matthew Wilcox
2024-02-07 11:21         ` Matthew Wilcox
2024-02-07 12:11         ` Will Deacon
2024-02-07 12:11           ` Will Deacon
2024-02-07 12:24           ` Mark Rutland
2024-02-07 12:24             ` Mark Rutland
2024-02-07 14:17           ` Matthew Wilcox
2024-02-07 14:17             ` Matthew Wilcox
2024-02-08  2:24             ` Jane Chu
2024-02-08  2:24               ` Jane Chu
2024-02-08 15:49               ` Matthew Wilcox
2024-02-08 15:49                 ` Matthew Wilcox
2024-02-08 19:21                 ` Jane Chu
2024-02-08 19:21                   ` Jane Chu
2024-02-11 11:59                 ` Muchun Song
2024-02-11 11:59                   ` Muchun Song
2024-02-07 12:20         ` Catalin Marinas
2024-02-07 12:20           ` Catalin Marinas
2024-02-08  9:44           ` Nanyong Sun
2024-02-08  9:44             ` Nanyong Sun
2024-02-08 13:17             ` Will Deacon
2024-02-08 13:17               ` Will Deacon
2024-03-13 23:32               ` David Rientjes
2024-03-13 23:32                 ` David Rientjes
2024-03-25 15:24                 ` Nanyong Sun
2024-03-25 15:24                   ` Nanyong Sun
2024-03-26 12:54                   ` Will Deacon
2024-03-26 12:54                     ` Will Deacon
2024-02-07 12:44     ` Catalin Marinas
2024-02-07 12:44       ` Catalin Marinas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.