linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
@ 2020-11-13 11:19 Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 1/5] mm/gup: Provide gup_get_pte() more generic Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

Hi,

These patches provide generic infrastructure to determine TLB page size from
page table entries alone. Perf will use this (for either data or code address)
to aid in profiling TLB issues.

While most architectures only have page table aligned large pages, some
(notably ARM64, Sparc64 and Power) provide non page table aligned large pages
and need to provide their own implementation of these functions.

I've provided (completely untested) implementations for ARM64 and Sparc64, but
failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
me out there.
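
For reference, a minimal (untested) sketch of the intended consumer side: a
userspace perf_event_open() user requesting the new field. This assumes
headers new enough to carry the PERF_SAMPLE_DATA_PAGE_SIZE bit from commit
8d97e71811aa; each PERF_RECORD_SAMPLE then carries a u64 with the page size
backing the sampled data address.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_data_page_size_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
			   PERF_SAMPLE_DATA_PAGE_SIZE;
	attr.precise_ip = 2;	/* precise samples, for usable data addresses */

	/* measure the calling task, on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}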



* [PATCH 1/5] mm/gup: Provide gup_get_pte() more generic
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
@ 2020-11-13 11:19 ` Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 2/5] mm: Introduce pXX_leaf_size() Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

In order to write another lockless page-table walker, we need
gup_get_pte() exposed. While doing that, rename it to
ptep_get_lockless() to match the existing ptep_get() naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h |   55 +++++++++++++++++++++++++++++++++++++++++++++
 mm/gup.c                |   58 ------------------------------------------------
 2 files changed, 56 insertions(+), 57 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,61 @@ static inline pte_t ptep_get(pte_t *ptep
 }
 #endif
 
+#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
+/*
+ * WARNING: only to be used in the get_user_pages_fast() implementation.
+ *
+ * With get_user_pages_fast(), we walk down the pagetables without taking any
+ * locks.  For this we would like to load the pointers atomically, but sometimes
+ * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
+ * we do have is the guarantee that a PTE will only either go from not present
+ * to present, or present to not present or both -- it will not switch to a
+ * completely different present page without a TLB flush in between; something
+ * that we are blocking by holding interrupts off.
+ *
+ * Setting ptes from not present to present goes:
+ *
+ *   ptep->pte_high = h;
+ *   smp_wmb();
+ *   ptep->pte_low = l;
+ *
+ * And present to not present goes:
+ *
+ *   ptep->pte_low = 0;
+ *   smp_wmb();
+ *   ptep->pte_high = 0;
+ *
+ * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
+ * We load pte_high *after* loading pte_low, which ensures we don't see an older
+ * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
+ * picked up a changed pte high. We might have gotten rubbish values from
+ * pte_low and pte_high, but we are guaranteed that pte_low will not have the
+ * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
+ * operates on present ptes we're safe.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	pte_t pte;
+
+	do {
+		pte.pte_low = ptep->pte_low;
+		smp_rmb();
+		pte.pte_high = ptep->pte_high;
+		smp_rmb();
+	} while (unlikely(pte.pte_low != ptep->pte_low));
+
+	return pte;
+}
+#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+/*
+ * We require that the PTE can be read atomically.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+	return ptep_get(ptep);
+}
+#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,62 +2079,6 @@ static void put_compound_head(struct pag
 	put_page(page);
 }
 
-#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
-
-/*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
- *
- * Setting ptes from not present to present goes:
- *
- *   ptep->pte_high = h;
- *   smp_wmb();
- *   ptep->pte_low = l;
- *
- * And present to not present goes:
- *
- *   ptep->pte_low = 0;
- *   smp_wmb();
- *   ptep->pte_high = 0;
- *
- * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
- * We load pte_high *after* loading pte_low, which ensures we don't see an older
- * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
- * picked up a changed pte high. We might have gotten rubbish values from
- * pte_low and pte_high, but we are guaranteed that pte_low will not have the
- * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
- * operates on present ptes we're safe.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-	pte_t pte;
-
-	do {
-		pte.pte_low = ptep->pte_low;
-		smp_rmb();
-		pte.pte_high = ptep->pte_high;
-		smp_rmb();
-	} while (unlikely(pte.pte_low != ptep->pte_low));
-
-	return pte;
-}
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-/*
- * We require that the PTE can be read atomically.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-	return ptep_get(ptep);
-}
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 					    unsigned int flags,
 					    struct page **pages)
@@ -2160,7 +2104,7 @@ static int gup_pte_range(pmd_t pmd, unsi
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
-		pte_t pte = gup_get_pte(ptep);
+		pte_t pte = ptep_get_lockless(ptep);
 		struct page *head, *page;
 
 		/*




* [PATCH 2/5] mm: Introduce pXX_leaf_size()
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 1/5] mm/gup: Provide gup_get_pte() more generic Peter Zijlstra
@ 2020-11-13 11:19 ` Peter Zijlstra
  2020-11-13 11:45   ` Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 3/5] perf/core: Fix arch_perf_get_page_size() Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

A number of architectures have non-pagetable aligned huge/large pages.
For such architectures a leaf can actually be part of a larger TLB
entry.

Provide generic helpers to determine the TLB size of a page-table
leaf.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1536,4 +1536,20 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf(x)	0
 #endif
 
+#ifndef pgd_leaf_size
+#define pgd_leaf_size(x) PGD_SIZE
+#endif
+#ifndef p4d_leaf_size
+#define p4d_leaf_size(x) P4D_SIZE
+#endif
+#ifndef pud_leaf_size
+#define pud_leaf_size(x) PUD_SIZE
+#endif
+#ifndef pmd_leaf_size
+#define pmd_leaf_size(x) PMD_SIZE
+#endif
+#ifndef pte_leaf_size
+#define pte_leaf_size(x) PAGE_SIZE
+#endif
+
 #endif /* _LINUX_PGTABLE_H */




* [PATCH 3/5] perf/core: Fix arch_perf_get_page_size()
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 1/5] mm/gup: Provide gup_get_pte() more generic Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 2/5] mm: Introduce pXX_leaf_size() Peter Zijlstra
@ 2020-11-13 11:19 ` Peter Zijlstra
  2020-11-16 13:11   ` Liang, Kan
  2020-11-13 11:19 ` [PATCH 4/5] arm64/mm: Implement pXX_leaf_size() support Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically, while it is used in a lockless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP, doesn't use the proper _lockless
offset methods, and isn't careful to read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the TLB
size without looking at the page-frames.

Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/arm64/include/asm/pgtable.h    |    3 +
 arch/sparc/include/asm/pgtable_64.h |   13 ++++
 arch/sparc/mm/hugetlbpage.c         |   19 ++++--
 include/linux/pgtable.h             |   16 +++++
 kernel/events/core.c                |  102 +++++++++++++-----------------------
 5 files changed, 82 insertions(+), 71 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt)
 	return phys_addr;
 }
 
-#ifdef CONFIG_MMU
-
 /*
- * Return the MMU page size of a given virtual address.
- *
- * This generic implementation handles page-table aligned huge pages, as well
- * as non-page-table aligned hugetlbfs compound pages.
- *
- * If an architecture supports and uses non-page-table aligned pages in their
- * kernel mapping it will need to provide it's own implementation of this
- * function.
+ * Return the MMU/TLB page size of a given virtual address.
  */
-__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+static u64 perf_get_tlb_page_size(struct mm_struct *mm, unsigned long addr)
 {
-	struct page *page;
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	u64 size = 0;
 
-	pgd = pgd_offset(mm, addr);
-	if (pgd_none(*pgd))
-		return 0;
+#ifdef CONFIG_HAVE_FAST_GUP
+	pgd_t *pgdp, pgd;
+	p4d_t *p4dp, p4d;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
 
-	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	pgdp = pgd_offset(mm, addr);
+	pgd = READ_ONCE(*pgdp);
+	if (pgd_none(pgd))
 		return 0;
 
-	if (p4d_leaf(*p4d))
-		return 1ULL << P4D_SHIFT;
+	if (pgd_leaf(pgd))
+		return pgd_leaf_size(pgd);
 
-	pud = pud_offset(p4d, addr);
-	if (!pud_present(*pud))
+	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
+	p4d = READ_ONCE(*p4dp);
+	if (!p4d_present(p4d))
 		return 0;
 
-	if (pud_leaf(*pud)) {
-#ifdef pud_page
-		page = pud_page(*pud);
-		if (PageHuge(page))
-			return page_size(compound_head(page));
-#endif
-		return 1ULL << PUD_SHIFT;
-	}
+	if (p4d_leaf(p4d))
+		return p4d_leaf_size(p4d);
 
-	pmd = pmd_offset(pud, addr);
-	if (!pmd_present(*pmd))
+	pudp = pud_offset_lockless(p4dp, p4d, addr);
+	pud = READ_ONCE(*pudp);
+	if (!pud_present(pud))
 		return 0;
 
-	if (pmd_leaf(*pmd)) {
-#ifdef pmd_page
-		page = pmd_page(*pmd);
-		if (PageHuge(page))
-			return page_size(compound_head(page));
-#endif
-		return 1ULL << PMD_SHIFT;
-	}
+	if (pud_leaf(pud))
+		return pud_leaf_size(pud);
 
-	pte = pte_offset_map(pmd, addr);
-	if (!pte_present(*pte)) {
-		pte_unmap(pte);
+	pmdp = pmd_offset_lockless(pudp, pud, addr);
+	pmd = READ_ONCE(*pmdp);
+	if (!pmd_present(pmd))
 		return 0;
-	}
 
-	page = pte_page(*pte);
-	if (PageHuge(page)) {
-		u64 size = page_size(compound_head(page));
-		pte_unmap(pte);
-		return size;
-	}
-
-	pte_unmap(pte);
-	return PAGE_SIZE;
-}
+	if (pmd_leaf(pmd))
+		return pmd_leaf_size(pmd);
 
-#else
+	ptep = pte_offset_map(&pmd, addr);
+	pte = ptep_get_lockless(ptep);
+	if (pte_present(pte))
+		size = pte_leaf_size(pte);
+	pte_unmap(ptep);
+#endif /* CONFIG_HAVE_FAST_GUP */
 
-static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
-{
-	return 0;
+	return size;
 }
 
-#endif
-
 static u64 perf_get_page_size(unsigned long addr)
 {
 	struct mm_struct *mm;
@@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l
 		mm = &init_mm;
 	}
 
-	size = arch_perf_get_page_size(mm, addr);
+	size = perf_get_tlb_page_size(mm, addr);
 
 	local_irq_restore(flags);
 




* [PATCH 4/5] arm64/mm: Implement pXX_leaf_size() support
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-11-13 11:19 ` [PATCH 3/5] perf/core: Fix arch_perf_get_page_size() Peter Zijlstra
@ 2020-11-13 11:19 ` Peter Zijlstra
  2020-11-13 11:19 ` [PATCH 5/5] sparc64/mm: " Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

ARM64 has non-pagetable-aligned large page support with PTE_CONT; when
this bit is set, the page is part of a super-page. Match the hugetlb
code and support these super pages for PTE and PMD levels.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
page sizes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/arm64/include/asm/pgtable.h |    3 +++
 1 file changed, 3 insertions(+)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -503,6 +503,9 @@ extern pgprot_t phys_mem_access_prot(str
 				 PMD_TYPE_SECT)
 #define pmd_leaf(pmd)		pmd_sect(pmd)
 
+#define pmd_leaf_size(pmd)	(pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
+#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }




* [PATCH 5/5] sparc64/mm: Implement pXX_leaf_size() support
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-11-13 11:19 ` [PATCH 4/5] arm64/mm: Implement pXX_leaf_size() support Peter Zijlstra
@ 2020-11-13 11:19 ` Peter Zijlstra
  2020-11-13 13:44 ` [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Christophe Leroy
  2020-11-16 15:43 ` Kirill A. Shutemov
  6 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:19 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov, peterz

Sparc64 has non-pagetable aligned large page support; wire up the
pXX_leaf_size() functions to report the correct TLB page size.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
page sizes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/sparc/include/asm/pgtable_64.h |   13 +++++++++++++
 arch/sparc/mm/hugetlbpage.c         |   19 +++++++++++++------
 2 files changed, 26 insertions(+), 6 deletions(-)

--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1121,6 +1121,19 @@ extern unsigned long cmdline_memory_size
 
 asmlinkage void do_sparc64_fault(struct pt_regs *regs);
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+#define pud_leaf_size pud_leaf_size
+extern unsigned long pud_leaf_size(pud_t pud);
+
+#define pmd_leaf_size pmd_leaf_size
+extern unsigned long pmd_leaf_size(pmd_t pmd);
+
+#define pte_leaf_size pte_leaf_size
+extern unsigned long pte_leaf_size(pte_t pte);
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #endif /* !(__ASSEMBLY__) */
 
 #endif /* !(_SPARC64_PGTABLE_H) */
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -247,14 +247,17 @@ static unsigned int sun4u_huge_tte_to_sh
 	return shift;
 }
 
-static unsigned int huge_tte_to_shift(pte_t entry)
+static unsigned long tte_to_shift(pte_t entry)
 {
-	unsigned long shift;
-
 	if (tlb_type == hypervisor)
-		shift = sun4v_huge_tte_to_shift(entry);
-	else
-		shift = sun4u_huge_tte_to_shift(entry);
+		return sun4v_huge_tte_to_shift(entry);
+
+	return sun4u_huge_tte_to_shift(entry);
+}
+
+static unsigned int huge_tte_to_shift(pte_t entry)
+{
+	unsigned long shift = tte_to_shift(entry);
 
 	if (shift == PAGE_SHIFT)
 		WARN_ONCE(1, "tto_to_shift: invalid hugepage tte=0x%lx\n",
@@ -272,6 +275,10 @@ static unsigned long huge_tte_to_size(pt
 	return size;
 }
 
+unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift((pte_t)pud); }
+unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift((pte_t)pmd); }
+unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift((pte_t)pte); }
+
 pte_t *huge_pte_alloc(struct mm_struct *mm,
 			unsigned long addr, unsigned long sz)
 {




* Re: [PATCH 2/5] mm: Introduce pXX_leaf_size()
  2020-11-13 11:19 ` [PATCH 2/5] mm: Introduce pXX_leaf_size() Peter Zijlstra
@ 2020-11-13 11:45   ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-13 11:45 UTC (permalink / raw)
  To: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov

On Fri, Nov 13, 2020 at 12:19:03PM +0100, Peter Zijlstra wrote:
> A number of architectures have non-pagetable aligned huge/large pages.
> For such architectures a leaf can actually be part of a larger TLB
> entry.
> 
> Provide generic helpers to determine the TLB size of a page-table
> leaf.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/pgtable.h |   16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1536,4 +1536,20 @@ typedef unsigned int pgtbl_mod_mask;
>  #define pmd_leaf(x)	0
>  #endif
>  
> +#ifndef pgd_leaf_size
> +#define pgd_leaf_size(x) PGD_SIZE

Argh, I lost a refresh, that should've been:

+#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)


> +#endif
> +#ifndef p4d_leaf_size
> +#define p4d_leaf_size(x) P4D_SIZE
> +#endif
> +#ifndef pud_leaf_size
> +#define pud_leaf_size(x) PUD_SIZE
> +#endif
> +#ifndef pmd_leaf_size
> +#define pmd_leaf_size(x) PMD_SIZE
> +#endif
> +#ifndef pte_leaf_size
> +#define pte_leaf_size(x) PAGE_SIZE
> +#endif
> +
>  #endif /* _LINUX_PGTABLE_H */
> 
> 


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-11-13 11:19 ` [PATCH 5/5] sparc64/mm: " Peter Zijlstra
@ 2020-11-13 13:44 ` Christophe Leroy
  2020-11-20 11:18   ` Christophe Leroy
  2020-11-16 15:43 ` Kirill A. Shutemov
  6 siblings, 1 reply; 19+ messages in thread
From: Christophe Leroy @ 2020-11-13 13:44 UTC (permalink / raw)
  To: Peter Zijlstra, kan.liang, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, eranian
  Cc: npiggin, linuxppc-dev, mpe, will, willy, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	dave.hansen, kirill.shutemov

Hi

On 13/11/2020 at 12:19, Peter Zijlstra wrote:
> Hi,
> 
> These patches provide generic infrastructure to determine TLB page size from
> page table entries alone. Perf will use this (for either data or code address)
> to aid in profiling TLB issues.
> 
> While most architectures only have page table aligned large pages, some
> (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
> and need to provide their own implementation of these functions.
> 
> I've provided (completely untested) implementations for ARM64 and Sparc64, but
> failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
> me out there.
> 

I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024 entries, which means each
entry maps 4M.

Page sizes are 4k, 16k, 512k and 8M.

For the 8M pages we use hugepd with a single entry. The two related PGD entries point to the same 
hugepd.

For the other sizes, they are in standard page tables. 16k pages appear 4 times in the page table. 
512k entries appear 128 times in the page table.

When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a hugepd which holds the single 8M
entry.

In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE

_PAGE_HUGE means it is a 512k page
_PAGE_SPS means it is not a 4k page

The kernel can be built either with 4k pages as the standard page size, or 16k pages. It doesn't change
the page table layout though.

Hope this is clear. Now I don't really know how to wire that up to your series.

Christophe
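
Translating that description directly into the new pXX_leaf_size() hooks
would give something like the (untested) sketch below, with _PMD_PAGE_8M,
_PAGE_HUGE and _PAGE_SPS as described above and the SZ_* constants from
linux/sizes.h:

static inline unsigned long pgd_leaf_size(pgd_t pgd)
{
	/* an entry with _PMD_PAGE_8M maps 8M; otherwise the 4M it covers */
	return (pgd_val(pgd) & _PMD_PAGE_8M) ? SZ_8M : SZ_4M;
}
#define pgd_leaf_size pgd_leaf_size

static inline unsigned long pte_leaf_size(pte_t pte)
{
	pte_basic_t val = pte_val(pte);

	if (val & _PAGE_HUGE)	/* a 512k page */
		return SZ_512K;
	if (val & _PAGE_SPS)	/* not a 4k page, so 16k */
		return SZ_16K;
	return SZ_4K;
}
#define pte_leaf_size pte_leaf_size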


* Re: [PATCH 3/5] perf/core: Fix arch_perf_get_page_size()
  2020-11-13 11:19 ` [PATCH 3/5] perf/core: Fix arch_perf_get_page_size() Peter Zijlstra
@ 2020-11-16 13:11   ` Liang, Kan
  0 siblings, 0 replies; 19+ messages in thread
From: Liang, Kan @ 2020-11-16 13:11 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, acme, mark.rutland, alexander.shishkin,
	jolsa, eranian
  Cc: christophe.leroy, npiggin, linuxppc-dev, mpe, will, willy,
	aneesh.kumar, sparclinux, davem, catalin.marinas, linux-arch,
	linux-kernel, ak, dave.hansen, kirill.shutemov



On 11/13/2020 6:19 AM, Peter Zijlstra wrote:
> The (new) page-table walker in arch_perf_get_page_size() is broken in
> various ways. Specifically, while it is used in a lockless manner, it
> doesn't depend on CONFIG_HAVE_FAST_GUP, doesn't use the proper _lockless
> offset methods, and isn't careful to read each entry only once.
> 
> Also the hugetlb support is broken due to calling pte_page() without
> first checking pte_special().
> 
> Rewrite the whole thing to be a proper lockless page-table walker and
> employ the new pXX_leaf_size() pgtable functions to determine the TLB
> size without looking at the page-frames.
> 
> Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
> Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")

The issue 
(https://lkml.kernel.org/r/8e88ba79-7c40-ea32-a7ed-bdc4fc04b2af@linux.intel.com) 
has been fixed by this patch set.

Tested-by: Kan Liang <kan.liang@linux.intel.com>


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   arch/arm64/include/asm/pgtable.h    |    3 +
>   arch/sparc/include/asm/pgtable_64.h |   13 ++++
>   arch/sparc/mm/hugetlbpage.c         |   19 ++++--
>   include/linux/pgtable.h             |   16 +++++
>   kernel/events/core.c                |  102 +++++++++++++-----------------------
>   5 files changed, 82 insertions(+), 71 deletions(-)
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt)
>   	return phys_addr;
>   }
>   
> -#ifdef CONFIG_MMU
> -
>   /*
> - * Return the MMU page size of a given virtual address.
> - *
> - * This generic implementation handles page-table aligned huge pages, as well
> - * as non-page-table aligned hugetlbfs compound pages.
> - *
> - * If an architecture supports and uses non-page-table aligned pages in their
> - * kernel mapping it will need to provide it's own implementation of this
> - * function.
> + * Return the MMU/TLB page size of a given virtual address.
>    */
> -__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
> +static u64 perf_get_tlb_page_size(struct mm_struct *mm, unsigned long addr)
>   {
> -	struct page *page;
> -	pgd_t *pgd;
> -	p4d_t *p4d;
> -	pud_t *pud;
> -	pmd_t *pmd;
> -	pte_t *pte;
> +	u64 size = 0;
>   
> -	pgd = pgd_offset(mm, addr);
> -	if (pgd_none(*pgd))
> -		return 0;
> +#ifdef CONFIG_HAVE_FAST_GUP
> +	pgd_t *pgdp, pgd;
> +	p4d_t *p4dp, p4d;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
>   
> -	p4d = p4d_offset(pgd, addr);
> -	if (!p4d_present(*p4d))
> +	pgdp = pgd_offset(mm, addr);
> +	pgd = READ_ONCE(*pgdp);
> +	if (pgd_none(pgd))
>   		return 0;
>   
> -	if (p4d_leaf(*p4d))
> -		return 1ULL << P4D_SHIFT;
> +	if (pgd_leaf(pgd))
> +		return pgd_leaf_size(pgd);
>   
> -	pud = pud_offset(p4d, addr);
> -	if (!pud_present(*pud))
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
> +	p4d = READ_ONCE(*p4dp);
> +	if (!p4d_present(p4d))
>   		return 0;
>   
> -	if (pud_leaf(*pud)) {
> -#ifdef pud_page
> -		page = pud_page(*pud);
> -		if (PageHuge(page))
> -			return page_size(compound_head(page));
> -#endif
> -		return 1ULL << PUD_SHIFT;
> -	}
> +	if (p4d_leaf(p4d))
> +		return p4d_leaf_size(p4d);
>   
> -	pmd = pmd_offset(pud, addr);
> -	if (!pmd_present(*pmd))
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
> +	pud = READ_ONCE(*pudp);
> +	if (!pud_present(pud))
>   		return 0;
>   
> -	if (pmd_leaf(*pmd)) {
> -#ifdef pmd_page
> -		page = pmd_page(*pmd);
> -		if (PageHuge(page))
> -			return page_size(compound_head(page));
> -#endif
> -		return 1ULL << PMD_SHIFT;
> -	}
> +	if (pud_leaf(pud))
> +		return pud_leaf_size(pud);
>   
> -	pte = pte_offset_map(pmd, addr);
> -	if (!pte_present(*pte)) {
> -		pte_unmap(pte);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
> +	pmd = READ_ONCE(*pmdp);
> +	if (!pmd_present(pmd))
>   		return 0;
> -	}
>   
> -	page = pte_page(*pte);
> -	if (PageHuge(page)) {
> -		u64 size = page_size(compound_head(page));
> -		pte_unmap(pte);
> -		return size;
> -	}
> -
> -	pte_unmap(pte);
> -	return PAGE_SIZE;
> -}
> +	if (pmd_leaf(pmd))
> +		return pmd_leaf_size(pmd);
>   
> -#else
> +	ptep = pte_offset_map(&pmd, addr);
> +	pte = ptep_get_lockless(ptep);
> +	if (pte_present(pte))
> +		size = pte_leaf_size(pte);
> +	pte_unmap(ptep);
> +#endif /* CONFIG_HAVE_FAST_GUP */
>   
> -static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
> -{
> -	return 0;
> +	return size;
>   }
>   
> -#endif
> -
>   static u64 perf_get_page_size(unsigned long addr)
>   {
>   	struct mm_struct *mm;
> @@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l
>   		mm = &init_mm;
>   	}
>   
> -	size = arch_perf_get_page_size(mm, addr);
> +	size = perf_get_tlb_page_size(mm, addr);
>   
>   	local_irq_restore(flags);
>   
> 
> 


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-13 11:19 [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Peter Zijlstra
                   ` (5 preceding siblings ...)
  2020-11-13 13:44 ` [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Christophe Leroy
@ 2020-11-16 15:43 ` Kirill A. Shutemov
  2020-11-16 15:54   ` Matthew Wilcox
  6 siblings, 1 reply; 19+ messages in thread
From: Kirill A. Shutemov @ 2020-11-16 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	eranian, christophe.leroy, npiggin, linuxppc-dev, mpe, will,
	willy, aneesh.kumar, sparclinux, davem, catalin.marinas,
	linux-arch, linux-kernel, ak, dave.hansen, kirill.shutemov

On Fri, Nov 13, 2020 at 12:19:01PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches provide generic infrastructure to determine TLB page size from
> page table entries alone. Perf will use this (for either data or code address)
> to aid in profiling TLB issues.

I'm not sure it's an issue, but strictly speaking, the size of a page
according to the page table tree doesn't mean the pagewalk would fill a
TLB entry of that size. A CPU may support 1G pages in the page table
tree without any 1G TLB at all.

IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill
2M iTLB entries instead.

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 15:43 ` Kirill A. Shutemov
@ 2020-11-16 15:54   ` Matthew Wilcox
  2020-11-16 16:28     ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2020-11-16 15:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Zijlstra, kan.liang, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, eranian, christophe.leroy, npiggin,
	linuxppc-dev, mpe, will, aneesh.kumar, sparclinux, davem,
	catalin.marinas, linux-arch, linux-kernel, ak, dave.hansen,
	kirill.shutemov

On Mon, Nov 16, 2020 at 06:43:57PM +0300, Kirill A. Shutemov wrote:
> On Fri, Nov 13, 2020 at 12:19:01PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches provide generic infrastructure to determine TLB page size from
> > page table entries alone. Perf will use this (for either data or code address)
> > to aid in profiling TLB issues.
> 
> I'm not sure it's an issue, but strictly speaking, the size of a page
> according to the page table tree doesn't mean the pagewalk would fill a
> TLB entry of that size. A CPU may support 1G pages in the page table
> tree without any 1G TLB at all.
> 
> IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill
> 2M iTLB entries instead.

It gets even more complicated with CPUs with multiple levels of TLB
which support different TLB entry sizes.  My CPU reports:

TLB info
 Instruction TLB: 2M/4M pages, fully associative, 8 entries
 Instruction TLB: 4K pages, 8-way associative, 64 entries
 Data TLB: 1GB pages, 4-way set associative, 4 entries
 Data TLB: 4KB pages, 4-way associative, 64 entries
 Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries

I'm not quite sure what the rules are for evicting a 1GB entry in the
dTLB into the s2TLB.  I've read them for so many different processors,
I get quite confused.  Some CPUs fracture them; others ditch them entirely
and will look them up again if needed.

I think the architecture here is fine, but it'll need a little bit of
finagling to maybe pass i-vs-d to the pXd_leaf_size() routines, and x86
will need an implementation of pud_leaf_size() which interrogates the
TLB info to find out what size TLB entry will actually be used.
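
Concretely, such an x86 override might look like the (untested) sketch
below; it ignores the i-vs-d wrinkle, and cpu_has_1g_dtlb() is a
hypothetical helper that would consult the decoded CPUID TLB descriptors:

static inline unsigned long pud_leaf_size(pud_t pud)
{
	/*
	 * Without 1G dTLB entries a 1G mapping is looked up via 2M
	 * entries, so report that size instead.
	 */
	return cpu_has_1g_dtlb() ? PUD_SIZE : PMD_SIZE;
}
#define pud_leaf_size pud_leaf_size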


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 15:54   ` Matthew Wilcox
@ 2020-11-16 16:28     ` Dave Hansen
  2020-11-16 16:32       ` Matthew Wilcox
  2020-11-16 16:55       ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Hansen @ 2020-11-16 16:28 UTC (permalink / raw)
  To: Matthew Wilcox, Kirill A. Shutemov
  Cc: Peter Zijlstra, kan.liang, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, eranian, christophe.leroy, npiggin,
	linuxppc-dev, mpe, will, aneesh.kumar, sparclinux, davem,
	catalin.marinas, linux-arch, linux-kernel, ak, kirill.shutemov

On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> It gets even more complicated with CPUs with multiple levels of TLB
> which support different TLB entry sizes.  My CPU reports:
> 
> TLB info
>  Instruction TLB: 2M/4M pages, fully associative, 8 entries
>  Instruction TLB: 4K pages, 8-way associative, 64 entries
>  Data TLB: 1GB pages, 4-way set associative, 4 entries
>  Data TLB: 4KB pages, 4-way associative, 64 entries
>  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries

It's even "worse" on recent AMD systems.  Those will coalesce multiple
adjacent PTEs into a single TLB entry.  I think Alphas did something
like this back in the day with an opt-in.

Anyway, the changelog should probably replace:

> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> page sizes.

with something more like:

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
table mapping sizes.

That's really the best we can do from software without digging into
microarchitecture-specific events.


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 16:28     ` Dave Hansen
@ 2020-11-16 16:32       ` Matthew Wilcox
  2020-11-16 16:36         ` Dave Hansen
  2020-11-16 16:55       ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2020-11-16 16:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Peter Zijlstra, kan.liang, mingo, acme,
	mark.rutland, alexander.shishkin, jolsa, eranian,
	christophe.leroy, npiggin, linuxppc-dev, mpe, will, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	kirill.shutemov

On Mon, Nov 16, 2020 at 08:28:23AM -0800, Dave Hansen wrote:
> On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> > It gets even more complicated with CPUs with multiple levels of TLB
> > which support different TLB entry sizes.  My CPU reports:
> > 
> > TLB info
> >  Instruction TLB: 2M/4M pages, fully associative, 8 entries
> >  Instruction TLB: 4K pages, 8-way associative, 64 entries
> >  Data TLB: 1GB pages, 4-way set associative, 4 entries
> >  Data TLB: 4KB pages, 4-way associative, 64 entries
> >  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
> 
> It's even "worse" on recent AMD systems.  Those will coalesce multiple
> adjacent PTEs into a single TLB entry.  I think Alphas did something
> like this back in the day with an opt-in.

I debated mentioning that ;-)  We can detect in software whether that's
_possible_, but we can't detect whether it's *done* it.  I heard it
sometimes takes several faults on the 4kB entries for the CPU to decide
that it's beneficial to use a 32kB TLB entry.  But this is all rumour.

> Anyway, the changelog should probably replace:
> 
> > This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> > page sizes.
> 
> with something more like:
> 
> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
> table mapping sizes.
> 
> That's really the best we can do from software without digging into
> microarchitecture-specific events.

I mean this is perf.  Digging into microarch specific events is what it
does ;-)


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 16:32       ` Matthew Wilcox
@ 2020-11-16 16:36         ` Dave Hansen
  2020-11-16 16:57           ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2020-11-16 16:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Peter Zijlstra, kan.liang, mingo, acme,
	mark.rutland, alexander.shishkin, jolsa, eranian,
	christophe.leroy, npiggin, linuxppc-dev, mpe, will, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	kirill.shutemov

On 11/16/20 8:32 AM, Matthew Wilcox wrote:
>>
>> That's really the best we can do from software without digging into
>> microarchitecture-specific events.
> I mean this is perf.  Digging into microarch specific events is what it
> does ;-)

Yeah, totally.

But, if we see a bunch of 4k TLB hit events, it's still handy to know
that those 4k TLB hits originated from a 2M page table entry.  This
series just makes sure that perf has the data about the page table
mapping sizes regardless of what the microarchitecture does with it.

I'm just saying we need to make the descriptions in this perf feature
specifically about the page tables, not the TLB.


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 16:28     ` Dave Hansen
  2020-11-16 16:32       ` Matthew Wilcox
@ 2020-11-16 16:55       ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-16 16:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Matthew Wilcox, Kirill A. Shutemov, kan.liang, mingo, acme,
	mark.rutland, alexander.shishkin, jolsa, eranian,
	christophe.leroy, npiggin, linuxppc-dev, mpe, will, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	kirill.shutemov

On Mon, Nov 16, 2020 at 08:28:23AM -0800, Dave Hansen wrote:
> On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> > It gets even more complicated with CPUs with multiple levels of TLB
> > which support different TLB entry sizes.  My CPU reports:
> > 
> > TLB info
> >  Instruction TLB: 2M/4M pages, fully associative, 8 entries
> >  Instruction TLB: 4K pages, 8-way associative, 64 entries
> >  Data TLB: 1GB pages, 4-way set associative, 4 entries
> >  Data TLB: 4KB pages, 4-way associative, 64 entries
> >  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
> 
> It's even "worse" on recent AMD systems.  Those will coalesce multiple
> adjacent PTEs into a single TLB entry.  I think Alphas did something
> like this back in the day with an opt-in.
> 
> Anyway, the changelog should probably replace:

ARM64 does too.

> > This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> > page sizes.
> 
> with something more like:
> 
> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
> table mapping sizes.

Sure.


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-16 16:36         ` Dave Hansen
@ 2020-11-16 16:57           ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-16 16:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Matthew Wilcox, Kirill A. Shutemov, kan.liang, mingo, acme,
	mark.rutland, alexander.shishkin, jolsa, eranian,
	christophe.leroy, npiggin, linuxppc-dev, mpe, will, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	kirill.shutemov

On Mon, Nov 16, 2020 at 08:36:36AM -0800, Dave Hansen wrote:
> On 11/16/20 8:32 AM, Matthew Wilcox wrote:
> >>
> >> That's really the best we can do from software without digging into
> >> microarchitecture-specific events.
> > I mean this is perf.  Digging into microarch specific events is what it
> > does ;-)
> 
> Yeah, totally.

Sure, but the automatic promotion/demotion of TLB sizes is not visible
if you don't know what you started out with.

> But, if we see a bunch of 4k TLB hit events, it's still handy to know
> that those 4k TLB hits originated from a 2M page table entry.  This
> series just makes sure that perf has the data about the page table
> mapping sizes regardless of what the microarchitecture does with it.

This. 


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-13 13:44 ` [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE Christophe Leroy
@ 2020-11-20 11:18   ` Christophe Leroy
  2020-11-20 12:20     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Christophe Leroy @ 2020-11-20 11:18 UTC (permalink / raw)
  To: Peter Zijlstra, kan.liang, mingo, acme, mark.rutland,
	alexander.shishkin, jolsa, eranian
  Cc: npiggin, linuxppc-dev, mpe, will, willy, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	dave.hansen, kirill.shutemov

Hi Peter,

On 13/11/2020 at 14:44, Christophe Leroy wrote:
> Hi
> 
> On 13/11/2020 at 12:19, Peter Zijlstra wrote:
>> Hi,
>>
>> These patches provide generic infrastructure to determine TLB page size from
>> page table entries alone. Perf will use this (for either data or code address)
>> to aid in profiling TLB issues.
>>
>> While most architectures only have page table aligned large pages, some
>> (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
>> and need to provide their own implementation of these functions.
>>
>> I've provided (completely untested) implementations for ARM64 and Sparc64, but
>> failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
>> me out there.
>>
> 
> I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024 entries, which means each
> entry maps 4M.
> 
> Page sizes are 4k, 16k, 512k and 8M.
> 
> For the 8M pages we use hugepd with a single entry. The two related PGD entries point to the same 
> hugepd.
> 
> For the other sizes, they are in standard page tables. 16k pages appear 4 times in the page table. 
> 512k entries appear 128 times in the page table.
> 
> When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a hugepd which holds the single 8M
> entry.
> 
> In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> 
> _PAGE_HUGE means it is a 512k page
> _PAGE_SPS means it is not a 4k page
> 
> The kernel can be built either with 4k pages as the standard page size, or 16k pages. It doesn't change
> the page table layout though.
> 
> Hope this is clear. Now I don't really know how to wire that up to your series.

Does my description make sense? Is there anything I can help with?

Christophe


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-20 11:18   ` Christophe Leroy
@ 2020-11-20 12:20     ` Peter Zijlstra
  2020-11-26 10:46       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-20 12:20 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	eranian, npiggin, linuxppc-dev, mpe, will, willy, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	dave.hansen, kirill.shutemov

On Fri, Nov 20, 2020 at 12:18:22PM +0100, Christophe Leroy wrote:
> Hi Peter,
> 
> On 13/11/2020 at 14:44, Christophe Leroy wrote:
> > Hi
> > 
> > On 13/11/2020 at 12:19, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > These patches provide generic infrastructure to determine TLB page size from
> > > page table entries alone. Perf will use this (for either data or code address)
> > > to aid in profiling TLB issues.
> > > 
> > > While most architectures only have page table aligned large pages, some
> > > (notably ARM64, Sparc64 and Power) provide non page table aligned large pages
> > > and need to provide their own implementation of these functions.
> > > 
> > > I've provided (completely untested) implementations for ARM64 and Sparc64, but
> > > failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
> > > me out there.
> > > 
> > 
> > I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024
> > entries, which means each entry maps 4M.
> > 
> > Page sizes are 4k, 16k, 512k and 8M.
> > 
> > For the 8M pages we use hugepd with a single entry. The two related PGD
> > entries point to the same hugepd.
> > 
> > For the other sizes, they are in standard page tables. 16k pages appear
> > 4 times in the page table. 512k entries appear 128 times in the page
> > table.
> > 
> > When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> > hugepd which holds the single 8M entry.
> > 
> > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > 
> > _PAGE_HUGE means it is a 512k page
> > _PAGE_SPS means it is not a 4k page
> > 
> > The kernel can be built either with 4k pages as the standard page size, or
> > 16k pages. It doesn't change the page table layout though.
> > 
> > Hope this is clear. Now I don't really know how to wire that up to your series.
> 
> Does my description make sense ? Is there anything I can help with ?

It did, and I had vague memories from when we fixed that pgd_t issue.
I've just not had time to dig through the powerpc code yet to find the
right mmu header to stick it in.

I was meaning to get another version of these patches posted this week,
but time keeps slipping away; I'll try.


* Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE
  2020-11-20 12:20     ` Peter Zijlstra
@ 2020-11-26 10:46       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2020-11-26 10:46 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: kan.liang, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	eranian, npiggin, linuxppc-dev, mpe, will, willy, aneesh.kumar,
	sparclinux, davem, catalin.marinas, linux-arch, linux-kernel, ak,
	dave.hansen, kirill.shutemov

On Fri, Nov 20, 2020 at 01:20:04PM +0100, Peter Zijlstra wrote:

> > > I can help with powerpc 8xx. It is a 32-bit powerpc. The PGD has 1024
> > > entries, which means each entry maps 4M.
> > > 
> > > Page sizes are 4k, 16k, 512k and 8M.
> > > 
> > > For the 8M pages we use hugepd with a single entry. The two related PGD
> > > entries point to the same hugepd.
> > > 
> > > For the other sizes, they are in standard page tables. 16k pages appear
> > > 4 times in the page table. 512k entries appear 128 times in the page
> > > table.
> > > 
> > > When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> > > hugepd which holds the single 8M entry.
> > > 
> > > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > > 
> > > _PAGE_HUGE means it is a 512k page
> > > _PAGE_SPS means it is not a 4k page
> > > 
> > > The kernel can be built either with 4k pages as the standard page size, or
> > > 16k pages. It doesn't change the page table layout though.
> > > 
> > > Hope this is clear. Now I don't really know how to wire that up to your series.

Does the below accurately reflect things?

Let me go find a suitable cross-compiler ..

diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1581204467e1..fcc48d590d88 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -135,6 +135,29 @@ static inline pte_t pte_mkhuge(pte_t pte)
 }
 
 #define pte_mkhuge pte_mkhuge
+
+static inline unsigned long pgd_leaf_size(pgd_t pgd)
+{
+	if (pgd_val(pgd) & _PMD_PAGE_8M)
+		return SZ_8M;
+	return SZ_4M;
+}
+
+#define pgd_leaf_size pgd_leaf_size
+
+static inline unsigned long pte_leaf_size(pte_t pte)
+{
+	pte_basic_t val = pte_val(pte);
+
+	if (val & _PAGE_HUGE)
+		return SZ_512K;
+	if (val & _PAGE_SPS)
+		return SZ_16K;
+	return SZ_4K;
+}
+
+#define pte_leaf_size pte_leaf_size
+
 #endif
 
 #endif /* __KERNEL__ */

