* [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64
@ 2023-06-06  4:55 Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting Aneesh Kumar K.V
                   ` (16 more replies)
  0 siblings, 17 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

This patch series implements the changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
translation and 1GB PUD mapping with 64K page size. The patch series also splits
the hugetlb vmemmap optimization out as a separate Kconfig variable so that
architectures can enable DAX vmemmap optimization without enabling hugetlb
vmemmap optimization. This should allow architectures like arm64, which can't
enable hugetlb vmemmap optimization, to still enable DAX vmemmap optimization.
More details are in the patch "mm/vmemmap optimization: Split hugetlb and
devdax vmemmap optimization".
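
For context, the split introduced by patch 12 replaces the single
ARCH_WANT_OPTIMIZE_VMEMMAP option with two (from mm/Kconfig in that patch),
so an architecture selects only the piece it can support; x86_64, for
example, selects both:

config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
	bool

config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
	bool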

Aneesh Kumar K.V (16):
  powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
  powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
  powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
  powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can
    cross-boundary
  mm/hugepage pud: Allow arch-specific helper function to check huge
    page pud support
  mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
    override
  mm/vmemmap: Allow architectures to override how vmemmap optimization
    works
  mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
  mm/huge pud: Use transparent huge pud helpers only with
    CONFIG_TRANSPARENT_HUGEPAGE
  mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  powerpc/book3s64/mm: Enable transparent pud hugepage
  powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
    handling function
  powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  powerpc/book3s64/radix: Remove mmu_vmemmap_psize

 Documentation/mm/vmemmap_dedup.rst            |   1 +
 Documentation/powerpc/vmemmap_dedup.rst       | 101 ++++
 arch/loongarch/Kconfig                        |   2 +-
 arch/powerpc/Kconfig                          |   1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 ++++-
 arch/powerpc/include/asm/book3s/64/radix.h    |  47 ++
 .../include/asm/book3s/64/tlbflush-radix.h    |   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/pgtable.h            |   3 +
 arch/powerpc/mm/book3s64/pgtable.c            |  78 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c      | 551 ++++++++++++++++--
 arch/powerpc/mm/book3s64/radix_tlb.c          |   7 +
 arch/powerpc/mm/init_64.c                     |  39 +-
 arch/powerpc/platforms/Kconfig.cputype        |   1 +
 arch/riscv/Kconfig                            |   2 +-
 arch/x86/Kconfig                              |   3 +-
 drivers/nvdimm/pfn_devs.c                     |   2 +-
 fs/Kconfig                                    |   2 +-
 include/linux/mm.h                            |  32 +-
 include/linux/pgtable.h                       |  11 +-
 include/trace/events/thp.h                    |  17 +
 mm/Kconfig                                    |   5 +-
 mm/debug_vm_pgtable.c                         |   2 +-
 mm/huge_memory.c                              |   2 +-
 mm/mm_init.c                                  |   2 +-
 mm/mremap.c                                   |   2 +-
 mm/sparse-vmemmap.c                           |   3 +
 27 files changed, 1005 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

-- 
2.40.1



* [PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix Aneesh Kumar K.V
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2297aa764ecd..5f8c6fbe8a69 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -952,7 +952,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
-	old = radix__pte_update(mm, addr, (pte_t *)pmdp, clr, set, 1);
+	old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1);
 	trace_hugepage_update(addr, old, clr, set);
 
 	return old;
-- 
2.40.1



* [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-21  4:08     ` Michael Ellerman
  2023-06-06  4:55 ` [PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo Aneesh Kumar K.V
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

This should not be within CONFIG_PPC_64S_HASH_MMU. We use mmu_vmemmap_psize
on radix while mapping the vmemmap area.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5f8c6fbe8a69..570add33c02d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,7 +594,6 @@ void __init radix__early_init_mmu(void)
 {
 	unsigned long lpcr;
 
-#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
 	/* PAGE_SIZE mappings */
 	mmu_virtual_psize = MMU_PAGE_64K;
@@ -611,7 +610,6 @@ void __init radix__early_init_mmu(void)
 		mmu_vmemmap_psize = MMU_PAGE_2M;
 	} else
 		mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 #endif
 	/*
 	 * initialize page table size
-- 
2.40.1



* [PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding Aneesh Kumar K.V
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

On memory unplug, reduce the DirectMap page count correctly.

Initial state:
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:           0 kB
DirectMap64k:           0 kB
DirectMap2M:    115343360 kB
DirectMap1G:           0 kB

Before fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:           0 kB
DirectMap64k:           0 kB
DirectMap2M:    115343360 kB
DirectMap1G:           0 kB

After fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:           0 kB
DirectMap64k:           0 kB
DirectMap2M:    104857600 kB
DirectMap1G:           0 kB

Fixes: a2dc009afa9a ("powerpc/mm/book3s/radix: Add mapping statistics")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 34 +++++++++++++++---------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 570add33c02d..15a099e53cde 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -743,9 +743,9 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 }
 
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-			     unsigned long end)
+			     unsigned long end, bool direct)
 {
-	unsigned long next;
+	unsigned long next, pages = 0;
 	pte_t *pte;
 
 	pte = pte_start + pte_index(addr);
@@ -767,13 +767,16 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 		}
 
 		pte_clear(&init_mm, addr, pte);
+		pages++;
 	}
+	if (direct)
+		update_page_count(mmu_virtual_psize, -pages);
 }
 
 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-			     unsigned long end)
+				       unsigned long end, bool direct)
 {
-	unsigned long next;
+	unsigned long next, pages = 0;
 	pte_t *pte_base;
 	pmd_t *pmd;
 
@@ -791,19 +794,22 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 				continue;
 			}
 			pte_clear(&init_mm, addr, (pte_t *)pmd);
+			pages++;
 			continue;
 		}
 
 		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-		remove_pte_table(pte_base, addr, next);
+		remove_pte_table(pte_base, addr, next, direct);
 		free_pte_table(pte_base, pmd);
 	}
+	if (direct)
+		update_page_count(MMU_PAGE_2M, -pages);
 }
 
 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
-			     unsigned long end)
+				       unsigned long end, bool direct)
 {
-	unsigned long next;
+	unsigned long next, pages = 0;
 	pmd_t *pmd_base;
 	pud_t *pud;
 
@@ -821,16 +827,20 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 				continue;
 			}
 			pte_clear(&init_mm, addr, (pte_t *)pud);
+			pages++;
 			continue;
 		}
 
 		pmd_base = pud_pgtable(*pud);
-		remove_pmd_table(pmd_base, addr, next);
+		remove_pmd_table(pmd_base, addr, next, direct);
 		free_pmd_table(pmd_base, pud);
 	}
+	if (direct)
+		update_page_count(MMU_PAGE_1G, -pages);
 }
 
-static void __meminit remove_pagetable(unsigned long start, unsigned long end)
+static void __meminit remove_pagetable(unsigned long start, unsigned long end,
+				       bool direct)
 {
 	unsigned long addr, next;
 	pud_t *pud_base;
@@ -859,7 +869,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 		}
 
 		pud_base = p4d_pgtable(*p4d);
-		remove_pud_table(pud_base, addr, next);
+		remove_pud_table(pud_base, addr, next, direct);
 		free_pud_table(pud_base, p4d);
 	}
 
@@ -882,7 +892,7 @@ int __meminit radix__create_section_mapping(unsigned long start,
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
 {
-	remove_pagetable(start, end);
+	remove_pagetable(start, end, true);
 	return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -918,7 +928,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size)
 {
-	remove_pagetable(start, start + page_size);
+	remove_pagetable(start, start + page_size, false);
 }
 #endif
 #endif
-- 
2.40.1



* [PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2023-06-06  4:55 ` [PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can cross-boundary Aneesh Kumar K.V
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15a099e53cde..76f6a1f3b9d8 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -910,7 +910,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 				      unsigned long phys)
 {
 	/* Create a PTE encoding */
-	unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW;
 	int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
 	int ret;
 
@@ -919,7 +918,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 		return -1;
 	}
 
-	ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, nid);
+	ret = __map_kernel_page_nid(start, phys, PAGE_KERNEL, page_size, nid);
 	BUG_ON(ret);
 
 	return 0;
-- 
2.40.1



* [PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can cross-boundary
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2023-06-06  4:55 ` [PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support Aneesh Kumar K.V
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

Without this fix, the last subsection vmemmap can end up in memory even if the
namespace is created with -M mem and has sufficient space in the altmap area.
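
An illustration of the off-by-one with made-up numbers: with
altmap->end_pfn == 0x1fff, an allocation with start_pfn == 0x1f00 and
nr_pfn == 0x100 has its last pfn at 0x1fff. The old check computed
0x2000 > 0x1fff and reported a boundary crossing; the fixed check computes
0x1fff > 0x1fff and correctly lets the allocation stay in the altmap area.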

Fixes: cf387d9644d8 ("libnvdimm/altmap: Track namespace boundaries in altmap")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 05b0d584e50b..fe1b83020e0d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -189,7 +189,7 @@ static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long star
 	unsigned long nr_pfn = page_size / sizeof(struct page);
 	unsigned long start_pfn = page_to_pfn((struct page *)start);
 
-	if ((start_pfn + nr_pfn) > altmap->end_pfn)
+	if ((start_pfn + nr_pfn - 1) > altmap->end_pfn)
 		return true;
 
 	if (start_pfn < altmap->base_pfn)
-- 
2.40.1



* [PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2023-06-06  4:55 ` [PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can cross-boundary Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg Aneesh Kumar K.V
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

Architectures like powerpc would like to enable transparent hugepage PUD
support only with radix translation. To support that, add a
has_transparent_pud_hugepage() helper that architectures can override.
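
An architecture overrides the generic fallback by providing its own helper
and a macro of the same name in its pgtable header; patch 13 in this series
does exactly that for powerpc:

static inline int has_transparent_pud_hugepage(void)
{
	if (radix_enabled())
		return radix__has_transparent_pud_hugepage();
	return 0;
}
#define has_transparent_pud_hugepage has_transparent_pud_hugepage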

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/nvdimm/pfn_devs.c | 2 +-
 include/linux/pgtable.h   | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index af7d9301520c..18ad315581ca 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned long *alignments)
 
 	if (has_transparent_hugepage()) {
 		alignments[1] = HPAGE_PMD_SIZE;
-		if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+		if (has_transparent_pud_hugepage())
 			alignments[2] = HPAGE_PUD_SIZE;
 	}
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..b3f4dd0240f5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1597,6 +1597,9 @@ typedef unsigned int pgtbl_mod_mask;
 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
 #endif
 
+#ifndef has_transparent_pud_hugepage
+#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#endif
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.40.1



* [PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2023-06-06  4:55 ` [PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support Aneesh Kumar K.V
@ 2023-06-06  4:55 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override Aneesh Kumar K.V
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:55 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

We will use this in a later patch to do a tlb flush when clearing pud entries
on powerpc. This is similar to
commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take vm_area_struct as arg").

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/pgtable.h | 4 ++--
 mm/debug_vm_pgtable.c   | 2 +-
 mm/huge_memory.c        | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b3f4dd0240f5..2fe19720075e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -442,11 +442,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
+static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
 					    unsigned long address, pud_t *pudp,
 					    int full)
 {
-	return pudp_huge_get_and_clear(mm, address, pudp);
+	return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..c2bf25d5e5cd 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -382,7 +382,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
 	WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-	pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1);
+	pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1);
 	pud = READ_ONCE(*args->pudp);
 	WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..8774b4751a84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1980,7 +1980,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (!ptl)
 		return 0;
 
-	pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
-- 
2.40.1



* [PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2023-06-06  4:55 ` [PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

DAX vmemmap optimization requires a minimum of two PAGE_SIZE areas within the
vmemmap, such that the tail page mappings can point to the second PAGE_SIZE
area. Enforce that in the vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow an architecture
override.
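
A worked example of the new check, assuming 64K pages and a 64-byte
struct page: a 1G devdax compound page spans 1G / 64K = 16384 struct pages,
i.e. 16384 * 64 = 1M of vmemmap, which is 1M / 64K = 16 vmemmap pages.
16 > VMEMMAP_RESERVE_NR (2), so __vmemmap_can_optimize() allows the
optimization.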

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/mm.h | 30 ++++++++++++++++++++++++++----
 mm/mm_init.c       |  2 +-
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..9a45e61cd83f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,8 @@
 #include <linux/memremap.h>
 #include <linux/slab.h>
 
+#include <asm/page.h>
+
 struct mempolicy;
 struct anon_vma;
 struct anon_vma_chain;
@@ -3550,13 +3552,33 @@ void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap);
 #endif
 
+#define VMEMMAP_RESERVE_NR	2
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-					   struct dev_pagemap *pgmap)
+static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
+					  struct dev_pagemap *pgmap)
 {
-	return is_power_of_2(sizeof(struct page)) &&
-		pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
+	if (pgmap) {
+		unsigned long nr_pages;
+		unsigned long nr_vmemmap_pages;
+
+		nr_pages = pgmap_vmemmap_nr(pgmap);
+		nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT);
+		/*
+		 * For vmemmap optimization with DAX we need minimum 2 vmemmap
+		 * pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst
+		 */
+		return is_power_of_2(sizeof(struct page)) &&
+			(nr_vmemmap_pages > VMEMMAP_RESERVE_NR) && !altmap;
+	}
+	return false;
 }
+/*
+ * If we don't have an architecture override, use the generic rule
+ */
+#ifndef vmemmap_can_optimize
+#define vmemmap_can_optimize __vmemmap_can_optimize
+#endif
+
 #else
 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
 					   struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..d1676afc94f1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
 	if (!vmemmap_can_optimize(altmap, pgmap))
 		return pgmap_vmemmap_nr(pgmap);
 
-	return 2 * (PAGE_SIZE / sizeof(struct page));
+	return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
 }
 
 static void __ref memmap_init_compound(struct page *head,
-- 
2.40.1



* [PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

Architectures like powerpc would like to use different page table allocators
and mapping mechanisms to implement vmemmap optimization. Similar to
vmemmap_populate, allow architectures to implement their own
vmemmap_populate_compound_pages().
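
Since the generic definition is now guarded by #ifndef, the expected override
pattern is for the architecture to declare its own implementation and define
a macro of the same name, roughly (a sketch; the actual powerpc wiring lands
later in this series):

/* in an arch pgtable header */
int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
					      unsigned long start,
					      unsigned long end, int node,
					      struct dev_pagemap *pgmap);
#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages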

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 10d73a0dfcec..0b83706c08fd 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
+#ifndef vmemmap_populate_compound_pages
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 				       struct vmem_altmap *altmap,
 				       struct page *reuse)
@@ -446,6 +447,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
-- 
2.40.1



* [PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

This lets architectures override pmd_same and pud_same
independently.
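
For example, patch 13 in this series overrides only pud_same() for powerpc:

#define __HAVE_ARCH_PUD_SAME
static inline int pud_same(pud_t pud_a, pud_t pud_b)
{
	if (radix_enabled())
		return radix__pud_same(pud_a, pud_b);
	BUG();
	return 0;
}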

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2fe19720075e..8c5174d1f9db 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,7 +681,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
 	return pmd_val(pmd_a) == pmd_val(pmd_b);
 }
+#endif
 
+#ifndef __HAVE_ARCH_PUD_SAME
 static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
 	return pud_val(pud_a) == pud_val(pud_b);
-- 
2.40.1



* [PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (9 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

The pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect and
move_huge_pmd helpers, use the architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c             | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8c5174d1f9db..c7f5806dc9d1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -550,6 +550,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long address, pud_t *pudp)
 {
@@ -563,6 +564,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
 	BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif
 
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..6373db571e5c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -338,7 +338,7 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
 			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.40.1



* [PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (10 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

Arm64 disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writable to
read-only) and the output address (pfn) of the vmemmap ptes. Some
architectures do not support such updates without first unmapping the pte
(marking it invalid).

With DAX vmemmap optimization we don't require such pte updates, and
architectures can enable DAX vmemmap optimization while keeping hugetlb
vmemmap optimization disabled. Hence split DAX optimization support into a
separate config option.

loongarch and riscv don't have devdax support, so the DAX config is not
enabled for them. With this change, arm64 should be able to select the DAX
optimization.
[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
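
For illustration only (not part of this series), arm64 opting in to the
devdax side would then be a one-line Kconfig change along the lines of:

# hypothetical, in arch/arm64/Kconfig:
config ARM64
	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP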

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/loongarch/Kconfig | 2 +-
 arch/riscv/Kconfig     | 2 +-
 arch/x86/Kconfig       | 3 ++-
 fs/Kconfig             | 2 +-
 include/linux/mm.h     | 2 +-
 mm/Kconfig             | 5 ++++-
 6 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index d38b066fc931..2060990c4612 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -55,7 +55,7 @@ config LOONGARCH
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_VMEMMAP
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANTS_NO_INSTR
 	select BUILDTIME_TABLE_SORT
 	select COMMON_CLK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 348c0fa1fc8c..aafbf201f708 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -49,7 +49,7 @@ config RISCV
 	select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
 	select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
 	select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-	select ARCH_WANT_OPTIMIZE_VMEMMAP
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
 	select BUILDTIME_TABLE_SORT if MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..eb383960b6ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,7 +127,8 @@ config X86
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_VMEMMAP	if X86_64
+	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select BUILDTIME_TABLE_SORT
diff --git a/fs/Kconfig b/fs/Kconfig
index 18d034ec7953..9c104c130a6e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -252,7 +252,7 @@ config HUGETLB_PAGE
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	def_bool HUGETLB_PAGE
-	depends on ARCH_WANT_OPTIMIZE_VMEMMAP
+	depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a45e61cd83f..6e56ae09f0c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3553,7 +3553,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 #endif
 
 #define VMEMMAP_RESERVE_NR	2
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
 					  struct dev_pagemap *pgmap)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..7b388c10baab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_VMEMMAP
+config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
+	bool
+
+config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	bool
 
 config HAVE_MEMBLOCK_PHYS_MAP
-- 
2.40.1



* [PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (11 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-06  4:56 ` [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

This is enabled only with radix translation and 1G hugepage size. This will be
used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepages are not supported even though we do have helpers
checking pud_trans_huge(); we should never find those returning true. The only
expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called with hash translation and
hence are marked to call BUG() in that case.
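
For reference, the intended consumer is a 1G-aligned devdax namespace, which
can be created with something like (illustrative ndctl invocation):

ndctl create-namespace --mode=devdax --align=1G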

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 ++++++++++++++++--
 arch/powerpc/include/asm/book3s/64/radix.h    |  37 +++++
 .../include/asm/book3s/64/tlbflush-radix.h    |   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/mm/book3s64/pgtable.c            |  78 +++++++++
 arch/powerpc/mm/book3s64/radix_pgtable.c      |  28 ++++
 arch/powerpc/mm/book3s64/radix_tlb.c          |   7 +
 arch/powerpc/platforms/Kconfig.cputype        |   1 +
 include/trace/events/thp.h                    |  17 ++
 9 files changed, 323 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..9a05de007956 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
 {
 	return __pud_raw(pte_raw(pte));
 }
+
+static inline pte_t *pudp_ptep(pud_t *pud)
+{
+	return (pte_t *)pud;
+}
+
+#define pud_pfn(pud)		pte_pfn(pud_pte(pud))
+#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
+#define pud_young(pud)		pte_young(pud_pte(pud))
+#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
+#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite(pud)	pte_pud(pte_mkwrite(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#define pud_soft_dirty(pud)    pte_soft_dirty(pud_pte(pud))
+#define pud_mksoft_dirty(pud)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
+#define pud_clear_soft_dirty(pud) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 static inline int pud_bad(pud_t pud)
 {
 	if (radix_enabled())
@@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		       pmd_t *pmdp, pmd_t pmd);
+extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
+		       pud_t *pudp, pud_t pud);
+
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
 					unsigned long addr, pmd_t *pmd)
 {
 }
 
+static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
+					unsigned long addr, pud_t *pud)
+{
+}
+
 extern int hash__has_transparent_hugepage(void);
 static inline int has_transparent_hugepage(void)
 {
@@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
 }
 #define has_transparent_hugepage has_transparent_hugepage
 
+static inline int has_transparent_pud_hugepage(void)
+{
+	if (radix_enabled())
+		return radix__has_transparent_pud_hugepage();
+	return 0;
+}
+#define has_transparent_pud_hugepage has_transparent_pud_hugepage
+
 static inline unsigned long
 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
 		    unsigned long clr, unsigned long set)
@@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
 	return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
 }
 
+static inline unsigned long
+pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp,
+		    unsigned long clr, unsigned long set)
+{
+	if (radix_enabled())
+		return radix__pud_hugepage_update(mm, addr, pudp, clr, set);
+	BUG();
+	return pud_val(*pudp);
+}
+
 /*
  * returns true for pmd migration entries, THP, devmap, hugetlb
  * But compile time dependent on THP config
@@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd)
 	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
+static inline int pud_large(pud_t pud)
+{
+	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
 /*
  * For radix we should always find H_PAGE_HASHPTE zero. Hence
  * the below will work for radix too
@@ -1166,6 +1219,17 @@ static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
 	return ((old & _PAGE_ACCESSED) != 0);
 }
 
+static inline int __pudp_test_and_clear_young(struct mm_struct *mm,
+					      unsigned long addr, pud_t *pudp)
+{
+	unsigned long old;
+
+	if ((pud_raw(*pudp) & cpu_to_be64(_PAGE_ACCESSED | H_PAGE_HASHPTE)) == 0)
+		return 0;
+	old = pud_hugepage_update(mm, addr, pudp, _PAGE_ACCESSED, 0);
+	return ((old & _PAGE_ACCESSED) != 0);
+}
+
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 				      pmd_t *pmdp)
@@ -1174,6 +1238,14 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 		pmd_hugepage_update(mm, addr, pmdp, _PAGE_WRITE, 0);
 }
 
+#define __HAVE_ARCH_PUDP_SET_WRPROTECT
+static inline void pudp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
+				      pud_t *pudp)
+{
+	if (pud_write(*pudp))
+		pud_hugepage_update(mm, addr, pudp, _PAGE_WRITE, 0);
+}
+
 /*
  * Only returns true for a THP. False for pmd migration entry.
  * We also need to return true when we come across a pte that
@@ -1195,6 +1267,17 @@ static inline int pmd_trans_huge(pmd_t pmd)
 	return hash__pmd_trans_huge(pmd);
 }
 
+static inline int pud_trans_huge(pud_t pud)
+{
+	if (!pud_present(pud))
+		return false;
+
+	if (radix_enabled())
+		return radix__pud_trans_huge(pud);
+	return 0;
+}
+
+
 #define __HAVE_ARCH_PMD_SAME
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
@@ -1203,6 +1286,16 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 	return hash__pmd_same(pmd_a, pmd_b);
 }
 
+#define __HAVE_ARCH_PUD_SAME
+static inline int pud_same(pud_t pud_a, pud_t pud_b)
+{
+	if (radix_enabled())
+		return radix__pud_same(pud_a, pud_b);
+	BUG();
+	return 0;
+}
+
+
 static inline pmd_t __pmd_mkhuge(pmd_t pmd)
 {
 	if (radix_enabled())
@@ -1210,6 +1303,14 @@ static inline pmd_t __pmd_mkhuge(pmd_t pmd)
 	return hash__pmd_mkhuge(pmd);
 }
 
+static inline pud_t __pud_mkhuge(pud_t pud)
+{
+	if (radix_enabled())
+		return radix__pud_mkhuge(pud);
+	BUG();
+	return pud;
+}
+
 /*
  * pfn_pmd return a pmd_t that can be used as pmd pte entry.
  */
@@ -1225,14 +1326,34 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd)
 	return pmd;
 }
 
+static inline pud_t pud_mkhuge(pud_t pud)
+{
+#ifdef CONFIG_DEBUG_VM
+	if (radix_enabled())
+		WARN_ON((pud_raw(pud) & cpu_to_be64(_PAGE_PTE)) == 0);
+	else
+		WARN_ON(1);
+#endif
+	return pud;
+}
+
+
 #define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 extern int pmdp_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp,
 				 pmd_t entry, int dirty);
+#define __HAVE_ARCH_PUDP_SET_ACCESS_FLAGS
+extern int pudp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pud_t *pudp,
+				 pud_t entry, int dirty);
 
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 				     unsigned long address, pmd_t *pmdp);
+#define __HAVE_ARCH_PUDP_TEST_AND_CLEAR_YOUNG
+extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long address, pud_t *pudp);
+
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1243,6 +1364,16 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 	return hash__pmdp_huge_get_and_clear(mm, addr, pmdp);
 }
 
+#define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
+static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
+					    unsigned long addr, pud_t *pudp)
+{
+	if (radix_enabled())
+		return radix__pudp_huge_get_and_clear(mm, addr, pudp);
+	BUG();
+	return *pudp;
+}
+
 static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmdp)
 {
@@ -1257,6 +1388,11 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmdp, int full);
 
+#define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
+pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
+				   unsigned long addr,
+				   pud_t *pudp, int full);
+
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
 					      pmd_t *pmdp, pgtable_t pgtable)
@@ -1305,6 +1441,14 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return hash__pmd_mkdevmap(pmd);
 }
 
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+	if (radix_enabled())
+		return radix__pud_mkdevmap(pud);
+	BUG();
+	return pud;
+}
+
 static inline int pmd_devmap(pmd_t pmd)
 {
 	return pte_devmap(pmd_pte(pmd));
@@ -1312,7 +1456,7 @@ static inline int pmd_devmap(pmd_t pmd)
 
 static inline int pud_devmap(pud_t pud)
 {
-	return 0;
+	return pte_devmap(pud_pte(pud));
 }
 
 static inline int pgd_devmap(pgd_t pgd)
@@ -1321,16 +1465,6 @@ static inline int pgd_devmap(pgd_t pgd)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline int pud_pfn(pud_t pud)
-{
-	/*
-	 * Currently all calls to pud_pfn() are gated around a pud_devmap()
-	 * check so this should never be used. If it grows another user we
-	 * want to know about it.
-	 */
-	BUILD_BUG();
-	return 0;
-}
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
 pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
 void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 686001eda936..8cdff5a05011 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -250,6 +250,10 @@ static inline int radix__pud_bad(pud_t pud)
 	return !!(pud_val(pud) & RADIX_PUD_BAD_BITS);
 }
 
+static inline int radix__pud_same(pud_t pud_a, pud_t pud_b)
+{
+	return ((pud_raw(pud_a) ^ pud_raw(pud_b)) == 0);
+}
 
 static inline int radix__p4d_bad(p4d_t p4d)
 {
@@ -268,9 +272,22 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
 	return __pmd(pmd_val(pmd) | _PAGE_PTE);
 }
 
+static inline int radix__pud_trans_huge(pud_t pud)
+{
+	return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+}
+
+static inline pud_t radix__pud_mkhuge(pud_t pud)
+{
+	return __pud(pud_val(pud) | _PAGE_PTE);
+}
+
 extern unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
 					  pmd_t *pmdp, unsigned long clr,
 					  unsigned long set);
+extern unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long addr,
+						pud_t *pudp, unsigned long clr,
+						unsigned long set);
 extern pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 extern void radix__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
@@ -278,6 +295,9 @@ extern void radix__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 extern pgtable_t radix__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 extern pmd_t radix__pmdp_huge_get_and_clear(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp);
+pud_t radix__pudp_huge_get_and_clear(struct mm_struct *mm,
+				     unsigned long addr, pud_t *pudp);
+
 static inline int radix__has_transparent_hugepage(void)
 {
 	/* For radix 2M at PMD level means thp */
@@ -285,6 +305,14 @@ static inline int radix__has_transparent_hugepage(void)
 		return 1;
 	return 0;
 }
+
+static inline int radix__has_transparent_pud_hugepage(void)
+{
+	/* For radix 1G at PUD level means pud hugepage support */
+	if (mmu_psize_defs[MMU_PAGE_1G].shift == PUD_SHIFT)
+		return 1;
+	return 0;
+}
 #endif
 
 static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
@@ -292,9 +320,18 @@ static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
 	return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
 }
 
+static inline pud_t radix__pud_mkdevmap(pud_t pud)
+{
+	return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
+}
+
+struct vmem_altmap;
+struct dev_pagemap;
 extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
 					     unsigned long page_size,
 					     unsigned long phys);
+int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
+				      int node, struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
 				    unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 77797a2a82eb..a38542259fab 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -68,6 +68,8 @@ void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
 				      unsigned long end, int psize);
 extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
 				       unsigned long start, unsigned long end);
+extern void radix__flush_pud_tlb_range(struct vm_area_struct *vma,
+				       unsigned long start, unsigned long end);
 extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 			    unsigned long end);
 extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0d0c1447ecf0..a01c20a8fbf7 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -50,6 +50,14 @@ static inline void flush_pmd_tlb_range(struct vm_area_struct *vma,
 		radix__flush_pmd_tlb_range(vma, start, end);
 }
 
+#define __HAVE_ARCH_FLUSH_PUD_TLB_RANGE
+static inline void flush_pud_tlb_range(struct vm_area_struct *vma,
+				       unsigned long start, unsigned long end)
+{
+	if (radix_enabled())
+		radix__flush_pud_tlb_range(vma, start, end);
+}
+
 #define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
 static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
 					   unsigned long start,
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 85c84e89e3ea..9e5f01a1738c 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -64,11 +64,39 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 	return changed;
 }
 
+int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
+			  pud_t *pudp, pud_t entry, int dirty)
+{
+	int changed;
+#ifdef CONFIG_DEBUG_VM
+	WARN_ON(!pud_devmap(*pudp));
+	assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
+#endif
+	changed = !pud_same(*(pudp), entry);
+	if (changed) {
+		/*
+		 * We can use MMU_PAGE_1G here, because only the radix
+		 * path looks at the psize.
+		 */
+		__ptep_set_access_flags(vma, pudp_ptep(pudp),
+					pud_pte(entry), address, MMU_PAGE_1G);
+	}
+	return changed;
+}
+
+
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long address, pmd_t *pmdp)
 {
 	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
 }
+
+int pudp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long address, pud_t *pudp)
+{
+	return __pudp_test_and_clear_young(vma->vm_mm, address, pudp);
+}
+
 /*
  * set a new huge pmd. We should not be called for updating
  * an existing pmd entry. That should go via pmd_hugepage_update.
@@ -90,6 +118,23 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
 }
 
+void set_pud_at(struct mm_struct *mm, unsigned long addr,
+		pud_t *pudp, pud_t pud)
+{
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Make sure hardware valid bit is not set. We don't do
+	 * tlb flush for this update.
+	 */
+
+	WARN_ON(pte_hw_valid(pud_pte(*pudp)));
+	assert_spin_locked(pud_lockptr(mm, pudp));
+	WARN_ON(!(pud_large(pud)));
+#endif
+	trace_hugepage_set_pud(addr, pud_val(pud));
+	return set_pte_at(mm, addr, pudp_ptep(pudp), pud_pte(pud));
+}
+
 static void do_serialize(void *arg)
 {
 	/* We've taken the IPI, so try to trim the mask while here */
@@ -147,11 +192,35 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 	return pmd;
 }
 
+pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
+				   unsigned long addr, pud_t *pudp, int full)
+{
+	pud_t pud;
+
+	VM_BUG_ON(addr & ~HPAGE_PUD_MASK);
+	VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
+		  !pud_present(*pudp));
+	pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
+	/*
+	 * If it is not a fullmm flush, then we can possibly end up converting
+	 * this PUD pte entry to a regular level 0 PTE by a parallel page fault.
+	 * Make sure we flush the tlb in this case.
+	 */
+	if (!full)
+		flush_pud_tlb_range(vma, addr, addr + HPAGE_PUD_SIZE);
+	return pud;
+}
+
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
 {
 	return __pmd(pmd_val(pmd) | pgprot_val(pgprot));
 }
 
+static pud_t pud_set_protbits(pud_t pud, pgprot_t pgprot)
+{
+	return __pud(pud_val(pud) | pgprot_val(pgprot));
+}
+
 /*
  * At some point we should be able to get rid of
  * pmd_mkhuge() and mk_huge_pmd() when we update all the
@@ -166,6 +235,15 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
 	return __pmd_mkhuge(pmd_set_protbits(__pmd(pmdv), pgprot));
 }
 
+pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot)
+{
+	unsigned long pudv;
+
+	pudv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
+
+	return __pud_mkhuge(pud_set_protbits(__pud(pudv), pgprot));
+}
+
 pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
 {
 	return pfn_pmd(page_to_pfn(page), pgprot);
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 76f6a1f3b9d8..d7e2dd3d4add 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -965,6 +965,23 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
 	return old;
 }
 
+unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long addr,
+					 pud_t *pudp, unsigned long clr,
+					 unsigned long set)
+{
+	unsigned long old;
+
+#ifdef CONFIG_DEBUG_VM
+	WARN_ON(!pud_devmap(*pudp));
+	assert_spin_locked(pud_lockptr(mm, pudp));
+#endif
+
+	old = radix__pte_update(mm, addr, pudp_ptep(pudp), clr, set, 1);
+	trace_hugepage_update(addr, old, clr, set);
+
+	return old;
+}
+
 pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			pmd_t *pmdp)
 
@@ -1041,6 +1058,17 @@ pmd_t radix__pmdp_huge_get_and_clear(struct mm_struct *mm,
 	return old_pmd;
 }
 
+pud_t radix__pudp_huge_get_and_clear(struct mm_struct *mm,
+				     unsigned long addr, pud_t *pudp)
+{
+	pud_t old_pud;
+	unsigned long old;
+
+	old = radix__pud_hugepage_update(mm, addr, pudp, ~0UL, 0);
+	old_pud = __pud(old);
+	return old_pud;
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 void radix__ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index ce804b7bf84e..a18f7d2c9f63 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1453,6 +1453,13 @@ void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(radix__flush_pmd_tlb_range);
 
+void radix__flush_pud_tlb_range(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
+{
+	radix__flush_tlb_range_psize(vma->vm_mm, start, end, MMU_PAGE_1G);
+}
+EXPORT_SYMBOL(radix__flush_pud_tlb_range);
+
 void radix__flush_tlb_all(void)
 {
 	unsigned long rb,prs,r,rs;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 45fd975ef521..340b86ef7284 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -94,6 +94,7 @@ config PPC_BOOK3S_64
 	select PPC_FPU
 	select PPC_HAVE_PMU_SUPPORT
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h
index 202b3e3e67ff..a919d943d106 100644
--- a/include/trace/events/thp.h
+++ b/include/trace/events/thp.h
@@ -25,6 +25,23 @@ TRACE_EVENT(hugepage_set_pmd,
 	    TP_printk("Set pmd with 0x%lx with 0x%lx", __entry->addr, __entry->pmd)
 );
 
+TRACE_EVENT(hugepage_set_pud,
+
+	    TP_PROTO(unsigned long addr, unsigned long pud),
+	    TP_ARGS(addr, pud),
+	    TP_STRUCT__entry(
+		    __field(unsigned long, addr)
+		    __field(unsigned long, pud)
+			    ),
+
+	    TP_fast_assign(
+		    __entry->addr = addr;
+		    __entry->pud = pud;
+		    ),
+
+	    TP_printk("Set pud with 0x%lx with 0x%lx", __entry->addr, __entry->pud)
+);
+
 
 TRACE_EVENT(hugepage_update,
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (12 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-14 10:50     ` Sachin Sant
  2023-06-06  4:56 ` [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

This is in preparation for updating radix to implement vmemmap optimization
for devdax. Below are the rules for radix vmemmap mapping (a rough sketch of
this decision order follows the list):

1. First try to map things using a PMD (2M) mapping.
2. With an altmap, if the altmap cross-boundary check returns true, fall back to PAGE_SIZE.
3. If we can't allocate PMD_SIZE backing memory for the vmemmap, fall back to PAGE_SIZE.
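
A rough userspace sketch of this decision order (PAGE_SIZE/PMD_SIZE values,
helper names, and the simplified cross-boundary test are illustrative
assumptions, not the kernel implementation):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE	(64UL * 1024)		/* 64K base pages */
#define PMD_SIZE	(2UL * 1024 * 1024)	/* 2M PMD mapping */

/* stand-in: true when a PMD-sized block would spill outside the altmap */
static bool cross_boundary(unsigned long start, unsigned long size,
			   unsigned long alt_start, unsigned long alt_end)
{
	return start < alt_start || start + size > alt_end;
}

static unsigned long pick_mapping_size(unsigned long start, bool have_altmap,
				       unsigned long alt_start,
				       unsigned long alt_end, bool pmd_alloc_ok)
{
	/* rule 2: an altmap is in use and a 2M block would spill outside it */
	if (have_altmap && cross_boundary(start, PMD_SIZE, alt_start, alt_end))
		return PAGE_SIZE;
	/* rule 1: prefer a PMD (2M) mapping when backing memory is available */
	if (pmd_alloc_ok)
		return PMD_SIZE;
	/* rule 3: PMD_SIZE backing allocation failed, fall back */
	return PAGE_SIZE;
}

int main(void)
{
	/* altmap too small for a 2M block: falls back to 64K */
	printf("%lu\n", pick_mapping_size(0, true, 0, PMD_SIZE / 2, true));
	/* no altmap and the 2M allocation succeeds: maps at 2M */
	printf("%lu\n", pick_mapping_size(0, false, 0, 0, true));
	return 0;
}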

On removing a vmemmap mapping, check whether every subsection that is using the
vmemmap area is invalid. If they all are, we can safely free the vmemmap area.
We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we
need to do the above check even at PAGE_SIZE granularity (a toy model of this
check follows).
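
The following standalone C model illustrates that check; PAGES_PER_SUBSECTION
and the subsection_valid[] state are illustrative assumptions, not the
kernel's sparsemem bookkeeping:

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_SUBSECTION	2048UL	/* assumed value, for illustration */

/* toy state: which subsections still have a valid memmap */
static bool subsection_valid[8] = { false, true, false, false };

static bool range_is_unused(unsigned long start_pfn, unsigned long nr_pfns)
{
	unsigned long pfn;

	/* the backing vmemmap page is free-able only if no subsection is valid */
	for (pfn = start_pfn; pfn < start_pfn + nr_pfns; pfn += PAGES_PER_SUBSECTION)
		if (subsection_valid[pfn / PAGES_PER_SUBSECTION])
			return false;
	return true;
}

int main(void)
{
	/* subsection 1 is still valid: prints 0, not free-able */
	printf("%d\n", range_is_unused(0, 4 * PAGES_PER_SUBSECTION));
	/* subsections 4..7 are invalid: prints 1, backing can be freed */
	printf("%d\n", range_is_unused(4 * PAGES_PER_SUBSECTION, 4 * PAGES_PER_SUBSECTION));
	return 0;
}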

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h |   2 +
 arch/powerpc/include/asm/pgtable.h         |   3 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 293 +++++++++++++++++++--
 arch/powerpc/mm/init_64.c                  |  26 +-
 4 files changed, 293 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8cdff5a05011..87d4c1e62491 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
 					     unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
 				      int node, struct vmem_altmap *altmap);
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+			       struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
 				    unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9972626ddaf6..6d4cd2ebae6e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
 
 struct seq_file;
 void arch_report_meminfo(struct seq_file *m);
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+			   unsigned long page_size);
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d7e2dd3d4add..65de8630abcb 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 	p4d_clear(p4d);
 }
 
+static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end)
+{
+	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+	/* the range is unused only if no subsection in it still has a memmap */
+	return !vmemmap_populated(start, PMD_SIZE);
+}
+
+static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end)
+{
+	unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+
+	return !vmemmap_populated(start, PAGE_SIZE);
+}
+
+static void __meminit free_vmemmap_pages(struct page *page,
+					 struct vmem_altmap *altmap,
+					 int order)
+{
+	unsigned int nr_pages = 1 << order;
+
+	if (altmap) {
+		unsigned long alt_start, alt_end;
+		unsigned long base_pfn = page_to_pfn(page);
+
+		/*
+		 * With 1G vmemmap mapping we can have things set up
+		 * such that even though an altmap is specified we
+		 * never use it.
+		 */
+		alt_start = altmap->base_pfn;
+		alt_end = altmap->base_pfn + altmap->reserve +
+			altmap->free + altmap->alloc + altmap->align;
+
+		if (base_pfn >= alt_start && base_pfn < alt_end) {
+			vmem_altmap_free(altmap, nr_pages);
+			return;
+		}
+	}
+
+	if (PageReserved(page)) {
+		/* allocated from memblock */
+		while (nr_pages--)
+			free_reserved_page(page++);
+	} else
+		free_pages((unsigned long)page_address(page), order);
+}
+
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-			     unsigned long end, bool direct)
+			     unsigned long end, bool direct,
+			     struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pte_t *pte;
@@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 		if (!pte_present(*pte))
 			continue;
 
-		if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
-			/*
-			 * The vmemmap_free() and remove_section_mapping()
-			 * codepaths call us with aligned addresses.
-			 */
-			WARN_ONCE(1, "%s: unaligned range\n", __func__);
-			continue;
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			if (!direct)
+				free_vmemmap_pages(pte_page(*pte), altmap, 0);
+			pte_clear(&init_mm, addr, pte);
+			pages++;
+		} else if (!direct && vmemmap_page_is_unused(addr, next)) {
+			free_vmemmap_pages(pte_page(*pte), altmap, 0);
+			pte_clear(&init_mm, addr, pte);
 		}
-
-		pte_clear(&init_mm, addr, pte);
-		pages++;
 	}
 	if (direct)
 		update_page_count(mmu_virtual_psize, -pages);
 }
 
 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-				       unsigned long end, bool direct)
+				       unsigned long end, bool direct,
+				       struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pte_t *pte_base;
@@ -788,18 +836,21 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 			continue;
 
 		if (pmd_is_leaf(*pmd)) {
-			if (!IS_ALIGNED(addr, PMD_SIZE) ||
-			    !IS_ALIGNED(next, PMD_SIZE)) {
-				WARN_ONCE(1, "%s: unaligned range\n", __func__);
-				continue;
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				if (!direct)
+					free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE));
+				pte_clear(&init_mm, addr, (pte_t *)pmd);
+				pages++;
+			} else if (!direct && vmemmap_pmd_is_unused(addr, next)) {
+				free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE));
+				pte_clear(&init_mm, addr, (pte_t *)pmd);
 			}
-			pte_clear(&init_mm, addr, (pte_t *)pmd);
-			pages++;
 			continue;
 		}
 
 		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-		remove_pte_table(pte_base, addr, next, direct);
+		remove_pte_table(pte_base, addr, next, direct, altmap);
 		free_pte_table(pte_base, pmd);
 	}
 	if (direct)
@@ -807,7 +858,8 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 }
 
 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
-				       unsigned long end, bool direct)
+				       unsigned long end, bool direct,
+				       struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pmd_t *pmd_base;
@@ -832,15 +884,16 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 		}
 
 		pmd_base = pud_pgtable(*pud);
-		remove_pmd_table(pmd_base, addr, next, direct);
+		remove_pmd_table(pmd_base, addr, next, direct, altmap);
 		free_pmd_table(pmd_base, pud);
 	}
 	if (direct)
 		update_page_count(MMU_PAGE_1G, -pages);
 }
 
-static void __meminit remove_pagetable(unsigned long start, unsigned long end,
-				       bool direct)
+static void __meminit
+remove_pagetable(unsigned long start, unsigned long end, bool direct,
+		 struct vmem_altmap *altmap)
 {
 	unsigned long addr, next;
 	pud_t *pud_base;
@@ -869,7 +922,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end,
 		}
 
 		pud_base = p4d_pgtable(*p4d);
-		remove_pud_table(pud_base, addr, next, direct);
+		remove_pud_table(pud_base, addr, next, direct, altmap);
 		free_pud_table(pud_base, p4d);
 	}
 
@@ -892,7 +945,7 @@ int __meminit radix__create_section_mapping(unsigned long start,
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
 {
-	remove_pagetable(start, end, true);
+	remove_pagetable(start, end, true, NULL);
 	return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -924,10 +977,198 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 	return 0;
 }
 
+int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
+				unsigned long addr, unsigned long next)
+{
+	int large = pmd_large(*pmd);
+
+	if (large)
+		vmemmap_verify(pmdp_ptep(pmd), node, addr, next);
+
+	return large;
+}
+
+void __meminit vmemmap_set_pmd(pmd_t *pmdp, void *p, int node,
+			       unsigned long addr, unsigned long next)
+{
+	pte_t entry;
+	pte_t *ptep = pmdp_ptep(pmdp);
+
+	entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+	set_pte_at(&init_mm, addr, ptep, entry);
+	asm volatile("ptesync": : :"memory");
+
+	vmemmap_verify(ptep, node, addr, next);
+}
+
+static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
+						     struct vmem_altmap *altmap,
+						     struct page *reuse)
+{
+	pte_t *pte = pte_offset_kernel(pmd, addr);
+
+	if (pte_none(*pte)) {
+		pte_t entry;
+		void *p;
+
+		if (!reuse) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call to this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE paths,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(reuse);
+			p = page_to_virt(reuse);
+		}
+		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+		set_pte_at(&init_mm, addr, pte, entry);
+		asm volatile("ptesync": : :"memory");
+	}
+	return pte;
+}
+
+static inline pud_t *vmemmap_pud_alloc(p4d_t *p4d, int node,
+				       unsigned long address)
+{
+	pud_t *pud;
+
+	/* To keep it simple, all early vmemmap mappings are done at PAGE_SIZE */
+	if (unlikely(p4d_none(*p4d))) {
+		if (unlikely(!slab_is_available())) {
+			pud = early_alloc_pgtable(PAGE_SIZE, node, 0, 0);
+			p4d_populate(&init_mm, p4d, pud);
+			/* go to the pud_offset */
+		} else
+			return pud_alloc(&init_mm, p4d, address);
+	}
+	return pud_offset(p4d, address);
+}
+
+static inline pmd_t *vmemmap_pmd_alloc(pud_t *pud, int node,
+				       unsigned long address)
+{
+	pmd_t *pmd;
+
+	/* To keep it simple, all early vmemmap mappings are done at PAGE_SIZE */
+	if (unlikely(pud_none(*pud))) {
+		if (unlikely(!slab_is_available())) {
+			pmd = early_alloc_pgtable(PAGE_SIZE, node, 0, 0);
+			pud_populate(&init_mm, pud, pmd);
+		} else
+			return pmd_alloc(&init_mm, pud, address);
+	}
+	return pmd_offset(pud, address);
+}
+
+static inline pte_t *vmemmap_pte_alloc(pmd_t *pmd, int node,
+				       unsigned long address)
+{
+	pte_t *pte;
+
+	/* To keep it simple, all early vmemmap mappings are done at PAGE_SIZE */
+	if (unlikely(pmd_none(*pmd))) {
+		if (unlikely(!slab_is_available())) {
+			pte = early_alloc_pgtable(PAGE_SIZE, node, 0, 0);
+			pmd_populate(&init_mm, pmd, pte);
+		} else
+			return pte_alloc_kernel(pmd, address);
+	}
+	return pte_offset_kernel(pmd, address);
+}
+
+int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node,
+				      struct vmem_altmap *altmap)
+{
+	unsigned long addr;
+	unsigned long next;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		p4d = p4d_offset(pgd, addr);
+		pud = vmemmap_pud_alloc(p4d, node, addr);
+		if (!pud)
+			return -ENOMEM;
+		pmd = vmemmap_pmd_alloc(pud, node, addr);
+		if (!pmd)
+			return -ENOMEM;
+		if (pmd_none(READ_ONCE(*pmd))) {
+			void *p;
+
+			if (altmap && altmap_cross_boundary(altmap, start, PMD_SIZE)) {
+				/* don't create altmap mappings covering things outside the altmap */
+				goto base_mapping;
+			}
+
+			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
+			if (p) {
+				vmemmap_set_pmd(pmd, p, node, addr, next);
+				continue;
+			} else if (altmap) {
+				/*
+				 * No fallback: In any case we care about, the
+				 * altmap should be reasonably sized and aligned
+				 * such that vmemmap_alloc_block_buf() will always
+				 * succeed. For consistency with the PTE case,
+				 * return an error here as failure could indicate
+				 * a configuration issue with the size of the altmap.
+				 */
+				return -ENOMEM;
+			}
+		} else if (vmemmap_check_pmd(pmd, node, addr, next)) {
+			/*
+			 * If a huge mapping exists due to an earlier call to
+			 * vmemmap_populate(), try to use it.
+			 */
+			continue;
+		}
+base_mapping:
+		/*
+		 * We were not able to allocate higher-order memory to back
+		 * the memmap, or we found a pointer to a pte page. Allocate
+		 * base page size vmemmap.
+		 */
+		pte = vmemmap_pte_alloc(pmd, node, addr);
+		if (!pte)
+			return -ENOMEM;
+
+		pte = radix__vmemmap_pte_populate(pmd, addr, node, altmap, NULL);
+		if (!pte)
+			return -ENOMEM;
+
+		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+		next = addr + PAGE_SIZE;
+	}
+	return 0;
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size)
 {
-	remove_pagetable(start, start + page_size, false);
+	remove_pagetable(start, start + page_size, true, NULL);
+}
+
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+			       struct vmem_altmap *altmap)
+{
+	remove_pagetable(start, end, false, altmap);
 }
 #endif
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index fe1b83020e0d..5701faca39ef 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -92,7 +92,7 @@ static struct page * __meminit vmemmap_subsection_start(unsigned long vmemmap_ad
  * a page table lookup here because with the hash translation we don't keep
  * vmemmap details in linux page table.
  */
-static int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size)
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size)
 {
 	struct page *start;
 	unsigned long vmemmap_end = vmemmap_addr + vmemmap_map_size;
@@ -183,8 +183,8 @@ static __meminit int vmemmap_list_populate(unsigned long phys,
 	return 0;
 }
 
-static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
-				unsigned long page_size)
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+			   unsigned long page_size)
 {
 	unsigned long nr_pfn = page_size / sizeof(struct page);
 	unsigned long start_pfn = page_to_pfn((struct page *)start);
@@ -204,6 +204,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	bool altmap_alloc;
 	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (radix_enabled())
+		return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
 	/* Align to the page size of the linear mapping. */
 	start = ALIGN_DOWN(start, page_size);
 
@@ -303,8 +308,8 @@ static unsigned long vmemmap_list_free(unsigned long start)
 	return vmem_back->phys;
 }
 
-void __ref vmemmap_free(unsigned long start, unsigned long end,
-		struct vmem_altmap *altmap)
+void __ref __vmemmap_free(unsigned long start, unsigned long end,
+			  struct vmem_altmap *altmap)
 {
 	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 	unsigned long page_order = get_order(page_size);
@@ -362,6 +367,17 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 		vmemmap_remove_mapping(start, page_size);
 	}
 }
+
+void __ref vmemmap_free(unsigned long start, unsigned long end,
+			struct vmem_altmap *altmap)
+{
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (radix_enabled())
+		return radix__vmemmap_free(start, end, altmap);
+#endif
+	return __vmemmap_free(start, end, altmap);
+}
+
 #endif
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (13 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-07 23:54     ` kernel test robot
  2023-06-06  4:56 ` [PATCH 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize Aneesh Kumar K.V
  2023-06-14  4:11   ` Aneesh Kumar K.V
  16 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

With a 2M PMD-level mapping, we require 32 struct pages, and a single vmemmap
page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with
a 64K page size, we don't use vmemmap deduplication for PMD-level mappings (a
small worked example of the arithmetic follows).
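
A small standalone sketch of the arithmetic behind this and the new
documentation file; it assumes sizeof(struct page) is 64 bytes, which the
1024-structs-per-64K-page figure implies:

#include <stdio.h>

#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */

static void show(const char *name, unsigned long map_size,
		 unsigned long page_size)
{
	unsigned long nr_struct_pages = map_size / page_size;
	unsigned long vmemmap_bytes = nr_struct_pages * STRUCT_PAGE_SIZE;
	unsigned long vmemmap_pages = (vmemmap_bytes + page_size - 1) / page_size;

	printf("%s: %lu struct pages -> %lu vmemmap page(s)\n",
	       name, nr_struct_pages, vmemmap_pages);
}

int main(void)
{
	show("2M PMD @ 64K", 2UL << 20, 64UL << 10);	/* 32 -> 1: nothing to dedup */
	show("1G PUD @ 64K", 1UL << 30, 64UL << 10);	/* 16384 -> 16 */
	show("2M PMD @ 4K",  2UL << 20, 4UL << 10);	/* 512 -> 8 */
	show("1G PUD @ 4K",  1UL << 30, 4UL << 10);	/* 262144 -> 4096 */
	return 0;
}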

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/mm/vmemmap_dedup.rst         |   1 +
 Documentation/powerpc/vmemmap_dedup.rst    | 101 ++++++++++
 arch/powerpc/Kconfig                       |   1 +
 arch/powerpc/include/asm/book3s/64/radix.h |   8 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 203 +++++++++++++++++++++
 5 files changed, 314 insertions(+)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index a4b12ff906c4..c573e08b5043 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -210,6 +210,7 @@ the device (altmap).
 
 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+For the powerpc equivalent details, see Documentation/powerpc/vmemmap_dedup.rst.
 
 The differences with HugeTLB are relatively minor.
 
diff --git a/Documentation/powerpc/vmemmap_dedup.rst b/Documentation/powerpc/vmemmap_dedup.rst
new file mode 100644
index 000000000000..dc4db59fdf87
--- /dev/null
+++ b/Documentation/powerpc/vmemmap_dedup.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Device DAX
+==========
+
+The device-dax interface uses the tail deduplication technique explained in
+Documentation/mm/vmemmap_dedup.rst
+
+On powerpc, vmemmap deduplication is only used with radix MMU translation. Also,
+with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
+deduplication.
+
+With a 2M PMD-level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence no vmemmap
+deduplication is possible.
+
+With a 1G PUD-level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in the vmemmap to map the struct pages for a 1G PUD-level
+mapping.
+
+Here is how things look on device-dax after the sections are populated::
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PUD    |                     +-----------+                       | | |
+ |   level   |                     |     .     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     .     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     15    | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+
+With a 4K page size, a 2M PMD-level mapping requires 512 struct pages and a
+single 4K vmemmap page contains 64 struct pages (4K/sizeof(struct page)). Hence
+we require 8 4K pages in the vmemmap to map the struct pages for a 2M PMD-level
+mapping.
+
+Here is how things look on device-dax after the sections are populated::
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+With a 1G PUD-level mapping, we require 262144 struct pages and a single 4K
+vmemmap page can contain 64 struct pages (4K/sizeof(struct page)). Hence we
+require 4096 4K pages in the vmemmap to map the struct pages for a 1G PUD-level
+mapping.
+
+Here is how things look on device-dax after the sections are populated::
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PUD    |                     +-----------+                       | | |
+ |   level   |                     |     .     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     .     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |   4095    | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index bff5820b7cda..6bd9ca6f2448 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -175,6 +175,7 @@ config PPC
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
+	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if PPC_RADIX_MMU
 	select ARCH_WANTS_MODULES_DATA_IN_VMALLOC	if PPC_BOOK3S_32 || PPC_8xx
 	select ARCH_WEAK_RELEASE_ACQUIRE
 	select BINFMT_ELF
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 87d4c1e62491..3195f268ed7f 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -364,5 +364,13 @@ int radix__remove_section_mapping(unsigned long start, unsigned long end);
 
 void radix__kernel_map_pages(struct page *page, int numpages, int enable);
 
+#define vmemmap_can_optimize vmemmap_can_optimize
+bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
+
+#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages
+int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+					      unsigned long start,
+					      unsigned long end, int node,
+					      struct dev_pagemap *pgmap);
 #endif /* __ASSEMBLY__ */
 #endif
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 65de8630abcb..82d36df52a8d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -977,6 +977,15 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 	return 0;
 }
 
+
+bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap)
+{
+	if (radix_enabled())
+		return __vmemmap_can_optimize(altmap, pgmap);
+
+	return false;
+}
+
 int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
 				unsigned long addr, unsigned long next)
 {
@@ -1159,6 +1168,200 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 	return 0;
 }
 
+static pte_t * __meminit radix__vmemmap_populate_address(unsigned long addr, int node,
+							 struct vmem_altmap *altmap,
+							 struct page *reuse)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	p4d = p4d_offset(pgd, addr);
+	pud = vmemmap_pud_alloc(p4d, node, addr);
+	if (!pud)
+		return NULL;
+	pmd = vmemmap_pmd_alloc(pud, node, addr);
+	if (!pmd)
+		return NULL;
+	if (pmd_leaf(*pmd))
+		/*
+		 * The range is mapped as a hugepage due to a nearby request.
+		 * Force our mapping to page size without deduplication.
+		 */
+		return NULL;
+	pte = vmemmap_pte_alloc(pmd, node, addr);
+	if (!pte)
+		return NULL;
+	radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	return pte;
+}
+
+static pte_t * __meminit vmemmap_compound_tail_page(unsigned long addr,
+						    unsigned long pfn_offset, int node)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned long map_addr;
+
+	/* the second vmemmap page, which we reuse for deduplication */
+	map_addr = addr - pfn_offset * sizeof(struct page) + PAGE_SIZE;
+	pgd = pgd_offset_k(map_addr);
+	p4d = p4d_offset(pgd, map_addr);
+	pud = vmemmap_pud_alloc(p4d, node, map_addr);
+	if (!pud)
+		return NULL;
+	pmd = vmemmap_pmd_alloc(pud, node, map_addr);
+	if (!pmd)
+		return NULL;
+	if (pmd_leaf(*pmd))
+		/*
+		 * The second page is mapped as a hugepage due to a nearby request.
+		 * Force our mapping to page size without deduplication
+		 */
+		return NULL;
+	pte = vmemmap_pte_alloc(pmd, node, map_addr);
+	if (!pte)
+		return NULL;
+	/*
+	 * Check if there already exists a mapping to the left
+	 */
+	if (pte_none(*pte)) {
+		/*
+		 * Populate the head page's vmemmap page.
+		 * It can fall in a different pmd, hence
+		 * radix__vmemmap_populate_address().
+		 */
+		pte = radix__vmemmap_populate_address(map_addr - PAGE_SIZE, node, NULL, NULL);
+		if (!pte)
+			return NULL;
+		/*
+		 * Populate the tail pages' vmemmap page.
+		 */
+		pte = radix__vmemmap_pte_populate(pmd, map_addr, node, NULL, NULL);
+		if (!pte)
+			return NULL;
+		vmemmap_verify(pte, node, map_addr, map_addr + PAGE_SIZE);
+		return pte;
+	}
+	return pte;
+}
+
+int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+					      unsigned long start,
+					      unsigned long end, int node,
+					      struct dev_pagemap *pgmap)
+{
+	/*
+	 * We want to map things using base page size mappings so that
+	 * we can save space in the vmemmap. We could have huge mappings
+	 * covering both edges.
+	 */
+	unsigned long addr;
+	unsigned long addr_pfn = start_pfn;
+	unsigned long next;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	for (addr = start; addr < end; addr = next) {
+
+		pgd = pgd_offset_k(addr);
+		p4d = p4d_offset(pgd, addr);
+		pud = vmemmap_pud_alloc(p4d, node, addr);
+		if (!pud)
+			return -ENOMEM;
+		pmd = vmemmap_pmd_alloc(pud, node, addr);
+		if (!pmd)
+			return -ENOMEM;
+
+		if (pmd_leaf(READ_ONCE(*pmd))) {
+			/* existing huge mapping. Skip the range */
+			addr_pfn += (PMD_SIZE >> PAGE_SHIFT);
+			next = pmd_addr_end(addr, end);
+			continue;
+		}
+		pte = vmemmap_pte_alloc(pmd, node, addr);
+		if (!pte)
+			return -ENOMEM;
+		if (!pte_none(*pte)) {
+			/*
+			 * This could be because we already have a compound
+			 * page whose VMEMMAP_RESERVE_NR pages were mapped and
+			 * this request falls within those pages.
+			 */
+			addr_pfn += 1;
+			next = addr + PAGE_SIZE;
+			continue;
+		} else {
+			unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+			unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
+			pte_t *tail_page_pte;
+
+			/*
+			 * If the address is aligned to the huge page size, it is the
+			 * head mapping.
+			 */
+			if (pfn_offset == 0) {
+				/* Populate the head page vmemmap page */
+				pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
+				if (!pte)
+					return -ENOMEM;
+				vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+				/*
+				 * Populate the tail pages' vmemmap page.
+				 * It can fall in a different pmd, hence
+				 * radix__vmemmap_populate_address().
+				 */
+				pte = radix__vmemmap_populate_address(addr + PAGE_SIZE, node, NULL, NULL);
+				if (!pte)
+					return -ENOMEM;
+
+				addr_pfn += 2;
+				next = addr + 2 * PAGE_SIZE;
+				continue;
+			}
+			/*
+			 * Get the second mapping's details;
+			 * also create it if it doesn't exist.
+			 */
+			tail_page_pte = vmemmap_compound_tail_page(addr, pfn_offset, node);
+			if (!tail_page_pte) {
+				pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, NULL);
+				if (!pte)
+					return -ENOMEM;
+				vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+				addr_pfn += 1;
+				next = addr + PAGE_SIZE;
+				continue;
+			}
+
+			pte = radix__vmemmap_pte_populate(pmd, addr, node, NULL, pte_page(*tail_page_pte));
+			if (!pte)
+				return -ENOMEM;
+			vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+			addr_pfn += 1;
+			next = addr + PAGE_SIZE;
+			continue;
+		}
+	}
+	return 0;
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
                   ` (14 preceding siblings ...)
  2023-06-06  4:56 ` [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix Aneesh Kumar K.V
@ 2023-06-06  4:56 ` Aneesh Kumar K.V
  2023-06-14  4:11   ` Aneesh Kumar K.V
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-06  4:56 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: Will Deacon, Muchun Song, Aneesh Kumar K.V, Catalin Marinas,
	Dan Williams, Oscar Salvador, linuxppc-dev, Joao Martins,
	Mike Kravetz

mmu_vmemmap_psize is not used by radix anymore.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ----------
 arch/powerpc/mm/init_64.c                | 21 ++++++++++++++-------
 2 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 82d36df52a8d..b59219751599 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -601,16 +601,6 @@ void __init radix__early_init_mmu(void)
 	mmu_virtual_psize = MMU_PAGE_4K;
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-	/* vmemmap mapping */
-	if (mmu_psize_defs[MMU_PAGE_2M].shift) {
-		/*
-		 * map vmemmap using 2M if available
-		 */
-		mmu_vmemmap_psize = MMU_PAGE_2M;
-	} else
-		mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 	/*
 	 * initialize page table size
 	 */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5701faca39ef..6db7a063ba63 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
 	return false;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-		struct vmem_altmap *altmap)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+				 struct vmem_altmap *altmap)
 {
 	bool altmap_alloc;
 	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-	if (radix_enabled())
-		return radix__vmemmap_populate(start, end, node, altmap);
-#endif
-
 	/* Align to the page size of the linear mapping. */
 	start = ALIGN_DOWN(start, page_size);
 
@@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return 0;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
+			       struct vmem_altmap *altmap)
+{
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (radix_enabled())
+		return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
+	return __vmemmap_populate(start, end, node, altmap);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long vmemmap_list_free(unsigned long start)
 {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  2023-06-06  4:56 ` [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix Aneesh Kumar K.V
@ 2023-06-07 23:54     ` kernel test robot
  0 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-06-07 23:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, npiggin, christophe.leroy
  Cc: oe-kbuild-all, mpe, linuxppc-dev, Oscar Salvador, Mike Kravetz,
	Dan Williams, Joao Martins, Catalin Marinas, Muchun Song,
	Will Deacon, Aneesh Kumar K.V

Hi Aneesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on powerpc/next]
[also build test WARNING on powerpc/fixes akpm-mm/mm-everything linus/master v6.4-rc5]
[cannot apply to nvdimm/libnvdimm-for-next tip/x86/core nvdimm/dax-misc next-20230607]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/powerpc-mm-book3s64-Use-pmdp_ptep-helper-instead-of-typecasting/20230606-125913
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:    https://lore.kernel.org/r/20230606045608.55127-16-aneesh.kumar%40linux.ibm.com
patch subject: [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
reproduce:
        git remote add powerpc https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
        git fetch powerpc next
        git checkout powerpc/next
        b4 shazam https://lore.kernel.org/r/20230606045608.55127-16-aneesh.kumar@linux.ibm.com
        make menuconfig
        # enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
        make htmldocs

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306080711.eEcyyPFk-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Documentation/powerpc/vmemmap_dedup.rst: WARNING: document isn't included in any toctree

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64
  2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
@ 2023-06-14  4:11   ` Aneesh Kumar K.V
  2023-06-06  4:55 ` [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix Aneesh Kumar K.V
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-14  4:11 UTC (permalink / raw)
  To: linux-mm, akpm, npiggin, christophe.leroy
  Cc: mpe, linuxppc-dev, Oscar Salvador, Mike Kravetz, Dan Williams,
	Joao Martins, Catalin Marinas, Muchun Song, Will Deacon

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> This patch series implements changes required to support DAX vmemmap
> optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
> translation and 1GB PUD mapping with 64K page size. The patch series also split
> hugetlb vmemmap optimization as a separate Kconfig variable so that
> architectures can enable DAX vmemmap optimization without enabling hugetlb
> vmemmap optimization. This should enable architectures like arm64 to enable DAX
> vmemmap optimization while they can't enable hugetlb vmemmap optimization. More
> details of the same are in patch "mm/vmemmap optimization: Split hugetlb and
> devdax vmemmap optimization".
>
> Aneesh Kumar K.V (16):
>   powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
>   powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
>   powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
>   powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
>   powerpc/mm/dax: Fix the condition when checking if altmap vmemap can
>     cross-boundary
>   mm/hugepage pud: Allow arch-specific helper function to check huge
>     page pud support
>   mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
>   mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
>     override
>   mm/vmemmap: Allow architectures to override how vmemmap optimization
>     works
>   mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
>   mm/huge pud: Use transparent huge pud helpers only with
>     CONFIG_TRANSPARENT_HUGEPAGE
>   mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
>   powerpc/book3s64/mm: Enable transparent pud hugepage
>   powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
>     handling function
>   powerpc/book3s64/radix: Add support for vmemmap optimization for radix
>   powerpc/book3s64/radix: Remove mmu_vmemmap_psize

   Gentle ping. Any objections to this series?

   -aneesh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
  2023-06-06  4:56 ` [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function Aneesh Kumar K.V
@ 2023-06-14 10:50     ` Sachin Sant
  0 siblings, 0 replies; 29+ messages in thread
From: Sachin Sant @ 2023-06-14 10:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, Nicholas Piggin, Christophe Leroy,
	Will Deacon, Muchun Song, Catalin Marinas, Dan Williams,
	Oscar Salvador, linuxppc-dev, Joao Martins, Mike Kravetz


> 1. First try to map things using a PMD (2M) mapping.
> 2. With an altmap, if the altmap cross-boundary check returns true, fall back to PAGE_SIZE.
> 3. If we can't allocate PMD_SIZE backing memory for the vmemmap, fall back to PAGE_SIZE.
> 
> On removing a vmemmap mapping, check whether every subsection that is using the
> vmemmap area is invalid. If they all are, we can safely free the vmemmap area.
> We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we
> need to do the above check even at PAGE_SIZE granularity.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---

With this patch series applied, I see the following warning:

[  OK  ] Started Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
[    3.283884] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: nvdimm pmu didn't register rc=-2
[    3.284212] papr_scm ibm,persistent-memory:ibm,pmemory@44104002: nvdimm pmu didn't register rc=-2
[    3.563890] radix-mmu: Mapped 0x0000040010000000-0x0000040c90000000 with 64.0 KiB pages
[    3.703227] ------------[ cut here ]------------
[    3.703236] failed to free all reserved pages
[    3.703244] WARNING: CPU: 41 PID: 923 at mm/memremap.c:152 memunmap_pages+0x37c/0x3a0
[    3.703252] Modules linked in: device_dax(+) nd_pmem nd_btt dax_pmem papr_scm libnvdimm pseries_rng vmx_crypto aes_gcm_p10_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse
[    3.703272] CPU: 41 PID: 923 Comm: systemd-udevd Not tainted 6.4.0-rc6-00037-gb6dad5178cea-dirty #1
[    3.703276] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries
[    3.703280] NIP:  c00000000057a18c LR: c00000000057a188 CTR: 00000000005ca81c
[    3.703283] REGS: c000000032a170d0 TRAP: 0700   Not tainted  (6.4.0-rc6-00037-gb6dad5178cea-dirty)
[    3.703286] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48248824  XER: 00000002
[    3.703296] CFAR: c00000000015f0c0 IRQMASK: 0  [    3.703296] GPR00: c00000000057a188 c000000032a17370 c000000001421500 0000000000000021  [    3.703296] GPR04: 00000000ffff7fff c000000032a17140 c000000032a17138 0000000000000027  [    3.703296] GPR08: c0000015c91a7c10 0000000000000001 0000000000000027 c000000002a18a20  [    3.703296] GPR12: 0000000048248824 c0000015cb9f4300 c000000032a17d68 c000000001262b20  [    3.703296] GPR16: c008000001310000 000000000000ff20 000000000000fff2 c0080000012d7418  [    3.703296] GPR20: c000000032a17c30 0000000000000004 ffffffffffffc005 0000000001000200  [    3.703296] GPR24: c000000002f11570 c00000000e376870 0000000000000001 0000000000000001  [    3.703296] GPR28: c00000000e376840 c00000000e3768c8 0000000000000000 c00000000e376840  [    3.703333] NIP [c00000000057a18c] memunmap_pages+0x37c/0x3a0
[    3.703338] LR [c00000000057a188] memunmap_pages+0x378/0x3a0
[    3.703342] Call Trace:
[    3.703344] [c000000032a17370] [c00000000057a188] memunmap_pages+0x378/0x3a0 (unreliable)
[    3.703349] [c000000032a17420] [c00000000057a928] memremap_pages+0x4a8/0x890
[    3.703355] [c000000032a17500] [c00000000057ad4c] devm_memremap_pages+0x3c/0xd0
[    3.703359] [c000000032a17540] [c0080000011c084c] dev_dax_probe+0x134/0x3a0 [device_dax]
[    3.703366] [c000000032a175e0] [c0000000009f7e8c] dax_bus_probe+0xac/0x140
[    3.703371] [c000000032a17610] [c0000000009b5828] really_probe+0x108/0x530
[    3.703375] [c000000032a176a0] [c0000000009b5d04] __driver_probe_device+0xb4/0x200
[    3.703379] [c000000032a17720] [c0000000009b5ea8] driver_probe_device+0x58/0x120
[    3.703383] [c000000032a17760] [c0000000009b6298] __driver_attach+0x148/0x250
[    3.703387] [c000000032a177e0] [c0000000009b1a58] bus_for_each_dev+0xa8/0x130
[    3.703392] [c000000032a17840] [c0000000009b4b34] driver_attach+0x34/0x50
[    3.703396] [c000000032a17860] [c0000000009b3b98] bus_add_driver+0x258/0x300
[    3.703400] [c000000032a178f0] [c0000000009b78d4] driver_register+0xa4/0x1b0
[    3.703404] [c000000032a17960] [c0000000009f9530] __dax_driver_register+0x50/0x70
[    3.703409] [c000000032a17980] [c0080000011c1374] dax_init+0x3c/0x58 [device_dax]
[    3.703414] [c000000032a179a0] [c000000000013260] do_one_initcall+0x60/0x2f0
[    3.703418] [c000000032a17a70] [c000000000248af8] do_init_module+0x78/0x310
[    3.703424] [c000000032a17af0] [c00000000024bcac] load_module+0x2a7c/0x2f30
[    3.703429] [c000000032a17d00] [c00000000024c4f0] __do_sys_finit_module+0xe0/0x180
[    3.703434] [c000000032a17e10] [c0000000000374c0] system_call_exception+0x140/0x350
[    3.703439] [c000000032a17e50] [c00000000000d6a0] system_call_common+0x160/0x2e4
[    3.703444] --- interrupt: c00 at 0x7fff9af2fb34
[    3.703447] NIP:  00007fff9af2fb34 LR: 00007fff9b6dea9c CTR: 0000000000000000
[    3.703450] REGS: c000000032a17e80 TRAP: 0c00   Not tainted  (6.4.0-rc6-00037-gb6dad5178cea-dirty)
[    3.703453] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28222204  XER: 00000000
[    3.703462] IRQMASK: 0  [    3.703462] GPR00: 0000000000000161 00007fffed351350 00007fff9b007300 000000000000000f  [    3.703462] GPR04: 00007fff9b6ead30 0000000000000000 000000000000000f 0000000000000000  [    3.703462] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000  [    3.703462] GPR12: 0000000000000000 00007fff9b7c6610 0000000000020000 000000011057db18  [    3.703462] GPR16: 00000001105c0108 0000000110585f48 0000000000000000 0000000000000000  [    3.703462] GPR20: 0000000000000000 0000000110585f80 0000000147985200 00007fffed351570  [    3.703462] GPR24: 00000001105c0128 0000000000020000 0000000000000000 0000000147981010  [    3.703462] GPR28: 00007fff9b6ead30 0000000000020000 0000000000000000 0000000147985200  [    3.703497] NIP [00007fff9af2fb34] 0x7fff9af2fb34
[    3.703499] LR [00007fff9b6dea9c] 0x7fff9b6dea9c
[    3.703502] --- interrupt: c00
[    3.703504] Code: 60000000 3d220170 8929b2b7 2f890000 409eff28 3c62ffe7 39200001 3d420170 3863c518 992ab2b7 4bbe4e55 60000000 <0fe00000> fac10060 fae10068 fb010070  [    3.703516] ---[ end trace 0000000000000000 ]---
[    3.703520] device_dax: probe of dax0.0 failed with error -12
[  OK  ] Created slice system-daxdev\x2dreconfigure.slice.
[  OK  ] Started udev Wait for Complete Device Initialization.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.

The warning appears after applying this patch. 

- Sachin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
  2023-06-14 10:50     ` Sachin Sant
@ 2023-06-15  2:23       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-15  2:23 UTC (permalink / raw)
  To: Sachin Sant
  Cc: linux-mm, Andrew Morton, Nicholas Piggin, Christophe Leroy,
	Will Deacon, Muchun Song, Catalin Marinas, Dan Williams,
	Oscar Salvador, linuxppc-dev, Joao Martins, Mike Kravetz

Sachin Sant <sachinp@linux.ibm.com> writes:

>> 1. First try to map things using PMD (2M)
>> 2. With an altmap, if the altmap cross-boundary check returns true, fall back to PAGE_SIZE
>> 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fall back to PAGE_SIZE
>> 
>> On removing vmemmap mapping, check if every subsection that is using the vmemmap
>> area is invalid. If found to be invalid, that implies we can safely free the
>> vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K
>> page size, we need to do the above check even at the PAGE_SIZE granularity.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>
> With this patch series applied I see the following warning
>
> [  OK  ] Started Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
> [    3.283884] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: nvdimm pmu didn't register rc=-2
> [    3.284212] papr_scm ibm,persistent-memory:ibm,pmemory@44104002: nvdimm pmu didn't register rc=-2
> [    3.563890] radix-mmu: Mapped 0x0000040010000000-0x0000040c90000000 with 64.0 KiB pages
> [    3.703227] ------------[ cut here ]------------
> [    3.703236] failed to free all reserved pages
> [    3.703244] WARNING: CPU: 41 PID: 923 at mm/memremap.c:152 memunmap_pages+0x37c/0x3a0
> [    3.703252] Modules linked in: device_dax(+) nd_pmem nd_btt dax_pmem papr_scm libnvdimm pseries_rng vmx_crypto aes_gcm_p10_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse
> [    3.703272] CPU: 41 PID: 923 Comm: systemd-udevd Not tainted 6.4.0-rc6-00037-gb6dad5178cea-dirty #1
> [    3.703276] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries
> [    3.703280] NIP:  c00000000057a18c LR: c00000000057a188 CTR: 00000000005ca81c
> [    3.703283] REGS: c000000032a170d0 TRAP: 0700   Not tainted  (6.4.0-rc6-00037-gb6dad5178cea-dirty)
> [    3.703286] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48248824  XER: 00000002
> [    3.703296] CFAR: c00000000015f0c0 IRQMASK: 0
> [    3.703296] GPR00: c00000000057a188 c000000032a17370 c000000001421500 0000000000000021
> [    3.703296] GPR04: 00000000ffff7fff c000000032a17140 c000000032a17138 0000000000000027
> [    3.703296] GPR08: c0000015c91a7c10 0000000000000001 0000000000000027 c000000002a18a20
> [    3.703296] GPR12: 0000000048248824 c0000015cb9f4300 c000000032a17d68 c000000001262b20
> [    3.703296] GPR16: c008000001310000 000000000000ff20 000000000000fff2 c0080000012d7418
> [    3.703296] GPR20: c000000032a17c30 0000000000000004 ffffffffffffc005 0000000001000200
> [    3.703296] GPR24: c000000002f11570 c00000000e376870 0000000000000001 0000000000000001
> [    3.703296] GPR28: c00000000e376840 c00000000e3768c8 0000000000000000 c00000000e376840
> [    3.703333] NIP [c00000000057a18c] memunmap_pages+0x37c/0x3a0
> [    3.703338] LR [c00000000057a188] memunmap_pages+0x378/0x3a0
> [    3.703342] Call Trace:
> [    3.703344] [c000000032a17370] [c00000000057a188] memunmap_pages+0x378/0x3a0 (unreliable)
> [    3.703349] [c000000032a17420] [c00000000057a928] memremap_pages+0x4a8/0x890
> [    3.703355] [c000000032a17500] [c00000000057ad4c] devm_memremap_pages+0x3c/0xd0
> [    3.703359] [c000000032a17540] [c0080000011c084c] dev_dax_probe+0x134/0x3a0 [device_dax]
> [    3.703366] [c000000032a175e0] [c0000000009f7e8c] dax_bus_probe+0xac/0x140
> [    3.703371] [c000000032a17610] [c0000000009b5828] really_probe+0x108/0x530
> [    3.703375] [c000000032a176a0] [c0000000009b5d04] __driver_probe_device+0xb4/0x200
> [    3.703379] [c000000032a17720] [c0000000009b5ea8] driver_probe_device+0x58/0x120
> [    3.703383] [c000000032a17760] [c0000000009b6298] __driver_attach+0x148/0x250
> [    3.703387] [c000000032a177e0] [c0000000009b1a58] bus_for_each_dev+0xa8/0x130
> [    3.703392] [c000000032a17840] [c0000000009b4b34] driver_attach+0x34/0x50
> [    3.703396] [c000000032a17860] [c0000000009b3b98] bus_add_driver+0x258/0x300
> [    3.703400] [c000000032a178f0] [c0000000009b78d4] driver_register+0xa4/0x1b0
> [    3.703404] [c000000032a17960] [c0000000009f9530] __dax_driver_register+0x50/0x70
> [    3.703409] [c000000032a17980] [c0080000011c1374] dax_init+0x3c/0x58 [device_dax]
> [    3.703414] [c000000032a179a0] [c000000000013260] do_one_initcall+0x60/0x2f0
> [    3.703418] [c000000032a17a70] [c000000000248af8] do_init_module+0x78/0x310
> [    3.703424] [c000000032a17af0] [c00000000024bcac] load_module+0x2a7c/0x2f30
> [    3.703429] [c000000032a17d00] [c00000000024c4f0] __do_sys_finit_module+0xe0/0x180
> [    3.703434] [c000000032a17e10] [c0000000000374c0] system_call_exception+0x140/0x350
> [    3.703439] [c000000032a17e50] [c00000000000d6a0] system_call_common+0x160/0x2e4
> [    3.703444] --- interrupt: c00 at 0x7fff9af2fb34
> [    3.703447] NIP:  00007fff9af2fb34 LR: 00007fff9b6dea9c CTR: 0000000000000000
> [    3.703450] REGS: c000000032a17e80 TRAP: 0c00   Not tainted  (6.4.0-rc6-00037-gb6dad5178cea-dirty)
> [    3.703453] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28222204  XER: 00000000
> [    3.703462] IRQMASK: 0
> [    3.703462] GPR00: 0000000000000161 00007fffed351350 00007fff9b007300 000000000000000f
> [    3.703462] GPR04: 00007fff9b6ead30 0000000000000000 000000000000000f 0000000000000000
> [    3.703462] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    3.703462] GPR12: 0000000000000000 00007fff9b7c6610 0000000000020000 000000011057db18
> [    3.703462] GPR16: 00000001105c0108 0000000110585f48 0000000000000000 0000000000000000
> [    3.703462] GPR20: 0000000000000000 0000000110585f80 0000000147985200 00007fffed351570
> [    3.703462] GPR24: 00000001105c0128 0000000000020000 0000000000000000 0000000147981010
> [    3.703462] GPR28: 00007fff9b6ead30 0000000000020000 0000000000000000 0000000147985200
> [    3.703497] NIP [00007fff9af2fb34] 0x7fff9af2fb34
> [    3.703499] LR [00007fff9b6dea9c] 0x7fff9b6dea9c
> [    3.703502] --- interrupt: c00
> [    3.703504] Code: 60000000 3d220170 8929b2b7 2f890000 409eff28 3c62ffe7 39200001 3d420170 3863c518 992ab2b7 4bbe4e55 60000000 <0fe00000> fac10060 fae10068 fb010070
> [    3.703516] ---[ end trace 0000000000000000 ]---
> [    3.703520] device_dax: probe of dax0.0 failed with error -12
> [  OK  ] Created slice system-daxdev\x2dreconfigure.slice.
> [  OK  ] Started udev Wait for Complete Device Initialization.
> [  OK  ] Reached target Local File Systems (Pre).
> [  OK  ] Reached target Local File Systems.
>

The change below fixed the warning on the test machine you shared.

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 1c49af91fd9c..d884c1b39128 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -994,6 +994,7 @@ void __meminit vmemmap_set_pmd(pmd_t *pmdp, void *p, int node,
 	pte_t entry;
 	pte_t *ptep = pmdp_ptep(pmdp);
 
+	VM_BUG_ON(!IS_ALIGNED((unsigned long)(addr), PMD_SIZE));
 	entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 	set_pte_at(&init_mm, addr, ptep, entry);
 	asm volatile("ptesync": : :"memory");
@@ -1012,6 +1013,10 @@ static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmd, unsigned long a
 		void *p;
 
 		if (!reuse) {
+
+			if (altmap && altmap_cross_boundary(altmap, addr, PAGE_SIZE))
+				altmap = NULL;
+
 			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
 			if (!p)
 				return NULL;
@@ -1028,6 +1033,8 @@ static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmd, unsigned long a
 			get_page(reuse);
 			p = page_to_virt(reuse);
 		}
+
+		VM_BUG_ON(!PAGE_ALIGNED(addr));
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 		asm volatile("ptesync": : :"memory");
@@ -1108,10 +1115,14 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in
 		pmd = vmemmap_pmd_alloc(pud, node, addr);
 		if (!pmd)
 			return -ENOMEM;
+
 		if (pmd_none(READ_ONCE(*pmd))) {
 			void *p;
 
-			if (altmap && altmap_cross_boundary(altmap, start, PMD_SIZE)) {
+			if (!IS_ALIGNED(addr, PMD_SIZE))
+				goto base_mapping;
+
+			if (altmap && altmap_cross_boundary(altmap, addr, PMD_SIZE)) {
 				/* make sure we don't create altmap mappings covering things outside. */
 				goto base_mapping;
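
As a standalone illustration of the ordering this fix enforces, here is a minimal
userspace C sketch (the struct and helper names are made up for illustration; this
is not the kernel code): base pages when the virtual address is not PMD aligned,
base pages when a 2M altmap-backed allocation would overrun the altmap, and only
then a PMD mapping attempt, which can itself still fall back if the 2M allocation
fails.

#include <stdbool.h>
#include <stdio.h>

#define PMD_SIZE  (2UL << 20)	/* 2M mappings on radix */
#define BASE_PAGE 4096UL

/* Simplified stand-in for struct vmem_altmap: a finite pool of
 * device pages that vmemmap allocations may draw from. */
struct fake_altmap {
	unsigned long free_pages;
};

static bool is_aligned(unsigned long addr, unsigned long size)
{
	return (addr & (size - 1)) == 0;
}

/* Stand-in for altmap_cross_boundary(): true if carving size bytes
 * out of the altmap would run past what the altmap covers. */
static bool fake_cross_boundary(struct fake_altmap *altmap, unsigned long size)
{
	return size / BASE_PAGE > altmap->free_pages;
}

static const char *pick_mapping(unsigned long addr, struct fake_altmap *altmap)
{
	if (!is_aligned(addr, PMD_SIZE))
		return "PAGE_SIZE mapping (start not 2M aligned)";
	if (altmap && fake_cross_boundary(altmap, PMD_SIZE))
		return "PAGE_SIZE mapping (altmap would be overrun)";
	return "PMD (2M) mapping (falls back if the 2M allocation fails)";
}

int main(void)
{
	struct fake_altmap small = { .free_pages = 16 };

	puts(pick_mapping(0x10000, NULL));	/* only 64K aligned */
	puts(pick_mapping(0x200000, &small));	/* altmap too small for 2M */
	puts(pick_mapping(0x200000, NULL));	/* aligned, no altmap */
	return 0;
}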
 


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  2023-06-06  4:55 ` [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix Aneesh Kumar K.V
@ 2023-06-21  4:08     ` Michael Ellerman
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Ellerman @ 2023-06-21  4:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm, npiggin, christophe.leroy
  Cc: linuxppc-dev, Oscar Salvador, Mike Kravetz, Dan Williams,
	Joao Martins, Catalin Marinas, Muchun Song, Will Deacon,
	Aneesh Kumar K.V

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> This should not be within CONFIG_PPC_64S_HASH_MMU. We use mmu_vmemmap_psize
> on radix while mapping the vmemmap area.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
>  1 file changed, 2 deletions(-)

This breaks microwatt_defconfig, which does not enable CONFIG_PPC_64S_HASH_MMU:

  ../arch/powerpc/mm/book3s64/radix_pgtable.c: In function ‘radix__early_init_mmu’:
  ../arch/powerpc/mm/book3s64/radix_pgtable.c:601:27: error: lvalue required as left operand of assignment
    601 |         mmu_virtual_psize = MMU_PAGE_4K;
        |                           ^
  make[5]: *** [../scripts/Makefile.build:252: arch/powerpc/mm/book3s64/radix_pgtable.o] Error 1
  make[4]: *** [../scripts/Makefile.build:494: arch/powerpc/mm/book3s64] Error 2
  make[3]: *** [../scripts/Makefile.build:494: arch/powerpc/mm] Error 2
  make[2]: *** [../scripts/Makefile.build:494: arch/powerpc] Error 2
  make[2]: *** Waiting for unfinished jobs....
  make[1]: *** [/home/michael/linux/Makefile:2026: .] Error 2
  make: *** [Makefile:226: __sub-make] Error 2

Because mmu_virtual_psize is defined in hash_utils.c, which isn't built.
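
A minimal sketch of why the diagnostic is "lvalue required" rather than an
undefined reference, assuming (as the error message suggests) that builds without
CONFIG_PPC_64S_HASH_MMU see mmu_virtual_psize as a preprocessor constant instead
of the int defined in hash_utils.c; the macro names are real, the values here are
illustrative:

#define MMU_PAGE_4K  0			/* illustrative values */
#define MMU_PAGE_64K 2

#ifdef CONFIG_PPC_64S_HASH_MMU
int mmu_virtual_psize = MMU_PAGE_4K;	/* real variable, assignable */
#else
#define mmu_virtual_psize MMU_PAGE_4K	/* bare constant, not an lvalue */
#endif

void radix__early_init_mmu_sketch(void)
{
	/* Without the hash MMU config this expands to "0 = 0;", hence
	 * "lvalue required as left operand of assignment". */
	mmu_virtual_psize = MMU_PAGE_4K;
}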

cheers


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  2023-06-21  4:08     ` Michael Ellerman
@ 2023-06-21  5:59       ` Aneesh Kumar K V
  -1 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K V @ 2023-06-21  5:59 UTC (permalink / raw)
  To: Michael Ellerman, linux-mm, akpm, npiggin, christophe.leroy
  Cc: linuxppc-dev, Oscar Salvador, Mike Kravetz, Dan Williams,
	Joao Martins, Catalin Marinas, Muchun Song, Will Deacon

On 6/21/23 9:38 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> This should not be within CONFIG_PPC_64S_HASH_MMU. We use mmu_vmemmap_psize
>> on radix while mapping the vmemmap area.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
>>  1 file changed, 2 deletions(-)
> 
> This breaks microwatt_defconfig, which does not enable CONFIG_PPC_64S_HASH_MMU:
> 
>   ../arch/powerpc/mm/book3s64/radix_pgtable.c: In function ‘radix__early_init_mmu’:
>   ../arch/powerpc/mm/book3s64/radix_pgtable.c:601:27: error: lvalue required as left operand of assignment
>     601 |         mmu_virtual_psize = MMU_PAGE_4K;
>         |                           ^
>   make[5]: *** [../scripts/Makefile.build:252: arch/powerpc/mm/book3s64/radix_pgtable.o] Error 1
>   make[4]: *** [../scripts/Makefile.build:494: arch/powerpc/mm/book3s64] Error 2
>   make[3]: *** [../scripts/Makefile.build:494: arch/powerpc/mm] Error 2
>   make[2]: *** [../scripts/Makefile.build:494: arch/powerpc] Error 2
>   make[2]: *** Waiting for unfinished jobs....
>   make[1]: *** [/home/michael/linux/Makefile:2026: .] Error 2
>   make: *** [Makefile:226: __sub-make] Error 2
> 
> Because mmu_virtual_psize is defined in hash_utils.c, which isn't built.
> 

OK, I missed the mmu_virtual_psize dependency there. I will add microwatt_defconfig to my build configs.


modified   arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,12 +594,14 @@ void __init radix__early_init_mmu(void)
 {
 	unsigned long lpcr;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
 	/* PAGE_SIZE mappings */
 	mmu_virtual_psize = MMU_PAGE_64K;
 #else
 	mmu_virtual_psize = MMU_PAGE_4K;
 #endif
+#endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 	/* vmemmap mapping */



> cheers



^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-06-21  6:01 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-06  4:55 [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix Aneesh Kumar K.V
2023-06-21  4:08   ` Michael Ellerman
2023-06-21  5:59     ` Aneesh Kumar K V
2023-06-06  4:55 ` [PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemap can cross-boundary Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support Aneesh Kumar K.V
2023-06-06  4:55 ` [PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function Aneesh Kumar K.V
2023-06-14 10:50   ` Sachin Sant
2023-06-15  2:23     ` Aneesh Kumar K.V
2023-06-06  4:56 ` [PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix Aneesh Kumar K.V
2023-06-07 23:54   ` kernel test robot
2023-06-06  4:56 ` [PATCH 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize Aneesh Kumar K.V
2023-06-14  4:11 ` [PATCH 00/16] Add support for DAX vmemmap optimization for ppc64 Aneesh Kumar K.V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.