* [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
@ 2024-03-25 14:55 Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate() Christophe Leroy
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

This series reimplements hugepages without hugepd on powerpc 8xx.

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD
entries must point to a single level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times in each of the two page
tables referred to by the two PMD entries associated with that 8M page.
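
As a rough illustration of the intended layout (sketch only, the names
are made up; the real helpers come with the last patches of the series):

	/* 8M page at an 8M-aligned address: the entry is written 1024
	 * times behind each of the two PMD entries covering 4M apiece. */
	int i;

	for (i = 0; i < 1024; i++) {
		pt0[i] = entry;		/* table behind the first PMD entry */
		pt1[i] = entry;		/* table behind the second PMD entry */
	}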

At the moment, each helper has to check whether the hugepage ptep is a
PTE or a PMD entry in order to know whether it is an 8M page or a
smaller size. I hope this can be handled by core-mm in the future.

There are probably several ways to implement stuff, so feedback is
very welcome.

Christophe Leroy (8):
  mm: Provide pagesize to pmd_populate()
  mm: Provide page size to pte_alloc_huge()
  mm: Provide pmd to pte_leaf_size()
  mm: Provide mm_struct and address to huge_ptep_get()
  powerpc/mm: Allow hugepages without hugepd
  powerpc/8xx: Fix size given to set_huge_pte_at()
  powerpc/8xx: Remove support for 8M pages
  powerpc/8xx: Add back support for 8M pages using contiguous PTE
    entries

 arch/arm64/include/asm/hugetlb.h              |  2 +-
 arch/arm64/include/asm/pgtable.h              |  2 +-
 arch/arm64/mm/hugetlbpage.c                   |  2 +-
 arch/parisc/mm/hugetlbpage.c                  |  2 +-
 arch/powerpc/Kconfig                          |  1 -
 arch/powerpc/include/asm/hugetlb.h            | 13 +++-
 .../include/asm/nohash/32/hugetlb-8xx.h       | 54 ++++++++---------
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 59 +++++++++++++------
 arch/powerpc/include/asm/nohash/pgtable.h     | 12 ++--
 arch/powerpc/include/asm/page.h               |  5 --
 arch/powerpc/include/asm/pgtable.h            |  1 +
 arch/powerpc/kernel/head_8xx.S                | 10 +---
 arch/powerpc/mm/hugetlbpage.c                 | 23 +++++++-
 arch/powerpc/mm/nohash/8xx.c                  | 46 +++++++--------
 arch/powerpc/mm/pgtable.c                     | 26 +++++---
 arch/powerpc/mm/pgtable_32.c                  |  2 +-
 arch/powerpc/platforms/Kconfig.cputype        |  2 +
 arch/riscv/include/asm/pgtable.h              |  2 +-
 arch/riscv/mm/hugetlbpage.c                   |  2 +-
 arch/sh/mm/hugetlbpage.c                      |  2 +-
 arch/sparc/include/asm/pgtable_64.h           |  2 +-
 arch/sparc/mm/hugetlbpage.c                   |  4 +-
 fs/hugetlbfs/inode.c                          |  2 +-
 fs/proc/task_mmu.c                            |  8 +--
 fs/userfaultfd.c                              |  2 +-
 include/asm-generic/hugetlb.h                 |  2 +-
 include/linux/hugetlb.h                       |  4 +-
 include/linux/mm.h                            | 12 ++--
 include/linux/pgtable.h                       |  2 +-
 include/linux/swapops.h                       |  2 +-
 kernel/events/core.c                          |  2 +-
 mm/damon/vaddr.c                              |  6 +-
 mm/filemap.c                                  |  2 +-
 mm/gup.c                                      |  2 +-
 mm/hmm.c                                      |  2 +-
 mm/hugetlb.c                                  | 46 +++++++--------
 mm/internal.h                                 |  2 +-
 mm/memory-failure.c                           |  2 +-
 mm/memory.c                                   | 19 +++---
 mm/mempolicy.c                                |  2 +-
 mm/migrate.c                                  |  4 +-
 mm/mincore.c                                  |  2 +-
 mm/pgalloc-track.h                            |  2 +-
 mm/userfaultfd.c                              |  6 +-
 45 files changed, 229 insertions(+), 180 deletions(-)

-- 
2.43.0



* [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 16:19   ` Jason Gunthorpe
  2024-03-25 14:55 ` [RFC PATCH 2/8] mm: Provide page size to pte_alloc_huge() Christophe Leroy
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

Unlike many architectures, powerpc 8xx hardware tablewalk requires a
two-level process for all page sizes, although the second level only
has one entry when the page size is 8M.

To fit the Linux page table topology without requiring a special page
directory layout like hugepd, the page entry will be replicated 1024
times in the standard page table. However, for large pages it is
necessary to set bits in the level-1 (PMD) entry. For the time being,
for 512k pages the flag is kept in the PTE and inserted into the PMD
entry at TLB-miss time; this is necessary because pages of different
sizes can coexist in a single page table. However, the 12 PTE bits are
fully used and there is no room for an additional page-size bit.

For 8M pages there will be only one page per PMD entry, so it is
possible to flag the page size in the PMD entry itself, with the
advantage that the information is already where the hardware expects
it.

To do so, add a new helper called pmd_populate_size() which takes the
page size as an additional argument, and modify __pte_alloc() to also
take that argument. pte_alloc() is left unmodified in order to
reduce churn on callers, and a pte_alloc_size() is added for use by
pte_alloc_huge().

When an architecture doesn't provide pmd_populate_size(),
pmd_populate() is used as a fallback.
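
For illustration, an architecture could then hook the page size roughly
as follows (hypothetical 8xx-flavoured sketch, not part of this patch;
_PMD_PAGE_8M mirrors the flags used by the existing 8xx hugepd helpers):

	#define pmd_populate_size pmd_populate_size
	static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
					     pgtable_t pte_page, unsigned long sz)
	{
		if (sz == SZ_8M)	/* flag the level-1 entry for 8M pages */
			*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT |
				      _PMD_PAGE_8M);
		else
			pmd_populate(mm, pmdp, pte_page);
	}

Note that the generic fallback defines pmd_populate_size() and
pmd_populate_kernel_size() under a single #ifndef, so an architecture
overriding one has to provide both.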

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/linux/mm.h | 12 +++++++-----
 mm/filemap.c       |  2 +-
 mm/internal.h      |  2 +-
 mm/memory.c        | 19 ++++++++++++-------
 mm/pgalloc-track.h |  2 +-
 mm/userfaultfd.c   |  4 ++--
 6 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2c0910bc3e4a..6c5c15955d4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2801,8 +2801,8 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
 static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz);
 
 #if defined(CONFIG_MMU)
 
@@ -2987,7 +2987,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+#define pte_alloc_size(mm, pmd, sz) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, sz))
+#define pte_alloc(mm, pmd) pte_alloc_size(mm, pmd, PAGE_SIZE)
 
 #define pte_alloc_map(mm, pmd, address)			\
 	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
@@ -2996,9 +2997,10 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	(pte_alloc(mm, pmd) ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+#define pte_alloc_kernel_size(pmd, address, sz)			\
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, sz))? \
 		NULL: pte_offset_kernel(pmd, address))
+#define pte_alloc_kernel(pmd, address)	pte_alloc_kernel_size(pmd, address, PAGE_SIZE)
 
 #if USE_SPLIT_PMD_PTLOCKS
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7437b2bd75c1..b013000ea84f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3428,7 +3428,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
 	}
 
 	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
 
 	return false;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 7e486f2c502c..b81c3ca59f45 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -206,7 +206,7 @@ void folio_activate(struct folio *folio);
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *start_vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz);
 
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..c846bb75746b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -409,7 +409,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 	} while (vma);
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+#ifndef pmd_populate_size
+#define pmd_populate_size(mm, pmdp, pte, sz) pmd_populate(mm, pmdp, pte)
+#define pmd_populate_kernel_size(mm, pmdp, pte, sz) pmd_populate_kernel(mm, pmdp, pte)
+#endif
+
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz)
 {
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
@@ -429,25 +434,25 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 		 * smp_rmb() barriers in page table walking code.
 		 */
 		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-		pmd_populate(mm, pmd, *pte);
+		pmd_populate_size(mm, pmd, *pte, sz);
 		*pte = NULL;
 	}
 	spin_unlock(ptl);
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz)
 {
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	pmd_install(mm, pmd, &new, sz);
 	if (new)
 		pte_free(mm, new);
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz)
 {
 	pte_t *new = pte_alloc_one_kernel(&init_mm);
 	if (!new)
@@ -456,7 +461,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		smp_wmb(); /* See comment in pmd_install() */
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_size(&init_mm, pmd, new, sz);
 		new = NULL;
 	}
 	spin_unlock(&init_mm.page_table_lock);
@@ -4738,7 +4743,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
 		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
 			return VM_FAULT_OOM;
 	}
diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
index e9e879de8649..90e37de7ab77 100644
--- a/mm/pgalloc-track.h
+++ b/mm/pgalloc-track.h
@@ -45,7 +45,7 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
 
 #define pte_alloc_kernel_track(pmd, address, mask)			\
 	((unlikely(pmd_none(*(pmd))) &&					\
-	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+	  (__pte_alloc_kernel(pmd, PAGE_SIZE) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
 		NULL: pte_offset_kernel(pmd, address))
 
 #endif /* _LINUX_PGALLOC_TRACK_H */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 712160cd41ec..9baf507ce193 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			break;
 		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
+		    unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) {
 			err = -ENOMEM;
 			break;
 		}
@@ -1686,7 +1686,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 					err = -ENOENT;
 					break;
 				}
-				if (unlikely(__pte_alloc(mm, src_pmd))) {
+				if (unlikely(__pte_alloc(mm, src_pmd, PAGE_SIZE))) {
 					err = -ENOMEM;
 					break;
 				}
-- 
2.43.0



* [RFC PATCH 2/8] mm: Provide page size to pte_alloc_huge()
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate() Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 3/8] mm: Provide pmd to pte_leaf_size() Christophe Leroy
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

In order to be able to flag the PMD entry with _PMD_PAGE_8M on
powerpc 8xx, provide the page size to pte_alloc_huge() and pass it
down through the newly introduced pte_alloc_size().
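
For orientation, the size then flows down the chain introduced in the
previous patch (sketch of the call path, taken from the diffs):

	pte_alloc_huge(mm, pmd, addr, sz)
	  -> pte_alloc_size(mm, pmd, sz)
	     -> __pte_alloc(mm, pmd, sz)
	        -> pmd_install(mm, pmd, &new, sz)
	           -> pmd_populate_size(mm, pmd, *pte, sz)	/* arch hook */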

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/arm64/mm/hugetlbpage.c   | 2 +-
 arch/parisc/mm/hugetlbpage.c  | 2 +-
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 arch/riscv/mm/hugetlbpage.c   | 2 +-
 arch/sh/mm/hugetlbpage.c      | 2 +-
 arch/sparc/mm/hugetlbpage.c   | 2 +-
 include/linux/hugetlb.h       | 4 ++--
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0f0e10bb0a95..71161c655fd6 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -289,7 +289,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			return NULL;
 
 		WARN_ON(addr & (sz - 1));
-		ptep = pte_alloc_huge(mm, pmdp, addr);
+		ptep = pte_alloc_huge(mm, pmdp, addr, sz);
 	} else if (sz == PMD_SIZE) {
 		if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
 			ptep = huge_pmd_share(mm, vma, addr, pudp);
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index a9f7e21f6656..2f4c6b440710 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (pud) {
 		pmd = pmd_alloc(mm, pud, addr);
 		if (pmd)
-			pte = pte_alloc_huge(mm, pmd, addr);
+			pte = pte_alloc_huge(mm, pmd, addr, sz);
 	}
 	return pte;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 594a4b7b2ca2..66ac56b26007 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		return NULL;
 
 	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
+		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
 
 	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 5ef2a6891158..dc77a58c6321 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
 	for_each_napot_order(order) {
 		if (napot_cont_size(order) == sz) {
-			pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
+			pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order), sz);
 			break;
 		}
 	}
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 6cb0ad73dbb9..26579429e5ed 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (pud) {
 				pmd = pmd_alloc(mm, pud, addr);
 				if (pmd)
-					pte = pte_alloc_huge(mm, pmd, addr);
+					pte = pte_alloc_huge(mm, pmd, addr, sz);
 			}
 		}
 	}
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index b432500c13a5..5a342199e837 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		return NULL;
 	if (sz >= PMD_SIZE)
 		return (pte_t *)pmd;
-	return pte_alloc_huge(mm, pmd, addr);
+	return pte_alloc_huge(mm, pmd, addr, sz);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 77b30a8c6076..d9c5d9daadc5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -193,9 +193,9 @@ static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long address)
 	return pte_offset_kernel(pmd, address);
 }
 static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
-				    unsigned long address)
+				    unsigned long address, unsigned long sz)
 {
-	return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address);
+	return pte_alloc_size(mm, pmd, sz) ? NULL : pte_offset_huge(pmd, address);
 }
 #endif
 
-- 
2.43.0



* [RFC PATCH 3/8] mm: Provide pmd to pte_leaf_size()
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate() Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 2/8] mm: Provide page size to pte_alloc_huge() Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

On powerpc 8xx, when a page is 8M in size, that information is carried
in the PMD entry. So provide the PMD to pte_leaf_size().
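
On 8xx this makes something like the following possible (illustrative
sketch; only the 8M check is the point here, the PTE-derived sizes
keep working as before):

	static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
	{
		/* the 8M case is only visible in the PMD entry */
		if (pmd_val(pmd) & _PMD_PAGE_8M)
			return SZ_8M;
		/* ... existing PTE-based logic for the smaller sizes ... */
		return PAGE_SIZE;
	}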

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/arm64/include/asm/pgtable.h             | 2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
 arch/riscv/include/asm/pgtable.h             | 2 +-
 arch/sparc/include/asm/pgtable_64.h          | 2 +-
 arch/sparc/mm/hugetlbpage.c                  | 2 +-
 include/linux/pgtable.h                      | 2 +-
 kernel/events/core.c                         | 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..57c40f2498ab 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -624,7 +624,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 #define pmd_bad(pmd)		(!pmd_table(pmd))
 
 #define pmd_leaf_size(pmd)	(pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
-#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+#define pte_leaf_size(pmd, pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 137dc3c84e45..07df6b664861 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
 
 #define pgd_leaf_size pgd_leaf_size
 
-static inline unsigned long pte_leaf_size(pte_t pte)
+static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
 	pte_basic_t val = pte_val(pte);
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 20242402fc11..45fa27810f25 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -439,7 +439,7 @@ static inline pte_t pte_mkhuge(pte_t pte)
 	return pte;
 }
 
-#define pte_leaf_size(pte)	(pte_napot(pte) ?				\
+#define pte_leaf_size(pmd, pte)	(pte_napot(pte) ?				\
 					napot_cont_size(napot_cont_order(pte)) :\
 					PAGE_SIZE)
 
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4d1bafaba942..67063af2ff8f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1175,7 +1175,7 @@ extern unsigned long pud_leaf_size(pud_t pud);
 extern unsigned long pmd_leaf_size(pmd_t pmd);
 
 #define pte_leaf_size pte_leaf_size
-extern unsigned long pte_leaf_size(pte_t pte);
+extern unsigned long pte_leaf_size(pmd_t pmd, pte_t pte);
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 5a342199e837..60c845a15bee 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -276,7 +276,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
 
 unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift(*(pte_t *)&pud); }
 unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift(*(pte_t *)&pmd); }
-unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift(pte); }
+unsigned long pte_leaf_size(pmd_t pmd, pte_t pte) { return 1UL << tte_to_shift(pte); }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..e605a4149fc7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1802,7 +1802,7 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf_size(x) PMD_SIZE
 #endif
 #ifndef pte_leaf_size
-#define pte_leaf_size(x) PAGE_SIZE
+#define pte_leaf_size(x, y) PAGE_SIZE
 #endif
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..5c1c083222b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7585,7 +7585,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
-		size = pte_leaf_size(pte);
+		size = pte_leaf_size(pmd, pte);
 	pte_unmap(ptep);
 #endif /* CONFIG_HAVE_FAST_GUP */
 
-- 
2.43.0



* [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get()
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (2 preceding siblings ...)
  2024-03-25 14:55 ` [RFC PATCH 3/8] mm: Provide pmd to pte_leaf_size() Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 16:35   ` Jason Gunthorpe
  2024-03-25 14:55 ` [RFC PATCH 5/8] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

On powerpc 8xx, huge_ptep_get() will need to know whether the given
ptep is a PTE entry or a PMD entry. This cannot be deduced from the
entry itself because there is no easy way to tell it from the content
of the entry.

So huge_ptep_get() needs either the size of the page or access to the
PMD.

For consistency with huge_ptep_get_and_clear(), pass mm and address
to huge_ptep_get().
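
On 8xx the extra arguments would let the helper redirect to the real
PTE when the given pointer is in fact a PMD entry, along the lines of
(hypothetical sketch; is_8m_pmdp() is a made-up predicate):

	static inline pte_t huge_ptep_get(struct mm_struct *mm,
					  unsigned long addr, pte_t *ptep)
	{
		/* does ptep point at one of the two PMD entries of an 8M page? */
		if (is_8m_pmdp(mm, addr, ptep))
			ptep = pte_offset_kernel((pmd_t *)ptep,
						 ALIGN_DOWN(addr, SZ_8M));
		return ptep_get(ptep);
	}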

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/arm64/include/asm/hugetlb.h |  2 +-
 fs/hugetlbfs/inode.c             |  2 +-
 fs/proc/task_mmu.c               |  8 +++---
 fs/userfaultfd.c                 |  2 +-
 include/asm-generic/hugetlb.h    |  2 +-
 include/linux/swapops.h          |  2 +-
 mm/damon/vaddr.c                 |  6 ++---
 mm/gup.c                         |  2 +-
 mm/hmm.c                         |  2 +-
 mm/hugetlb.c                     | 46 ++++++++++++++++----------------
 mm/memory-failure.c              |  2 +-
 mm/mempolicy.c                   |  2 +-
 mm/migrate.c                     |  4 +--
 mm/mincore.c                     |  2 +-
 mm/userfaultfd.c                 |  2 +-
 15 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 2ddc33d93b13..1af39a74e791 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, unsigned long sz);
 #define __HAVE_ARCH_HUGE_PTEP_GET
-extern pte_t huge_ptep_get(pte_t *ptep);
+extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
 void __init arm64_hugetlb_cma_reserve(void);
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6502c7e776d1..ec3ec87d29e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -425,7 +425,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
 	if (!ptep)
 		return false;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, addr, ptep);
 	if (huge_pte_none(pte) || !pte_present(pte))
 		return false;
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 23fbab954c20..b14081bcdafe 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1572,7 +1572,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(walk->mm, addr, ptep);
 	if (pte_present(pte)) {
 		struct page *page = pte_page(pte);
 
@@ -2256,7 +2256,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
 	if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
 		/* Go the short route when not write-protecting pages. */
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(walk->mm, start, ptep);
 		categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
 
 		if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2268,7 +2268,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(walk->mm, start, ptep);
 	categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
 
 	if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2663,7 +2663,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	pte_t huge_pte = huge_ptep_get(walk->mm, addr, pte);
 	struct numa_maps *md;
 	struct page *page;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 60dcfafdc11a..177fe1ff14d7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -256,7 +256,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	ret = false;
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
 
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 6dcf4d576970..594d5905f615 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -144,7 +144,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_HUGE_PTEP_GET
-static inline pte_t huge_ptep_get(pte_t *ptep)
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 48b700ba1d18..b9a9f5003bf5 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -334,7 +334,7 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
 
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *pte);
 #else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 381559e4a1fa..58829baf8b5d 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -339,7 +339,7 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
+	pte_t entry = huge_ptep_get(mm, addr, pte);
 	struct folio *folio = pfn_folio(pte_pfn(entry));
 	unsigned long psize = huge_page_size(hstate_vma(vma));
 
@@ -373,7 +373,7 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry))
 		goto out;
 
@@ -509,7 +509,7 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry))
 		goto out;
 
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..ae3f27fc9cc5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2800,7 +2800,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	if (pte_end < end)
 		end = pte_end;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(NULL, addr, ptep);
 
 	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
 		return 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddcab4947..91a0b57fcb2e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -485,7 +485,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 23ef240ba48a..cfe1cf2f30dc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5334,7 +5334,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
 {
 	pte_t entry;
 
-	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)));
+	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(vma->vm_mm, address, ptep)));
 	if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1))
 		update_mmu_cache(vma, address, ptep);
 }
@@ -5442,7 +5442,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-		entry = huge_ptep_get(src_pte);
+		entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 again:
 		if (huge_pte_none(entry)) {
 			/*
@@ -5480,7 +5480,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				set_huge_pte_at(dst, addr, dst_pte,
 						make_pte_marker(marker), sz);
 		} else {
-			entry = huge_ptep_get(src_pte);
+			entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 			pte_folio = page_folio(pte_page(entry));
 			folio_get(pte_folio);
 
@@ -5522,7 +5522,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
 				src_ptl = huge_pte_lockptr(h, src, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-				entry = huge_ptep_get(src_pte);
+				entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 				if (!pte_same(src_pte_old, entry)) {
 					restore_reserve_on_error(h, dst_vma, addr,
 								new_folio);
@@ -5632,7 +5632,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 			new_addr |= last_addr_mask;
 			continue;
 		}
-		if (huge_pte_none(huge_ptep_get(src_pte)))
+		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
 			continue;
 
 		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
@@ -5705,7 +5705,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			continue;
 		}
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(mm, address, ptep);
 		if (huge_pte_none(pte)) {
 			spin_unlock(ptl);
 			continue;
@@ -5942,7 +5942,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		       struct vm_fault *vmf)
 {
 	const bool unshare = flags & FAULT_FLAG_UNSHARE;
-	pte_t pte = huge_ptep_get(ptep);
+	pte_t pte = huge_ptep_get(mm, address, ptep);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *old_folio;
 	struct folio *new_folio;
@@ -6055,7 +6055,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 			spin_lock(ptl);
 			ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
 			if (likely(ptep &&
-				   pte_same(huge_ptep_get(ptep), pte)))
+				   pte_same(huge_ptep_get(mm, haddr, ptep), pte)))
 				goto retry_avoidcopy;
 			/*
 			 * race occurs while re-acquiring page table
@@ -6093,7 +6093,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	spin_lock(ptl);
 	ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
+	if (likely(ptep && pte_same(huge_ptep_get(mm, haddr, ptep), pte))) {
 		pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare);
 
 		/* Break COW or unshare */
@@ -6193,14 +6193,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_fault *vmf,
  * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
  * false if pte changed or is changing.
  */
-static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
+static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned long addr,
 			       pte_t *ptep, pte_t old_pte)
 {
 	spinlock_t *ptl;
 	bool same;
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	same = pte_same(huge_ptep_get(ptep), old_pte);
+	same = pte_same(huge_ptep_get(mm, addr, ptep), old_pte);
 	spin_unlock(ptl);
 
 	return same;
@@ -6265,7 +6265,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * never happen on the page after UFFDIO_COPY has
 			 * correctly installed the page and returned.
 			 */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, haddr, ptep, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6288,7 +6288,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+			if (hugetlb_pte_stable(h, mm, haddr, ptep, old_pte))
 				ret = vmf_error(PTR_ERR(folio));
 			else
 				ret = 0;
@@ -6338,7 +6338,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			folio_unlock(folio);
 			folio_put(folio);
 			/* See comment in userfaultfd_missing() block above */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, haddr, ptep, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6365,7 +6365,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	ptl = huge_pte_lock(h, mm, ptep);
 	ret = 0;
 	/* If pte changed from under us, retry */
-	if (!pte_same(huge_ptep_get(ptep), old_pte))
+	if (!pte_same(huge_ptep_get(mm, address, ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap)
@@ -6488,7 +6488,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	entry = huge_ptep_get(ptep);
+	entry = huge_ptep_get(mm, address, ptep);
 	if (huge_pte_none_mostly(entry)) {
 		if (is_pte_marker(entry)) {
 			pte_marker marker =
@@ -6529,7 +6529,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * be released there.
 			 */
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			migration_entry_wait_huge(vma, ptep);
+			migration_entry_wait_huge(vma, haddr, ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			ret = VM_FAULT_HWPOISON_LARGE |
@@ -6562,11 +6562,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptl = huge_pte_lock(h, mm, ptep);
 
 	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	if (unlikely(!pte_same(entry, huge_ptep_get(mm, address, ptep))))
 		goto out_ptl;
 
 	/* Handle userfault-wp first, before trying to lock more pages */
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, address, ptep)) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(ptl);
@@ -6689,7 +6689,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 		ptl = huge_pte_lock(h, dst_mm, dst_pte);
 
 		/* Don't overwrite any existing PTEs (even markers) */
-		if (!huge_pte_none(huge_ptep_get(dst_pte))) {
+		if (!huge_pte_none(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
 			spin_unlock(ptl);
 			return -EEXIST;
 		}
@@ -6826,7 +6826,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 	 * page backing it, then access the page.
 	 */
 	ret = -EEXIST;
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+	if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte)))
 		goto out_release_unlock;
 
 	if (folio_in_pagecache)
@@ -6901,7 +6901,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		goto out_unlock;
 
 	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(mm, address, pte);
 	if (pte_present(entry)) {
 		page = pte_page(entry);
 
@@ -7018,7 +7018,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			address |= last_addr_mask;
 			continue;
 		}
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(mm, address, ptep);
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			/* Nothing to do. */
 		} else if (unlikely(is_hugetlb_entry_migration(pte))) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9349948f1abf..7c65340b1adb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -820,7 +820,7 @@ static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			    struct mm_walk *walk)
 {
 	struct hwpoison_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
+	pte_t pte = huge_ptep_get(walk->mm, addr, ptep);
 	struct hstate *h = hstate_vma(walk->vma);
 
 	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0fe77738d971..50a79700f496 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -624,7 +624,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry)) {
 		if (unlikely(is_hugetlb_entry_migration(entry)))
 			qp->nr_failed++;
diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..87f7aedb8ee2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -338,14 +338,14 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
  *
  * This function will release the vma lock before returning.
  */
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep)
+void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
 	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep);
 	pte_t pte;
 
 	hugetlb_vma_assert_locked(vma);
 	spin_lock(ptl);
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, addr, ptep);
 
 	if (unlikely(!is_hugetlb_entry_migration(pte))) {
 		spin_unlock(ptl);
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..b5735a4aaa7d 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -33,7 +33,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none_mostly(huge_ptep_get(pte));
+	present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9baf507ce193..2b2ce329552e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -555,7 +555,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		}
 
 		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
 			err = -EEXIST;
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-- 
2.43.0



* [RFC PATCH 5/8] powerpc/mm: Allow hugepages without hugepd
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (3 preceding siblings ...)
  2024-03-25 14:55 ` [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 14:55 ` [RFC PATCH 6/8] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

In preparation for implementing huge pages on powerpc 8xx without
hugepd, enclose hugepd-related code inside an
#ifdef CONFIG_ARCH_HAS_HUGEPD.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/hugetlb.h        |  2 ++
 arch/powerpc/include/asm/nohash/pgtable.h |  8 +++++---
 arch/powerpc/mm/hugetlbpage.c             | 10 ++++++++++
 arch/powerpc/mm/pgtable.c                 |  2 ++
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index ea71f7245a63..a05657e5701b 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -30,10 +30,12 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 }
 #define is_hugepage_only_range is_hugepage_only_range
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 			    unsigned long end, unsigned long floor,
 			    unsigned long ceiling);
+#endif
 
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index 427db14292c9..ac3353f7f2ac 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -340,7 +340,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 #define pgprot_writecombine pgprot_noncached_wc
 
-#ifdef CONFIG_HUGETLB_PAGE
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static inline int hugepd_ok(hugepd_t hpd)
 {
 #ifdef CONFIG_PPC_8xx
@@ -351,6 +351,10 @@ static inline int hugepd_ok(hugepd_t hpd)
 #endif
 }
 
+#define is_hugepd(hpd)		(hugepd_ok(hpd))
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
 static inline int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -360,8 +364,6 @@ static inline int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-#define is_hugepd(hpd)		(hugepd_ok(hpd))
 #endif
 
 int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 66ac56b26007..db73ad845a2a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -42,6 +42,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long s
 	return __find_linux_pte(mm->pgd, addr, NULL, NULL);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int pdshift,
 			   unsigned int pshift, spinlock_t *ptl)
@@ -193,6 +194,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	return hugepte_offset(*hpdp, addr, pdshift);
 }
+#else
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, unsigned long sz)
+{
+	return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
+}
+#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
@@ -248,6 +256,7 @@ int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 	return __alloc_bootmem_huge_page(h, nid);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #ifndef CONFIG_PPC_BOOK3S_64
 #define HUGEPD_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
@@ -505,6 +514,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 		}
 	} while (addr = next, addr != end);
 }
+#endif
 
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..acdf64c9b93e 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -487,8 +487,10 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 	if (!hpdp)
 		return NULL;
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 	ret_pte = hugepte_offset(*hpdp, ea, pdshift);
 	pdshift = hugepd_shift(*hpdp);
+#endif
 out:
 	if (hpage_shift)
 		*hpage_shift = pdshift;
-- 
2.43.0



* [RFC PATCH 6/8] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (4 preceding siblings ...)
  2024-03-25 14:55 ` [RFC PATCH 5/8] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
@ 2024-03-25 14:55 ` Christophe Leroy
  2024-03-25 14:56 ` [RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages Christophe Leroy
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:55 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

set_huge_pte_at() expects the real page size, not the psize, which is
the index of the page definition in the mmu_psize_defs[] table.
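
For instance (illustration):

	/* MMU_PAGE_512K is an index into mmu_psize_defs[], not a byte count */
	unsigned long sz = 1UL << mmu_psize_to_shift(MMU_PAGE_512K);	/* 512k */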

Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/nohash/8xx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 6be6421086ed..70b4d807fda5 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 		return -EINVAL;
 
 	set_huge_pte_at(&init_mm, va, ptep,
-			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
+			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
+			1UL << mmu_psize_to_shift(psize));
 
 	return 0;
 }
-- 
2.43.0



* [RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (5 preceding siblings ...)
  2024-03-25 14:55 ` [RFC PATCH 6/8] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
@ 2024-03-25 14:56 ` Christophe Leroy
  2024-03-25 14:56 ` [RFC PATCH 8/8] powerpc/8xx: Add back support for 8M pages using contiguous PTE entries Christophe Leroy
  2024-03-25 16:38 ` [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Jason Gunthorpe
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:56 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

Remove support for 8M pages in order to stop using hugepd.

Support for 8M pages will be added back later using the same approach
as for 512k pages, that is, using contiguous page entries in the
regular page table.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/Kconfig                          |  1 -
 .../include/asm/nohash/32/hugetlb-8xx.h       | 30 -------------------
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 14 +--------
 arch/powerpc/include/asm/nohash/pgtable.h     |  4 ---
 arch/powerpc/include/asm/page.h               |  5 ----
 arch/powerpc/kernel/head_8xx.S                |  9 +-----
 arch/powerpc/mm/hugetlbpage.c                 |  3 --
 arch/powerpc/mm/nohash/8xx.c                  | 28 ++---------------
 arch/powerpc/mm/nohash/tlb.c                  |  3 --
 arch/powerpc/platforms/Kconfig.cputype        |  2 ++
 10 files changed, 7 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a68b9e637eda..74c038cf770c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,6 @@ config PPC
 	select ARCH_HAS_DMA_MAP_DIRECT 		if PPC_PSERIES
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
-	select ARCH_HAS_HUGEPD			if HUGETLB_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 92df40c6cc6b..178ed9fdd353 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -4,42 +4,12 @@
 
 #define PAGE_SHIFT_8M		23
 
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-	BUG_ON(!hugepd_ok(hpd));
-
-	return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-	return PAGE_SHIFT_8M;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-				    unsigned int pdshift)
-{
-	unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT;
-
-	return hugepd_page(hpd) + idx;
-}
-
 static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 				      unsigned long vmaddr)
 {
 	flush_tlb_page(vma, vmaddr);
 }
 
-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
-	*hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
-static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
-	*hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M);
-}
-
 static inline int check_and_get_huge_psize(int shift)
 {
 	return shift_to_mmu_psize(shift);
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 07df6b664861..004d7e825af2 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -142,15 +142,6 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *pt
 }
 #define __ptep_set_access_flags __ptep_set_access_flags
 
-static inline unsigned long pgd_leaf_size(pgd_t pgd)
-{
-	if (pgd_val(pgd) & _PMD_PAGE_8M)
-		return SZ_8M;
-	return SZ_4M;
-}
-
-#define pgd_leaf_size pgd_leaf_size
-
 static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
 	pte_basic_t val = pte_val(pte);
@@ -171,14 +162,11 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
  * For other page sizes, we have a single entry in the table.
  */
 static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
-static int hugepd_ok(hugepd_t hpd);
 
 static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
 {
 	if (!huge)
 		return PAGE_SIZE / SZ_4K;
-	else if (hugepd_ok(*((hugepd_t *)pmd)))
-		return 1;
 	else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
 		return SZ_16K / SZ_4K;
 	else
@@ -198,7 +186,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 
 	for (i = 0; i < num; i += PAGE_SIZE / SZ_4K, new += PAGE_SIZE) {
 		*entry++ = new;
-		if (IS_ENABLED(CONFIG_PPC_16K_PAGES) && num != 1) {
+		if (IS_ENABLED(CONFIG_PPC_16K_PAGES)) {
 			*entry++ = new;
 			*entry++ = new;
 			*entry++ = new;
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index ac3353f7f2ac..c4be7754e96f 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -343,12 +343,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static inline int hugepd_ok(hugepd_t hpd)
 {
-#ifdef CONFIG_PPC_8xx
-	return ((hpd_val(hpd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M);
-#else
 	/* We clear the top bit to indicate hugepd */
 	return (hpd_val(hpd) && (hpd_val(hpd) & PD_HUGE) == 0);
-#endif
 }
 
 #define is_hugepd(hpd)		(hugepd_ok(hpd))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e411e5a70ea3..018c3d55232c 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -293,13 +293,8 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 /*
  * Some number of bits at the level of the page table that points to
  * a hugepte are used to encode the size.  This masks those bits.
- * On 8xx, HW assistance requires 4k alignment for the hugepte.
  */
-#ifdef CONFIG_PPC_8xx
-#define HUGEPD_SHIFT_MASK     0xfff
-#else
 #define HUGEPD_SHIFT_MASK     0x3f
-#endif
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..b53af565b132 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -416,13 +416,11 @@ FixupDAR:/* Entry point for dcbx workaround. */
 3:
 	lwz	r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11)	/* Get the level 1 entry */
 	mtspr	SPRN_MD_TWC, r11
-	mtcrf	0x01, r11
 	mfspr	r11, SPRN_MD_TWC
 	lwz	r11, 0(r11)	/* Get the pte */
-	bt	28,200f		/* bit 28 = Large page (8M) */
 	/* concat physical page address(r11) and page offset(r10) */
 	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT, 31
-201:	lwz	r11,0(r11)
+	lwz	r11,0(r11)
 /* Check if it really is a dcbx instruction. */
 /* dcbt and dcbtst does not generate DTLB Misses/Errors,
  * no need to include them here */
@@ -441,11 +439,6 @@ FixupDAR:/* Entry point for dcbx workaround. */
 141:	mfspr	r10,SPRN_M_TW
 	b	DARFixed	/* Nope, go back to normal TLB processing */
 
-200:
-	/* concat physical page address(r11) and page offset(r10) */
-	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT_8M, 31
-	b	201b
-
 144:	mfspr	r10, SPRN_DSISR
 	rlwinm	r10, r10,0,7,5	/* Clear store bit for buggy dcbst insn */
 	mtspr	SPRN_DSISR, r10
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index db73ad845a2a..4e9fbd5b895d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,9 +183,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!hpdp)
 		return NULL;
 
-	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
-
 	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
 	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 70b4d807fda5..fc10e08bcb85 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -48,42 +48,22 @@ unsigned long p_block_mapped(phys_addr_t pa)
 	return 0;
 }
 
-static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va)
-{
-	if (hpd_val(*pmdp) == 0) {
-		pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
-
-		if (!ptep)
-			return NULL;
-
-		hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M);
-		hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M);
-	}
-	return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
-}
-
 static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 					     pgprot_t prot, int psize, bool new)
 {
 	pmd_t *pmdp = pmd_off_k(va);
 	pte_t *ptep;
 
-	if (WARN_ON(psize != MMU_PAGE_512K && psize != MMU_PAGE_8M))
+	if (WARN_ON(psize != MMU_PAGE_512K))
 		return -EINVAL;
 
 	if (new) {
 		if (WARN_ON(slab_is_available()))
 			return -EINVAL;
 
-		if (psize == MMU_PAGE_512K)
-			ptep = early_pte_alloc_kernel(pmdp, va);
-		else
-			ptep = early_hugepd_alloc_kernel((hugepd_t *)pmdp, va);
+		ptep = early_pte_alloc_kernel(pmdp, va);
 	} else {
-		if (psize == MMU_PAGE_512K)
-			ptep = pte_offset_kernel(pmdp, va);
-		else
-			ptep = hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
+		ptep = pte_offset_kernel(pmdp, va);
 	}
 
 	if (WARN_ON(!ptep))
@@ -130,8 +110,6 @@ static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top,
 
 	for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K)
 		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
-	for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M)
-		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
 	for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += SZ_512K)
 		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
 
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5ffa0af4328a..cb2afe39cee5 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -104,9 +104,6 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_512K] = {
 		.shift	= 19,
 	},
-	[MMU_PAGE_8M] = {
-		.shift	= 23,
-	},
 };
 #endif
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index b2d8c0da2ad9..fa4bb096b3ae 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
 	select FSL_EMB_PERFMON
 	bool
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 8/8] powerpc/8xx: Add back support for 8M pages using contiguous PTE entries
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (6 preceding siblings ...)
  2024-03-25 14:56 ` [RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages Christophe Leroy
@ 2024-03-25 14:56 ` Christophe Leroy
  2024-03-25 16:38 ` [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Jason Gunthorpe
  8 siblings, 0 replies; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 14:56 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu
  Cc: linux-mm, linuxppc-dev, linux-kernel

In order to fit better with the standard Linux page table layout, add
support for 8M pages using contiguous PTE entries in a standard
page table. Each of the two page tables covering an 8M page is then
populated with 1024 similar entries, and the two associated PMD
entries point to those page tables.

The PMD entries also get a flag telling that they address an 8M page;
this is required for the HW tablewalk assistance.
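
For illustration (not part of the patch), a minimal sketch of what this
replication amounts to for one 8M page, using the helpers from the diff
below; locking, set_pte_filter() and the pte_basic_t details are
omitted:

	static void sketch_set_8M_pte(struct mm_struct *mm, unsigned long addr,
				      pte_t pte)
	{
		pmd_t *pmdp = pmd_off(mm, addr);		/* addr is 8M aligned */
		pte_t *tbl0 = pte_offset_kernel(pmdp, 0);	/* first 4M half */
		pte_t *tbl1 = pte_offset_kernel(pmdp + 1, 0);	/* second 4M half */
		int i;

		for (i = 0; i < 1024; i++) {
			/* each cell maps the next 4k of the 8M page */
			tbl0[i] = __pte(pte_val(pte) + (unsigned long)i * SZ_4K);
			tbl1[i] = __pte(pte_val(pte) + SZ_4M + (unsigned long)i * SZ_4K);
		}
	}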

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/hugetlb.h            | 11 ++++-
 .../include/asm/nohash/32/hugetlb-8xx.h       | 28 +++++++++++-
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 43 +++++++++++++++++--
 arch/powerpc/include/asm/pgtable.h            |  1 +
 arch/powerpc/kernel/head_8xx.S                |  1 +
 arch/powerpc/mm/hugetlbpage.c                 | 12 +++++-
 arch/powerpc/mm/nohash/8xx.c                  | 31 ++++++++++---
 arch/powerpc/mm/nohash/tlb.c                  |  3 ++
 arch/powerpc/mm/pgtable.c                     | 24 +++++++----
 arch/powerpc/mm/pgtable_32.c                  |  2 +-
 11 files changed, 134 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index a05657e5701b..bd60ea134f8e 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -41,7 +41,16 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
-	return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+	pmd_t *pmdp = (pmd_t *)ptep;
+	pte_t pte;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte = __pte(pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1));
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
+	} else {
+		pte = __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+	}
+	return pte;
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 178ed9fdd353..1414cfd28987 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -15,6 +15,16 @@ static inline int check_and_get_huge_psize(int shift)
 	return shift_to_mmu_psize(shift);
 }
 
+#define __HAVE_ARCH_HUGE_PTEP_GET
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+	pmd_t *pmdp = (pmd_t *)ptep;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
+		ptep = pte_offset_kernel(pmdp, 0);
+	return ptep_get(ptep);
+}
+
 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		     pte_t pte, unsigned long sz);
@@ -23,7 +33,14 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 				  pte_t *ptep, unsigned long sz)
 {
-	pte_update(mm, addr, ptep, ~0UL, 0, 1);
+	pmd_t *pmdp = (pmd_t *)ptep;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1);
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
+	} else {
+		pte_update(mm, addr, ptep, ~0UL, 0, 1);
+	}
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
@@ -33,7 +50,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
 	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
 
-	pte_update(mm, addr, ptep, clr, set, 1);
+	pmd_t *pmdp = (pmd_t *)ptep;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, 1);
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, 1);
+	} else {
+		pte_update(mm, addr, ptep, clr, set, 1);
+	}
 }
 
 #ifdef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 11eac371e7e0..ff4f90cfb461 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -14,6 +14,7 @@
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
 
+#ifndef CONFIG_PPC_8xx
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
 				       pte_t *pte)
 {
@@ -31,5 +32,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 	else
 		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
 }
+#endif
 
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 004d7e825af2..b05cc4f87713 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -129,14 +129,23 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 }
 #define ptep_set_wrprotect ptep_set_wrprotect
 
+static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
+static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address);
+
 static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
 					   pte_t entry, unsigned long address, int psize)
 {
 	unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
 	unsigned long clr = ~pte_val(entry) & _PAGE_RO;
 	int huge = psize > mmu_virtual_psize ? 1 : 0;
+	pmd_t *pmdp = (pmd_t *)ptep;
 
-	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	if (pmdp == pmd_off(vma->vm_mm, ALIGN_DOWN(address, SZ_8M))) {
+		pte_update(vma->vm_mm, address, pte_offset_kernel(pmdp, 0), clr, set, huge);
+		pte_update(vma->vm_mm, address, pte_offset_kernel(pmdp + 1, 0), clr, set, huge);
+	} else {
+		pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	}
 
 	flush_tlb_page(vma, address);
 }
@@ -146,6 +155,8 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
 	pte_basic_t val = pte_val(pte);
 
+	if (pmd_val(pmd) & _PMD_PAGE_8M)
+		return SZ_8M;
 	if (val & _PAGE_HUGE)
 		return SZ_512K;
 	if (val & _PAGE_SPS)
@@ -159,14 +170,16 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
  * On the 8xx, the page tables are a bit special. For 16k pages, we have
  * 4 identical entries. For 512k pages, we have 128 entries as if it was
  * 4k pages, but they are flagged as 512k pages for the hardware.
- * For other page sizes, we have a single entry in the table.
+ * For 8M pages, we have 1024 entries as if it was
+ * 4k pages, but they are flagged as 8M pages for the hardware.
+ * For 4k pages, we have a single entry in the table.
  */
-static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
-
 static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
 {
 	if (!huge)
 		return PAGE_SIZE / SZ_4K;
+	else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M)
+		return SZ_4M / SZ_4K;
 	else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
 		return SZ_16K / SZ_4K;
 	else
@@ -209,6 +222,28 @@ static inline pte_t ptep_get(pte_t *ptep)
 }
 #endif /* CONFIG_PPC_16K_PAGES */
 
+static inline void pmd_populate_kernel_size(struct mm_struct *mm, pmd_t *pmdp,
+					    pte_t *pte, unsigned long sz)
+{
+	if (sz == SZ_8M)
+		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_PAGE_8M);
+	else
+		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT);
+}
+
+static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
+				     pgtable_t pte_page, unsigned long sz)
+{
+	if (sz == SZ_8M)
+		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
+	else
+		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
+}
+#define pmd_populate_size pmd_populate_size
+
+#define pmd_populate(mm, pmdp, pte) pmd_populate_size(mm, pmdp, pte, PAGE_SIZE)
+#define pmd_populate_kernel(mm, pmdp, pte) pmd_populate_kernel_size(mm, pmdp, pte, PAGE_SIZE)
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 239709a2f68e..005dad336565 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -106,6 +106,7 @@ unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
 void pgtable_cache_add(unsigned int shift);
 
+void __init *early_alloc_pgtable(unsigned long size);
 pte_t *early_pte_alloc_kernel(pmd_t *pmdp, unsigned long va);
 
 #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_PPC32)
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index b53af565b132..43919ae0bd11 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -415,6 +415,7 @@ FixupDAR:/* Entry point for dcbx workaround. */
 	oris	r11, r11, (swapper_pg_dir - PAGE_OFFSET)@ha
 3:
 	lwz	r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11)	/* Get the level 1 entry */
+	rlwinm	r11, r11, 0, ~_PMD_PAGE_8M
 	mtspr	SPRN_MD_TWC, r11
 	mfspr	r11, SPRN_MD_TWC
 	lwz	r11, 0(r11)	/* Get the pte */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 4e9fbd5b895d..dd29845ce0ce 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -195,7 +195,17 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
-	return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
+	pmd_t *pmd = pmd_off(mm, addr);
+
+	if (sz == SZ_512K)
+		return pte_alloc_huge(mm, pmd, addr, sz);
+	if (sz != SZ_8M)
+		return NULL;
+	if (!pte_alloc_huge(mm, pmd, addr, sz))
+		return NULL;
+	if (!pte_alloc_huge(mm, pmd + 1, addr, sz))
+		return NULL;
+	return (pte_t *)pmd;
 }
 #endif
 
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index fc10e08bcb85..b416bfc161d4 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -54,25 +54,40 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 	pmd_t *pmdp = pmd_off_k(va);
 	pte_t *ptep;
 
-	if (WARN_ON(psize != MMU_PAGE_512K))
+	if (WARN_ON(psize != MMU_PAGE_512K && psize != MMU_PAGE_8M))
 		return -EINVAL;
 
 	if (new) {
 		if (WARN_ON(slab_is_available()))
 			return -EINVAL;
 
-		ptep = early_pte_alloc_kernel(pmdp, va);
+		if (psize == MMU_PAGE_8M) {
+			if (WARN_ON(!pmd_none(*pmdp) || !pmd_none(*(pmdp + 1))))
+				return -EINVAL;
+
+			ptep = early_alloc_pgtable(PTE_FRAG_SIZE);
+			pmd_populate_kernel_size(&init_mm, pmdp, ptep, SZ_8M);
+
+			ptep = early_alloc_pgtable(PTE_FRAG_SIZE);
+			pmd_populate_kernel_size(&init_mm, pmdp + 1, ptep, SZ_8M);
+
+			ptep = (pte_t *)pmdp;
+		} else {
+			ptep = early_pte_alloc_kernel(pmdp, va);
+			/* The PTE should never be already present */
+			if (WARN_ON(pte_present(*ptep) && pgprot_val(prot)))
+				return -EINVAL;
+		}
 	} else {
-		ptep = pte_offset_kernel(pmdp, va);
+		if (psize == MMU_PAGE_8M)
+			ptep = (pte_t *)pmdp;
+		else
+			ptep = pte_offset_kernel(pmdp, va);
 	}
 
 	if (WARN_ON(!ptep))
 		return -ENOMEM;
 
-	/* The PTE should never be already present */
-	if (new && WARN_ON(pte_present(*ptep) && pgprot_val(prot)))
-		return -EINVAL;
-
 	set_huge_pte_at(&init_mm, va, ptep,
 			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
 			1UL << mmu_psize_to_shift(psize));
@@ -110,6 +125,8 @@ static void mmu_mapin_ram_chunk(unsigned long offset, unsigned long top,
 
 	for (; p < ALIGN(p, SZ_8M) && p < top; p += SZ_512K, v += SZ_512K)
 		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
+	for (; p < ALIGN_DOWN(top, SZ_8M) && p < top; p += SZ_8M, v += SZ_8M)
+		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
 	for (; p < ALIGN_DOWN(top, SZ_512K) && p < top; p += SZ_512K, v += SZ_512K)
 		__early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
 
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index cb2afe39cee5..5ffa0af4328a 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -104,6 +104,9 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_512K] = {
 		.shift	= 19,
 	},
+	[MMU_PAGE_8M] = {
+		.shift	= 23,
+	},
 };
 #endif
 
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index acdf64c9b93e..59f0d7706d2f 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -297,11 +297,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 }
 
 #if defined(CONFIG_PPC_8xx)
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
-		     pte_t pte, unsigned long sz)
+static void __set_huge_pte_at(pmd_t *pmd, pte_t *ptep, pte_basic_t val)
 {
-	pmd_t *pmd = pmd_off(mm, addr);
-	pte_basic_t val;
 	pte_basic_t *entry = (pte_basic_t *)ptep;
 	int num, i;
 
@@ -311,15 +308,26 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 	 */
 	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
 
-	pte = set_pte_filter(pte, addr);
-
-	val = pte_val(pte);
-
 	num = number_of_cells_per_pte(pmd, val, 1);
 
 	for (i = 0; i < num; i++, entry++, val += SZ_4K)
 		*entry = val;
 }
+
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte, unsigned long sz)
+{
+	pmd_t *pmdp = pmd_off(mm, addr);
+
+	pte = set_pte_filter(pte, addr);
+
+	if (sz == SZ_8M) {
+		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte));
+		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M);
+	} else {
+		__set_huge_pte_at(pmdp, ptep, pte_val(pte));
+	}
+}
 #endif
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index face94977cb2..0b1d68ef87cd 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -48,7 +48,7 @@ notrace void __init early_ioremap_init(void)
 	early_ioremap_setup();
 }
 
-static void __init *early_alloc_pgtable(unsigned long size)
+void __init *early_alloc_pgtable(unsigned long size)
 {
 	void *ptr = memblock_alloc(size, size);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-25 14:55 ` [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate() Christophe Leroy
@ 2024-03-25 16:19   ` Jason Gunthorpe
  2024-03-25 19:05     ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2024-03-25 16:19 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Mon, Mar 25, 2024 at 03:55:54PM +0100, Christophe Leroy wrote:
> Unlike many architectures, powerpc 8xx hardware tablewalk requires
> a two level process for all page sizes, although the second level only
> has one entry when the pagesize is 8M.
> 
> To fit with Linux page table topology and without requiring special
> page directory layout like hugepd, the page entry will be replicated
> 1024 times in the standard page table. However for large pages it is
> necessary to set bits in the level-1 (PMD) entry. At the time being,
> for 512k pages the flag is kept in the PTE and inserted in the PMD
> entry at TLB miss exception, that is necessary because we can have
> pages of different sizes in a page table. However the 12 PTE bits are
> fully used and there is no room for an additional bit for page size.
> 
> For 8M pages, there will be only one page per PMD entry, it is
> therefore possible to flag the pagesize in the PMD entry, with the
> advantage that the information will already be at the right place for
> the hardware.
> 
> To do so, add a new helper called pmd_populate_size() which takes the
> page size as an additional argument, and modify __pte_alloc() to also
> take that argument. pte_alloc() is left unmodified in order to
> reduce churn on callers, and a pte_alloc_size() is added for use by
> pte_alloc_huge().
> 
> When an architecture doesn't provide pmd_populate_size(),
> pmd_populate() is used as a fallback.

I think it would be a good idea to document what the semantics of sz
are supposed to be?

Just a general remark, probably nothing for this, but with these new
arguments the historical naming seems pretty tortured for
pte_alloc_size()... Something like pmd_populate_leaf(size) as a naming
scheme would make this more intuitive. I.e. pmd_populate_leaf() gives
you a PMD entry where the entry points to a leaf page table able to
store folios of at least that size.
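
A rough sketch of what that could look like (pmd_populate_leaf() is a
hypothetical name here, not an existing helper; it would just wrap the
pmd_populate_size() this series introduces):

	/* Hypothetical naming sketch, not existing kernel code. */
	static inline void pmd_populate_leaf(struct mm_struct *mm, pmd_t *pmdp,
					     pgtable_t pte_page, unsigned long leaf_sz)
	{
		/* leaf_sz: the smallest folio size the new table must hold */
		pmd_populate_size(mm, pmdp, pte_page, leaf_sz);
	}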

Anyhow, I thought the edits to the mm helpers were fine, certainly
much nicer than hugepd. Do you see a path to remove hugepd entirely
from here?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get()
  2024-03-25 14:55 ` [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
@ 2024-03-25 16:35   ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2024-03-25 16:35 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Mon, Mar 25, 2024 at 03:55:57PM +0100, Christophe Leroy wrote:

>  arch/arm64/include/asm/hugetlb.h |  2 +-
>  fs/hugetlbfs/inode.c             |  2 +-
>  fs/proc/task_mmu.c               |  8 +++---
>  fs/userfaultfd.c                 |  2 +-
>  include/asm-generic/hugetlb.h    |  2 +-
>  include/linux/swapops.h          |  2 +-
>  mm/damon/vaddr.c                 |  6 ++---
>  mm/gup.c                         |  2 +-
>  mm/hmm.c                         |  2 +-
>  mm/hugetlb.c                     | 46 ++++++++++++++++----------------
>  mm/memory-failure.c              |  2 +-
>  mm/mempolicy.c                   |  2 +-
>  mm/migrate.c                     |  4 +--
>  mm/mincore.c                     |  2 +-
>  mm/userfaultfd.c                 |  2 +-
>  15 files changed, 43 insertions(+), 43 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
> index 2ddc33d93b13..1af39a74e791 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>  extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>  			   pte_t *ptep, unsigned long sz);
>  #define __HAVE_ARCH_HUGE_PTEP_GET
> -extern pte_t huge_ptep_get(pte_t *ptep);
> +extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);

The header changed but not the implementation? This will need to cover
riscv and s390 too.

Though, really, I think the right path is to work toward removing
huge_ptep_get() from the arch code..

riscv and arm are doing the same thing - propagating dirty/young bits
from the contig PTEs to the result. The core code can do this, maybe
with an ARCH #define opt-in.
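
A sketch of what such core code might look like, loosely modelled on
what arm64's huge_ptep_get() does today (the ncontig computation and
the opt-in mechanism are left out, this just shows the bit folding):

	/* Fold dirty/young from every PTE of the contig range into the
	 * value returned for the whole huge page. */
	static pte_t generic_contig_ptep_get(pte_t *ptep, int ncontig)
	{
		pte_t orig_pte = ptep_get(ptep);
		int i;

		for (i = 0; i < ncontig; i++, ptep++) {
			pte_t pte = ptep_get(ptep);

			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);
		}
		return orig_pte;
	}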

s390.. Ouchy - is this because hugetlb wants to pretend that every
level is encoded as a PTE so it takes the PGD and recodes the flags to
the PTE layout??

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
                   ` (7 preceding siblings ...)
  2024-03-25 14:56 ` [RFC PATCH 8/8] powerpc/8xx: Add back support for 8M pages using contiguous PTE entries Christophe Leroy
@ 2024-03-25 16:38 ` Jason Gunthorpe
  2024-04-11 16:15   ` Peter Xu
  8 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2024-03-25 16:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages without hugepd on powerpc 8xx.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> referred to by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
> a lower size. I hope this can be handled by core-mm in the future.
> 
> There are probably several ways to implement stuff, so feedback is
> very welcome.

I thought it looks pretty good!

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-25 16:19   ` Jason Gunthorpe
@ 2024-03-25 19:05     ` Christophe Leroy
  2024-03-26 15:01       ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-03-25 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu



On 25/03/2024 at 17:19, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:54PM +0100, Christophe Leroy wrote:
>> Unlike many architectures, powerpc 8xx hardware tablewalk requires
>> a two level process for all page sizes, although the second level only
>> has one entry when the pagesize is 8M.
>>
>> To fit with Linux page table topology and without requiring special
>> page directory layout like hugepd, the page entry will be replicated
>> 1024 times in the standard page table. However for large pages it is
>> necessary to set bits in the level-1 (PMD) entry. At the time being,
>> for 512k pages the flag is kept in the PTE and inserted in the PMD
>> entry at TLB miss exception, that is necessary because we can have
>> pages of different sizes in a page table. However the 12 PTE bits are
>> fully used and there is no room for an additional bit for page size.
>>
>> For 8M pages, there will be only one page per PMD entry, it is
>> therefore possible to flag the pagesize in the PMD entry, with the
>> advantage that the information will already be at the right place for
>> the hardware.
>>
>> To do so, add a new helper called pmd_populate_size() which takes the
>> page size as an additional argument, and modify __pte_alloc() to also
>> take that argument. pte_alloc() is left unmodified in order to
>> reduce churn on callers, and a pte_alloc_size() is added for use by
>> pte_alloc_huge().
>>
>> When an architecture doesn't provide pmd_populate_size(),
>> pmd_populate() is used as a fallback.
> 
> I think it would be a good idea to document what the semantic is
> supposed to be for sz?
> 
> Just a general remark, probably nothing for this, but with these new
> arguments the historical naming seems pretty tortured for
> pte_alloc_size().. Something like pmd_populate_leaf(size) as a naming
> scheme would make this more intuitive. Ie pmd_populate_leaf() gives
> you a PMD entry where the entry points to a leaf page table able to
> store folios of at least size.
> 
> Anyhow, I thought the edits to the mm helpers were fine, certainly
> much nicer than hugepd. Do you see a path to remove hugepd entirely
> from here?

Not looked into details yet, but I guess so.

By the way there is a wiki dedicated to huge pages on powerpc, you can 
have a look at it here:
https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find 
good ideas there to help me.

Christophe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-25 19:05     ` Christophe Leroy
@ 2024-03-26 15:01       ` Jason Gunthorpe
  2024-03-27  9:58         ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2024-03-26 15:01 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Mon, Mar 25, 2024 at 07:05:01PM +0000, Christophe Leroy wrote:

> Not looked into details yet, but I guess so.
> 
> By the way there is a wiki dedicated to huge pages on powerpc, you can 
> have a look at it here : 
> https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find 
> good ideas there to help me.

There sure are a lot of page table types here

I'm a bit wondering about terminology, eg on the first diagram "huge
pte entry" means a PUD entry that is a leaf? Which ones are contiguous
replications?

Just general remarks on the ones with huge pages:

 hash 64k and hugepage 16M/16G
 radix 64k/radix hugepage 2M/1G
 radix 4k/radix hugepage 2M/1G
 nohash 32
  - I think this is just a normal x86 like scheme? PMD/PUD can be a
    leaf with the same size as a next level table. 

    Do any of these cases need to know the higher level to parse the
    lower? eg is there a 2M bit in the PUD indicating that the PMD 
    is a table of 2M leafs or does each PMD entry have a bit
    indicating it is a leaf?

 hash 4k and hugepage 16M/16G
 nohash 64
  - How does this work? I guess since 8xx explicitly calls out
    consecutive, this actually means the pgd can point to 512 256M
    entries or 8 16G entries? I.e. the table size at each level is
    variable? Or is it the same and the table size is still 512 and
    each 16G entry is replicated 64 times?

    Do the offset accessors already abstract this enough?

 8xx 4K
 8xx 16K
   - As this series does?

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-26 15:01       ` Jason Gunthorpe
@ 2024-03-27  9:58         ` Christophe Leroy
  2024-03-27 16:57           ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-03-27  9:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu



On 26/03/2024 at 16:01, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 07:05:01PM +0000, Christophe Leroy wrote:
> 
>> Not looked into details yet, but I guess so.
>>
>> By the way there is a wiki dedicated to huge pages on powerpc, you can
>> have a look at it here :
>> https://github.com/linuxppc/wiki/wiki/Huge-pages , maybe you'll find
>> good ideas there to help me.
> 
> There sure are a lot of page table types here
> 
> I'm a bit wondering about terminology, eg on the first diagram "huge
> pte entry" means a PUD entry that is a leaf? Which ones are contiguous
> replications?

Yes, on the first diagram, a huge pte entry covering the same size as a
pud entry means a leaf PUD entry.

Contiguous replications are only on 8xx for the time being and are 
displayed as "consecutive entries".

> 
> Just general remarks on the ones with huge pages:
> 
>   hash 64k and hugepage 16M/16G
>   radix 64k/radix hugepage 2M/1G
>   radix 4k/radix hugepage 2M/1G
>   nohash 32
>    - I think this is just a normal x86 like scheme? PMD/PUD can be a
>      leaf with the same size as a next level table.
> 
>      Do any of these cases need to know the higher level to parse the
>      lower? eg is there a 2M bit in the PUD indicating that the PMD
>      is a table of 2M leaves or does each PMD entry have a bit
>      indicating it is a leaf?

For hash and radix there is a bit that tells it is a leaf (_PAGE_PTE).

For nohash32/e500 I think the drawing is not fully right; there is a huge
page directory (hugepd) with a single entry. I think it should be
possible to change it to a leaf entry; it seems we have bit _PAGE_SW1
available in the PTE.

> 
>   hash 4k and hugepage 16M/16G
>   nohash 64
>    - How does this work? I guess since 8xx explicitly calls out
>      consecutive this is actually the pgd can point to 512 256M
>      entries or 8 16G entries? Ie the table size at each level is
>      varable? Or is it the same and the table size is still 512 and
>      each 16G entry is replicated 64 times?

For those it is using the huge page directory (hugepd), which can be
hooked at any level and is a directory of huge pages on its own. There
are no consecutive entries involved here I think, although I'm not
completely sure.

For hash4k I'm not sure how it works, this was changed by commit 
e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a 
different page table format")

For the nohash/64, a PGD entry points either to a regular PUD directory 
or to a HUGEPD directory. The size of the HUGEPD directory is encoded in 
the 6 lower bits of the PGD entry.
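
For illustration, decoding such an entry could look like this (a sketch
only: the helper name is made up, and the exact semantics should be
checked against hugepd_shift() and HUGEPD_SHIFT_MASK in the existing
code):

	/* nohash/64: the 6 low bits of a hugepd-style PGD entry encode a
	 * shift (HUGEPD_SHIFT_MASK is 0x3f), from which the geometry of
	 * the directory is derived. */
	static inline unsigned int sketch_hugepd_bits(unsigned long entry)
	{
		return entry & 0x3f;
	}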

> 
>      Do the offset accessors already abstract this enough?
> 
>   8xx 4K
>   8xx 16K
>     - As this series does?

This is how it is prior to the series, ie 16k and 512k pages are 
implemented as contiguous PTEs in a standard page table while 8M pages 
are implemented with hugepd and a single entry in it (with two PGD 
entries pointing to the same huge page directory.

Christophe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-27  9:58         ` Christophe Leroy
@ 2024-03-27 16:57           ` Jason Gunthorpe
  2024-04-03 18:24             ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2024-03-27 16:57 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Wed, Mar 27, 2024 at 09:58:35AM +0000, Christophe Leroy wrote:
> > Just general remarks on the ones with huge pages:
> > 
> >   hash 64k and hugepage 16M/16G
> >   radix 64k/radix hugepage 2M/1G
> >   radix 4k/radix hugepage 2M/1G
> >   nohash 32
> >    - I think this is just a normal x86 like scheme? PMD/PUD can be a
> >      leaf with the same size as a next level table.
> > 
> >      Do any of these cases need to know the higher level to parse the
> >      lower? eg is there a 2M bit in the PUD indicating that the PMD
> >      is a table of 2M leaves or does each PMD entry have a bit
> >      indicating it is a leaf?
> 
> For hash and radix there is a bit that tells it is a leaf (_PAGE_PTE).
> 
> For nohash32/e500 I think the drawing is not fully right; there is a huge 
> page directory (hugepd) with a single entry. I think it should be 
> possible to change it to a leaf entry; it seems we have bit _PAGE_SW1 
> available in the PTE.

It sounds to me like PPC breaks down into only a couple of fundamental
behaviors:
 - x86 like leaf in many page levels. Use the pgd/pud/pmd_leaf() and
   related helpers to implement it
 - ARM like contig PTE within a single page table level. Use the
   contig stuff to implement it
 - Contig PTE across two page table levels with a bit in the
   PMD. Needs new support like you showed (see the sketch below)
 - Page table levels with a variable page size. I.e. a PUD can point to
   a directory of 8 pages or 512 pages of different size. Probably
   needs some new core support, but I think your changes to the
   *_offset helpers go a long way already.
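
A sketch of that third case from the walker side, using the
pte_leaf_size(pmd, pte) form this series introduces (names as in the
8xx patches; just an illustration, not tested code):

	/* A PMD entry flagged _PMD_PAGE_8M means the PTE level below is
	 * a contig leaf spanning this PMD slot and the next one. */
	static unsigned long sketch_walk_leaf_size(pmd_t *pmdp, pte_t *ptep)
	{
		return pte_leaf_size(*pmdp, ptep_get(ptep));	/* SZ_8M here */
	}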

> > 
> >   hash 4k and hugepage 16M/16G
> >   nohash 64
> >    - How does this work? I guess since 8xx explicitly calls out
> >      consecutive, this actually means the pgd can point to 512 256M
> >      entries or 8 16G entries? I.e. the table size at each level is
> >      variable? Or is it the same and the table size is still 512 and
> >      each 16G entry is replicated 64 times?
> 
> For those it is using the huge page directory (hugepd) which can be 
> hooked at any level and is a directory of huge pages on its own. There 
> > are no consecutive entries involved here I think, although I'm not
> completely sure.
> 
> For hash4k I'm not sure how it works, this was changed by commit 
> e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a 
> different page table format")
> 
> For the nohash/64, a PGD entry points either to a regular PUD directory 
> or to a HUGEPD directory. The size of the HUGEPD directory is encoded in 
> the 6 lower bits of the PGD entry.

If it is a software walker there might be value in just aligning to
the contig pte scheme in all levels and forgetting about the variable
size page table levels. That quarter page stuff is a PITA to manage
the memory allocation for on PPC anyhow..

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-03-27 16:57           ` Jason Gunthorpe
@ 2024-04-03 18:24             ` Christophe Leroy
  2024-04-04 11:46               ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-04-03 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu



On 27/03/2024 at 17:57, Jason Gunthorpe wrote:
> On Wed, Mar 27, 2024 at 09:58:35AM +0000, Christophe Leroy wrote:
>>> Just general remarks on the ones with huge pages:
>>>
>>>    hash 64k and hugepage 16M/16G
>>>    radix 64k/radix hugepage 2M/1G
>>>    radix 4k/radix hugepage 2M/1G
>>>    nohash 32
>>>     - I think this is just a normal x86 like scheme? PMD/PUD can be a
>>>       leaf with the same size as a next level table.
>>>
>>>       Do any of these cases need to know the higher level to parse the
>>>       lower? eg is there a 2M bit in the PUD indicating that the PMD
>>>       is a table of 2M leaves or does each PMD entry have a bit
>>>       indicating it is a leaf?
>>
>> For hash and radix there is a bit that tells it is a leaf (_PAGE_PTE).
>>
>> For nohash32/e500 I think the drawing is not fully right; there is a huge
>> page directory (hugepd) with a single entry. I think it should be
>> possible to change it to a leaf entry; it seems we have bit _PAGE_SW1
>> available in the PTE.
> 
> It sounds to me like PPC breaks down into only a couple fundamental
> behaviors
>   - x86 like leaf in many page levels. Use the pgd/pud/pmd_leaf() and
>     related to implement it
>   - ARM like contig PTE within a single page table level. Use the
>     contig stuff to implement it
>   - Contig PTE across two page table levels with a bit in the
>     PMD. Needs new support like you showed
>   - Page table levels with a variable page size. I.e. a PUD can point to
>     a directory of 8 pages or 512 pages of different size. Probably
>     needs some new core support, but I think your changes to the
>     *_offset go a long way already.
> 
>>>
>>>    hash 4k and hugepage 16M/16G
>>>    nohash 64
>>>     - How does this work? I guess since 8xx explicitly calls out
>>>       consecutive, this actually means the pgd can point to 512 256M
>>>       entries or 8 16G entries? I.e. the table size at each level is
>>>       variable? Or is it the same and the table size is still 512 and
>>>       each 16G entry is replicated 64 times?
>>
>> For those it is using the huge page directory (hugepd) which can be
>> hooked at any level and is a directory of huge pages on its own. There
>> are no consecutive entries involved here I think, although I'm not
>> completely sure.
>>
>> For hash4k I'm not sure how it works, this was changed by commit
>> e2b3d202d1db ("powerpc: Switch 16GB and 16MB explicit hugepages to a
>> different page table format")
>>
>> For the nohash/64, a PGD entry points either to a regular PUD directory
>> or to a HUGEPD directory. The size of the HUGEPD directory is encoded in
>> the 6 lower bits of the PGD entry.
> 
> If it is a software walker there might be value in just aligning to
> the contig pte scheme in all levels and forgetting about the variable
> size page table levels. That quarter page stuff is a PITA to manage
> the memory allocation for on PPC anyhow..

Looking one step further, into nohash/32, I see a challenge: on that 
platform, a PTE is 64 bits while a PGD/PMD entry is 32 bits. It is 
therefore not possible as such to do PMD leaf or cont-PMD leaf.

I see two possible solutions:
- Double the size of PGD/PMD entries, but then we lose atomicity when
reading or writing an entry; could this be a problem?
- Do as for the 8xx, i.e. go down to PTEs even for pages greater than 4M.

Any thoughts?

Christophe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate()
  2024-04-03 18:24             ` Christophe Leroy
@ 2024-04-04 11:46               ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2024-04-04 11:46 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Peter Xu

On Wed, Apr 03, 2024 at 06:24:38PM +0000, Christophe Leroy wrote:
> > If it is a software walker there might be value in just aligning to
> > the contig pte scheme in all levels and forgetting about the variable
> > size page table levels. That quarter page stuff is a PITA to manage
> > the memory allocation for on PPC anyhow..
> 
> Looking one step further, into nohash/32, I see a challenge: on that 
> platform, a PTE is 64 bits while a PGD/PMD entry is 32 bits. It is 
> therefore not possible as such to do PMD leaf or cont-PMD leaf.

Hmm, maybe not, I have a feeling you can hide this detail in the
pmd_offset routine if you pass in the PGD information too.

> - Double the size of PGD/PMD entries, but then we lose atomicity when
> reading or writing an entry; could this be a problem?

How does the 64 bit PTE work then? We have ignored this bug on x86 32
bit, but there is a generally difficult race with 64 bit atomicity on 32
bit CPUs in the page tables.

Ideally you'd have 64 bit entries at the PMD level that encode the
page size the same as the PTE level. So you hit any level and you know
your size. This is less memory efficient (though every other arch
tolerates this) in general cases.

Can you throw away some bits of PA in the 32 bit entries to signal a
size?
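
A sketch of that idea (the bit assignments are entirely made up, just
to show the shape; 8xx already does something similar with
_PMD_PAGE_8M):

	/* Hypothetical: a leaf's physical address is aligned to its size,
	 * so a couple of low bits of a 32-bit PMD entry are free to
	 * encode that size. */
	#define _PMD_LEAF_SZ_MASK	0x3UL
	static inline unsigned long sketch_pmd_leaf_size(unsigned long entry)
	{
		switch (entry & _PMD_LEAF_SZ_MASK) {
		case 1: return SZ_512K;	/* example encodings only */
		case 2: return SZ_8M;
		default: return 0;	/* not a leaf */
		}
	}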

> - Do as for the 8xx, i.e. go down to PTEs even for pages greater than 4M.

Aside from the memory waste, this is the most logical thing, go down
far enough that you can encode the desired page size in the PTE and
use the contig PTE scheme.

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-03-25 16:38 ` [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Jason Gunthorpe
@ 2024-04-11 16:15   ` Peter Xu
  2024-04-12 14:08     ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-04-11 16:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel

On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> > This series reimplements hugepages with hugepd on powerpc 8xx.
> > 
> > Unlike most architectures, powerpc 8xx HW requires a two-level
> > pagetable topology for all page sizes. So a leaf PMD-contig approach
> > is not feasible as such.
> > 
> > Possible sizes are 4k, 16k, 512k and 8M.
> > 
> > First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> > must point to a single entry level-2 page table. Until now that was
> > done using hugepd. This series changes it to use standard page tables
> > where the entry is replicated 1024 times on each of the two pagetables
> > refered by the two associated PMD entries for that 8M page.
> > 
> > At the moment it has to look into each helper to know if the
> > hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> > a lower size. I hope this can me handled by core-mm in the future.
> > 
> > There are probably several ways to implement stuff, so feedback is
> > very welcome.
> 
> I thought it looks pretty good!

I second it.

I saw the discussions in patch 1.  Christophe, I suppose you're exploring
the big hammer over hugepd, and perhaps already went with the 32bit pmd
solution for the nohash/32bit challenge you mentioned?

I'm trying to position my next step; it seems like at least I should not
add any more hugepd code; then should I go with ARCH_HAS_HUGEPD checks,
or are you going to have an RFC soon that I can base on top of?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-04-11 16:15   ` Peter Xu
@ 2024-04-12 14:08     ` Christophe Leroy
  2024-04-12 14:30       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-04-12 14:08 UTC (permalink / raw)
  To: Peter Xu, Jason Gunthorpe
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel



On 11/04/2024 at 18:15, Peter Xu wrote:
> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>
>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>> is not feasible as such.
>>>
>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>
>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>> must point to a single entry level-2 page table. Until now that was
>>> done using hugepd. This series changes it to use standard page tables
>>> where the entry is replicated 1024 times on each of the two pagetables
>>> referred to by the two associated PMD entries for that 8M page.
>>>
>>> At the moment it has to look into each helper to know if the
>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>> a lower size. I hope this can be handled by core-mm in the future.
>>>
>>> There are probably several ways to implement stuff, so feedback is
>>> very welcome.
>>
>> I thought it looks pretty good!
> 
> I second it.
> 
> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> solution for nohash/32bit challenge you mentioned?
> 
> I'm trying to position my next step; it seems like at least I should not
> add any more hugepd code; then should I go with ARCH_HAS_HUGEPD checks,
> or are you going to have an RFC soon that I can base on top of?

Depends on what you expect by "soon".

I sure won't be able to send any RFC before end of April.

Should be possible to have something during May.

Christophe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-04-12 14:08     ` Christophe Leroy
@ 2024-04-12 14:30       ` Peter Xu
  2024-04-15 19:12         ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-04-12 14:30 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Jason Gunthorpe

On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
> 
> 
> On 11/04/2024 at 18:15, Peter Xu wrote:
> > On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>> This series reimplements hugepages without hugepd on powerpc 8xx.
> >>>
> >>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>> is not feasible as such.
> >>>
> >>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>
> >>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> >>> must point to a single entry level-2 page table. Until now that was
> >>> done using hugepd. This series changes it to use standard page tables
> >>> where the entry is replicated 1024 times on each of the two pagetables
> >>> referred to by the two associated PMD entries for that 8M page.
> >>>
> >>> At the moment it has to look into each helper to know if the
> >>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
> >>> a lower size. I hope this can be handled by core-mm in the future.
> >>>
> >>> There are probably several ways to implement stuff, so feedback is
> >>> very welcome.
> >>
> >> I thought it looks pretty good!
> > 
> > I second it.
> > 
> > I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> > the big hammer over hugepd, and perhaps went already with the 32bit pmd
> > solution for nohash/32bit challenge you mentioned?
> > 
> > I'm trying to position my next step; it seems like at least I should not
> > add any more hugepd code; then should I go with ARCH_HAS_HUGEPD checks,
> > or are you going to have an RFC soon that I can base on top of?
> 
> Depends on what you expect by "soon".
> 
> I sure won't be able to send any RFC before end of April.
> 
> Should be possible to have something during May.

That's good enough, thanks.  I'll see what is the best I can do.

Then do you think I can leave p4d/pgd leaves alone?  Please check the other
email where I'm not sure whether pgd leaves ever existed for any of
PowerPC.  That's so far what I plan to do: teaching pgtable walkers to
recognize pud and lower for all leaves.  Then if Power can switch from
hugepd to this it should just work.

Even if pgd exists (then something I overlooked..), I'm wondering whether
we can push that downwards to be either pud/pmd (and it looks like we all
agree p4d is never used on Power).  That may involve some pgtable
operations moving from pgd level to lower, e.g. my pure imagination would
look like starting with:

#define PTE_INDEX_SIZE	PTE_SHIFT
#define PMD_INDEX_SIZE	0
#define PUD_INDEX_SIZE	0
#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)

To:

#define PTE_INDEX_SIZE	PTE_SHIFT
#define PMD_INDEX_SIZE	(32 - PMD_SHIFT)
#define PUD_INDEX_SIZE	0
#define PGD_INDEX_SIZE	0

And the rest will need care too.  I hope moving downward is easier
(e.g. the walker should always exist for lower levels but not always for
higher levels), but I actually have little idea on whether there are any
other implications, so please bear with me on stupid mistakes.

I just hope pgd leaves don't exist already, then I think it'll be simpler.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-04-12 14:30       ` Peter Xu
@ 2024-04-15 19:12         ` Christophe Leroy
  2024-04-16 10:58           ` Christophe Leroy
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-04-15 19:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Jason Gunthorpe



On 12/04/2024 at 16:30, Peter Xu wrote:
> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>
>>
>> On 11/04/2024 at 18:15, Peter Xu wrote:
>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>>>
>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>> is not feasible as such.
>>>>>
>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>
>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>>>> must point to a single entry level-2 page table. Until now that was
>>>>> done using hugepd. This series changes it to use standard page tables
>>>>> where the entry is replicated 1024 times on each of the two pagetables
>>>>> referred to by the two associated PMD entries for that 8M page.
>>>>>
>>>>> At the moment it has to look into each helper to know if the
>>>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>>>> a lower size. I hope this can be handled by core-mm in the future.
>>>>>
>>>>> There are probably several ways to implement stuff, so feedback is
>>>>> very welcome.
>>>>
>>>> I thought it looks pretty good!
>>>
>>> I second it.
>>>
>>> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>> solution for nohash/32bit challenge you mentioned?
>>>
>>> I'm trying to position my next step; it seems like at least I should not
>>> add any more hugepd code; then should I go with ARCH_HAS_HUGEPD checks,
>>> or are you going to have an RFC soon that I can base on top of?
>>
>> Depends on what you expect by "soon".
>>
>> I sure won't be able to send any RFC before end of April.
>>
>> Should be possible to have something during May.
> 
> That's good enough, thanks.  I'll see what is the best I can do.
> 
> Then do you think I can leave p4d/pgd leaves alone?  Please check the other
> email where I'm not sure whether pgd leaves ever existed for any of
> PowerPC.  That's so far what I plan to do: teaching pgtable walkers to
> recognize pud and lower for all leaves.  Then if Power can switch from
> hugepd to this it should just work.

Well, if I understand correctly, something with no PMD will include 
<asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
(0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
case.

And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
checked.
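
For reference, the folding amounts to something like this (condensed
from memory of the asm-generic folding headers, so double-check the
details there):

	/* asm-generic/pgtable-nopmd.h: upper levels collapse, pmd is real */
	#define PTRS_PER_PMD	1
	#define pud_populate(mm, pmd, pte)	do { } while (0)
	/* pmd_populate() is the only populate helper left that writes a
	 * real table pointer, so pmd_leaf() is the only leaf test that
	 * can ever return true on such a configuration. */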


> 
> Even if pgd exists (then something I overlooked..), I'm wondering whether
> we can push that downwards to be either pud/pmd (and looks like we all
> agree p4d is never used on Power).  That may involve some pgtable
> operations moving from pgd level to lower, e.g. my pure imagination would
> look like starting with:

Yes I think there is no doubt that p4d is never used:

arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
<asm-generic/pgtable-nop4d.h>
arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
<asm-generic/pgtable-nop4d.h>

But that means that PPC32 has pmd entries and PPC64 has p4d entries ...

> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	0
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
> 
> To:
> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	(32 - PMD_SHIFT)
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	0

But then you can no longer have #define PTRS_PER_PMD 1 from
<asm-generic/pgtable-nopmd.h>

> 
> And the rest will need care too.  I hope moving downward is easier
> (e.g. the walker should always exist for lower levels but not always for
> higher levels), but I actually have little idea on whether there are any
> other implications, so please bear with me on stupid mistakes.
> 
> I just hope pgd leaves don't exist already, then I think it'll be simpler.
> 
> Thanks,
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-04-15 19:12         ` Christophe Leroy
@ 2024-04-16 10:58           ` Christophe Leroy
  2024-04-16 19:40             ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Leroy @ 2024-04-16 10:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Jason Gunthorpe



On 15/04/2024 at 21:12, Christophe Leroy wrote:
> 
> 
> On 12/04/2024 at 16:30, Peter Xu wrote:
>> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>>
>>>
>>> On 11/04/2024 at 18:15, Peter Xu wrote:
>>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>>> This series reimplements hugepages without hugepd on powerpc 8xx.
>>>>>>
>>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>>> is not feasible as such.
>>>>>>
>>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>>
>>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
>>>>>> entries
>>>>>> must point to a single entry level-2 page table. Until now that was
>>>>>> done using hugepd. This series changes it to use standard page tables
>>>>>> where the entry is replicated 1024 times on each of the two 
>>>>>> pagetables
>>>>>> referred to by the two associated PMD entries for that 8M page.
>>>>>>
>>>>>> At the moment it has to look into each helper to know if the
>>>>>> hugepage ptep is a PTE or a PMD in order to know it is an 8M page or
>>>>>> a lower size. I hope this can be handled by core-mm in the future.
>>>>>>
>>>>>> There are probably several ways to implement stuff, so feedback is
>>>>>> very welcome.
>>>>>
>>>>> I thought it looks pretty good!
>>>>
>>>> I second it.
>>>>
>>>> I saw the discussions in patch 1.  Christophe, I suppose you're 
>>>> exploring
>>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>>> solution for nohash/32bit challenge you mentioned?
>>>>
>>>> I'm trying to position my next step; it seems like at least I should 
>>>> not
>>>> add any more hugepd code; then should I go with ARCH_HAS_HUGEPD 
>>>> checks,
>>>> or are you going to have an RFC soon that I can base on top of?
>>>
>>> Depends on what you expect by "soon".
>>>
>>> I sure won't be able to send any RFC before end of April.
>>>
>>> Should be possible to have something during May.
>>
>> That's good enough, thanks.  I'll see what is the best I can do.
>>
>> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
>> other
>> email where I'm not sure whether pgd leaves ever existed for any of
>> PowerPC.  That's so far what I plan to do: teaching pgtable walkers to
>> recognize pud and lower for all leaves.  Then if Power can switch from
>> hugepd to this it should just work.
> 
> Well, if I understand correctly, something with no PMD will include 
> <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
> entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> case.
> 
> And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
> have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> checked.
> 
> 
>>
>> Even if pgd leaves exist (then something I overlooked..), I'm wondering whether
>> we can push that downwards to be either pud/pmd (and it looks like we all
>> agree p4d is never used on Power).  That may involve some pgtable
>> operations moving from pgd level to lower; purely from imagination, it could
>> look like starting with:
> 
> Yes I think there is no doubt that p4d is never used:
> 
> arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> <asm-generic/pgtable-nop4d.h>
> arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> <asm-generic/pgtable-nop4d.h>
> 
> But that means that PPC32 has pmd entries and PPC64 has p4d entries ...
> 
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    0
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
>>
>> To:
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    0
> 
> But then you can no longer have #define PTRS_PER_PMD 1 from
> <asm-generic/pgtable-nopmd.h>
> 
>>
>> And the rest will need care too.  I hope moving downward is easier
>> (e.g. the walker should always exist for lower levels but not always for
>> higher levels), but I actually have little idea on whether there are any
>> other implications, so please bear with me on stupid mistakes.
>>
>> I just hope pgd leaves don't exist already, then I think it'll be 
>> simpler.
>>
>> Thanks,
>>

Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
("riscv/mm: fix two page table check related issues")

So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf()
and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf()?
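
Something like the below, mirroring the existing pud_leaf() stub?
(untested sketch; whether as macros or static inlines to be decided)

  /* in <asm-generic/pgtable-nopud.h> */
  static inline int p4d_leaf(p4d_t p4d) { return 0; }

  /* in <asm-generic/pgtable-nop4d.h> */
  static inline int pgd_leaf(pgd_t pgd) { return 0; }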

Christophe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
  2024-04-16 10:58           ` Christophe Leroy
@ 2024-04-16 19:40             ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-04-16 19:40 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-mm, Andrew Morton, linuxppc-dev, linux-kernel, Jason Gunthorpe

On Tue, Apr 16, 2024 at 10:58:33AM +0000, Christophe Leroy wrote:
> 
> 
> On 15/04/2024 at 21:12, Christophe Leroy wrote:
> > 
> > 
> > On 12/04/2024 at 16:30, Peter Xu wrote:
> >> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
> >>>
> >>>
> >>> On 11/04/2024 at 18:15, Peter Xu wrote:
> >>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>>>>> This series reimplements hugepages without hugepd on powerpc 8xx.
> >>>>>>
> >>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>>>>> is not feasible as such.
> >>>>>>
> >>>>>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>>>>
> >>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
> >>>>>> entries
> >>>>>> must point to a single entry level-2 page table. Until now that was
> >>>>>> done using hugepd. This series changes it to use standard page tables
> >>>>>> where the entry is replicated 1024 times on each of the two
> >>>>>> pagetables
> >>>>>> referred to by the two associated PMD entries for that 8M page.
> >>>>>>
> >>>>>> At the moment one has to look into each helper to know if the
> >>>>>> hugepage ptep is a PTE or a PMD in order to know whether it is an 8M
> >>>>>> page or a lower size. I hope this can be handled by core-mm in the future.
> >>>>>>
> >>>>>> There are probably several ways to implement stuff, so feedback is
> >>>>>> very welcome.
> >>>>>
> >>>>> I thought it looks pretty good!
> >>>>
> >>>> I second it.
> >>>>
> >>>> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> >>>> the big hammer over hugepd, and perhaps already went with the 32bit pmd
> >>>> solution for the nohash/32bit challenge you mentioned?
> >>>>
> >>>> I'm trying to position my next step; it seems like at least I should not
> >>>> add any more hugepd code. Should I go with ARCH_HAS_HUGEPD checks,
> >>>> or are you going to have an RFC soon that I can base on top of?
> >>>
> >>> Depends on what you mean by "soon".
> >>>
> >>> I sure won't be able to send any RFC before the end of April.
> >>>
> >>> Should be possible to have something during May.
> >>
> >> That's good enough, thanks.  I'll see what is the best I can do.
> >>
> >> Then do you think I can leave p4d/pgd leaves alone?  Please check the other
> >> email where I'm not sure whether pgd leaves ever existed for any
> >> PowerPC.  That's so far what I plan to do: teaching pgtable walkers to
> >> recognize pud and lower for all leaves.  Then if Power can switch from
> >> hugepd to this it should just work.
> > 
> > Well, if I understand correctly, something with no PMD will include 
> > <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
> > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> > case.
> > 
> > And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
> > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> > checked.
> > 
> > 
> >>
> >> Even if pgd leaves exist (then something I overlooked..), I'm wondering whether
> >> we can push that downwards to be either pud/pmd (and it looks like we all
> >> agree p4d is never used on Power).  That may involve some pgtable
> >> operations moving from pgd level to lower; purely from imagination, it could
> >> look like starting with:
> > 
> > Yes I think there is no doubt that p4d is never used:
> > 
> > arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> > <asm-generic/pgtable-nop4d.h>
> > arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> > <asm-generic/pgtable-nop4d.h>
> > 
> > But that means that PPC32 has pmd entries and PPC64 has p4d entries ...
> > 
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    0
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
> >>
> >> To:
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    0
> > 
> > But then you can no longer have #define PTRS_PER_PMD 1 from
> > <asm-generic/pgtable-nopmd.h>
> > 
> >>
> >> And the rest will need care too.  I hope moving downward is easier
> >> (e.g. the walker should always exist for lower levels but not always for
> >> higher levels), but I actually have little idea on whether there are any
> >> other implications, so please bear with me on stupid mistakes.
> >>
> >> I just hope pgd leaves don't exist already, then I think it'll be 
> >> simpler.
> >>
> >> Thanks,
> >>
> 
> Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
> pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
> ("riscv/mm: fix two page table check related issues")
> 
> So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() 

I think the answer should be yes for this, and..

> and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf()?

.. probably no for this?  It seems pgd is just slightly special.

Firstly, I notice that -nopmd.h actually includes -nopud.h already, and
that in turn includes -nop4d.h.  It means we can only have the options below:

  a) nopmd + nopud + nop4d
  b) nopud + nop4d
  c) nop4d

It should also mean we can't randomly choose which layer to skip with the
current header arrangement. Meanwhile, all 32bit PowerPC should perhaps
fall into a), while 64bit falls into c).  That all looks fine for now.
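
For reference, that chaining is explicit in the headers themselves,
roughly:

  /* include/asm-generic/pgtable-nopmd.h */
  #include <asm-generic/pgtable-nopud.h>

  /* include/asm-generic/pgtable-nopud.h */
  #include <asm-generic/pgtable-nop4d.h>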

The p4d_leaf()==false above should be fine when -nopud.h is included:
since that already includes -nop4d.h, it means the p4d level is skipped in
the pgtable, so it doesn't make sense for p4d_leaf() ever to return true.
By the same logic, pud_leaf()==false is sane when an arch includes
-nopmd.h, as that in turn implies -nopud too.

pgd seems different, because -nop4d.h doesn't include anything else like
a "-nopgd.h", so I don't see any further implication on pgd solely from the
headers. I guess that's also why 32bit Power uses pgd+pte for the two
levels; it looks like pgd is special among the others.

However, I don't think that means we can't push pgd down to pmd, turning
pgd+pte into pmd+pte.  Though here we may want to move from:

  #include <asm-generic/pgtable-nopmd.h>
  (pmd/pud/p4d not used)

Into:

  #include <asm-generic/pgtable-nopud.h>
  (pud/p4d not used)

Then we may need to provide something similar to what's in -nopXd.h for
pgds.
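
As a very rough and untested sketch of what that could look like (purely
illustrative; no such "nopgd" header exists today), mirroring the -nopXd.h
pattern one level up:

  /* hypothetical pgd-folding stubs: pgd becomes a single-entry level,
   * so that pmd carries the real top-level entries
   */
  #define PGD_INDEX_SIZE	0
  #define PTRS_PER_PGD		1

  static inline int pgd_none(pgd_t pgd)		{ return 0; }
  static inline int pgd_bad(pgd_t pgd)		{ return 0; }
  static inline int pgd_present(pgd_t pgd)	{ return 1; }
  static inline void pgd_clear(pgd_t *pgd)	{ }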

But let's loop back to the very beginning: I don't think we have either pgd
leaves or p4d leaves for PowerPC.  Note that hugepd looks like it can exist
in pgd entries (per the wiki page [1]), however I don't think that is even
true in reality: as I mentioned elsewhere when reading __find_linux_pte(), it
always goes directly into p4d.  I strongly suspect that in reality the "pgd
hugepd entries" are actually processed by the p4d layer here:

	if (is_hugepd(__hugepd(p4d_val(p4d)))) {
		hpdp = (hugepd_t *)&p4d;
		goto out_huge;
	}

Because "nop4d" doesn't mean "there is no p4d"; what it really means is "we
have only one p4d entry (which is actually exactly the pgd entry)".  After
all, is_hugepd() works for all levels.
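
To illustrate: with the fold, <asm-generic/pgtable-nop4d.h> makes
p4d_offset() simply return its pgd argument, so a walk effectively does:

  pgd_t *pgd = pgd_offset(mm, addr);
  p4d_t *p4d = p4d_offset(pgd, addr);	/* (p4d_t *)pgd: the very same entry */

  /* so the check in __find_linux_pte() actually reads the pgd entry: */
  if (is_hugepd(__hugepd(p4d_val(*p4d))))
  	...

hence a hugepd installed in a pgd entry shows up in the p4d check above.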

Taking E500 nohash 32bit as an example, it has these:

  "pgd entry covering 2M" -> "hugepd with one hugepte covering one 4M hugepage"
                                   ^
  "pgd entry covering 2M" ---------+

I suspect this "pgd entry covering 2M" is processed by the p4d code above,
rather than at pgd level.  Then, in a future world where there's no hugepd,
it'll naturally become a pgtable page at pte level.

Thanks,

[1] https://github.com/linuxppc/wiki/wiki/Huge-pages

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-04-16 19:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25 14:55 [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Christophe Leroy
2024-03-25 14:55 ` [RFC PATCH 1/8] mm: Provide pagesize to pmd_populate() Christophe Leroy
2024-03-25 16:19   ` Jason Gunthorpe
2024-03-25 19:05     ` Christophe Leroy
2024-03-26 15:01       ` Jason Gunthorpe
2024-03-27  9:58         ` Christophe Leroy
2024-03-27 16:57           ` Jason Gunthorpe
2024-04-03 18:24             ` Christophe Leroy
2024-04-04 11:46               ` Jason Gunthorpe
2024-03-25 14:55 ` [RFC PATCH 2/8] mm: Provide page size to pte_alloc_huge() Christophe Leroy
2024-03-25 14:55 ` [RFC PATCH 3/8] mm: Provide pmd to pte_leaf_size() Christophe Leroy
2024-03-25 14:55 ` [RFC PATCH 4/8] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
2024-03-25 16:35   ` Jason Gunthorpe
2024-03-25 14:55 ` [RFC PATCH 5/8] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
2024-03-25 14:55 ` [RFC PATCH 6/8] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
2024-03-25 14:56 ` [RFC PATCH 7/8] powerpc/8xx: Remove support for 8M pages Christophe Leroy
2024-03-25 14:56 ` [RFC PATCH 8/8] powerpc/8xx: Add back support for 8M pages using contiguous PTE entries Christophe Leroy
2024-03-25 16:38 ` [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx Jason Gunthorpe
2024-04-11 16:15   ` Peter Xu
2024-04-12 14:08     ` Christophe Leroy
2024-04-12 14:30       ` Peter Xu
2024-04-15 19:12         ` Christophe Leroy
2024-04-16 10:58           ` Christophe Leroy
2024-04-16 19:40             ` Peter Xu
