* [PATCH v2 0/3] powerpc: implementation of huge pages for 8xx
@ 2016-09-16  7:40 Christophe Leroy
  2016-09-16  7:40 ` [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits Christophe Leroy
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Christophe Leroy @ 2016-09-16  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

This is v2 of the patch series implementing support of hugepages
for the 8xx.
v1 of the series also included some other fixes and
optimisations/reorganisations for the 8xx. The series has since been
split, and this part only focuses on the implementation of
hugepages.

Compared to v1, the last patch has been split into two parts.

This patch series applies on top of the series named
"Optimisation on 8xx prior to hugepage implementation".

Christophe Leroy (3):
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc: get hugetlbpage handling more generic
  powerpc/8xx: Implement support of hugepages

 arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 +++++-
 arch/powerpc/include/asm/book3s/32/pgtable.h |  43 +++---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
 arch/powerpc/include/asm/hugetlb.h           |  19 ++-
 arch/powerpc/include/asm/mmu-8xx.h           |  35 +++++
 arch/powerpc/include/asm/mmu.h               |  23 +--
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 +++++-
 arch/powerpc/include/asm/nohash/32/pgtable.h |  45 +++---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   1 +
 arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
 arch/powerpc/include/asm/nohash/pgtable.h    |   4 +
 arch/powerpc/include/asm/pgtable.h           |   2 +
 arch/powerpc/include/asm/reg_8xx.h           |   2 +-
 arch/powerpc/kernel/head_8xx.S               | 119 ++++++++++++++-
 arch/powerpc/mm/Makefile                     |   3 +-
 arch/powerpc/mm/hugetlbpage.c                | 212 ++++++++++++---------------
 arch/powerpc/mm/init-common.c                | 147 +++++++++++++++++++
 arch/powerpc/mm/init_64.c                    |  77 ----------
 arch/powerpc/mm/pgtable_32.c                 |  37 -----
 arch/powerpc/mm/tlb_nohash.c                 |  21 ++-
 arch/powerpc/platforms/8xx/Kconfig           |   1 +
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 22 files changed, 572 insertions(+), 313 deletions(-)
 create mode 100644 arch/powerpc/mm/init-common.c

-- 
2.1.0


* [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits
  2016-09-16  7:40 [PATCH v2 0/3] powerpc: implementation of huge pages for 8xx Christophe Leroy
@ 2016-09-16  7:40 ` Christophe Leroy
  2016-09-19  5:22   ` Aneesh Kumar K.V
  2016-09-16  7:40 ` [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic Christophe Leroy
  2016-09-16  7:40 ` [PATCH v2 3/3] powerpc/8xx: Implement support of hugepages Christophe Leroy
  2 siblings, 1 reply; 15+ messages in thread
From: Christophe Leroy @ 2016-09-16  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
standard pages when using 4k pages and a single pgtable_cache
when using other page sizes.

In preparation for implementing huge pages on the 8xx, this patch
replaces the powerpc32-specific handling with the 64-bit approach.

This is done by:
* moving the 64-bit pgtable_cache_add() and pgtable_cache_init()
into a new file called init-common.c
* modifying pgtable_cache_init() to also handle the case
without a PMD
* removing the 32-bit versions of pgtable_cache_add() and
pgtable_cache_init()
* copying the related header contents from the 64-bit headers into
both the book3s/32 and nohash/32 header files

On the 8xx, the following cache sizes will be used:
* 4k pages mode:
- PGT_CACHE(10) for PGD
- PGT_CACHE(3) for 512k hugepage tables
* 16k pages mode:
- PGT_CACHE(6) for PGD
- PGT_CACHE(7) for 512k hugepage tables
- PGT_CACHE(3) for 8M hugepage tables
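
For illustration only (not part of the patch), here is a rough sizing
sketch in C, assuming 4-byte table pointers in 4k pages mode; the
shift values below are illustrative, not definitions taken from the
kernel headers:

/* 4k pages mode: each PGD entry maps 4M (PGDIR_SHIFT = 22) */
#define EX_PGDIR_SHIFT     22
#define EX_PGD_INDEX       (32 - EX_PGDIR_SHIFT)  /* 10 -> PGT_CACHE(10) */
#define EX_HPD_512K_INDEX  (EX_PGDIR_SHIFT - 19)  /*  3 -> PGT_CACHE(3)  */

/* A PGT_CACHE(n) object holds 2^n pointers of 4 bytes:
 * PGD table:           4 << 10 = 4096 bytes
 * 512k hugepage table: 4 << 3  =   32 bytes, i.e. 8 pointers,
 * matching the "8 entries" described in patch 3/3.
 */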

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v2: in v1, hugepte_cache was wrongly replaced by PGT_CACHE(1).
This modification has been removed from v2.

 arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 ++++++--
 arch/powerpc/include/asm/book3s/32/pgtable.h |  43 ++++----
 arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 ++++++--
 arch/powerpc/include/asm/nohash/32/pgtable.h |  45 ++++----
 arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
 arch/powerpc/include/asm/pgtable.h           |   2 +
 arch/powerpc/mm/Makefile                     |   3 +-
 arch/powerpc/mm/init-common.c                | 147 +++++++++++++++++++++++++++
 arch/powerpc/mm/init_64.c                    |  77 --------------
 arch/powerpc/mm/pgtable_32.c                 |  37 -------
 11 files changed, 273 insertions(+), 174 deletions(-)
 create mode 100644 arch/powerpc/mm/init-common.c

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 8e21bb4..d310546 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -2,14 +2,42 @@
 #define _ASM_POWERPC_BOOK3S_32_PGALLOC_H
 
 #include <linux/threads.h>
+#include <linux/slab.h>
 
-/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
-#define MAX_PGTABLE_INDEX_SIZE	0
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
 
 extern void __bad_pte(pmd_t *pmd);
 
-extern pgd_t *pgd_alloc(struct mm_struct *mm);
-extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
+extern struct kmem_cache *pgtable_cache[];
+#define PGT_CACHE(shift) ({				\
+			BUG_ON(!(shift));		\
+			pgtable_cache[(shift) - 1];	\
+		})
+
+static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
+}
+
+static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
+{
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
+}
 
 /*
  * We don't have any real pmd's, and this code never triggers because
@@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 
 static inline void pgtable_free(void *table, unsigned index_size)
 {
-	BUG_ON(index_size); /* 32-bit doesn't use this */
-	free_page((unsigned long)table);
+	if (!index_size) {
+		free_page((unsigned long)table);
+	} else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
 #define check_pgt_cache()	do { } while (0)
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 6b8b2d5..f887499 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -8,6 +8,26 @@
 /* And here we include common definitions */
 #include <asm/pte-common.h>
 
+#define PTE_INDEX_SIZE	PTE_SHIFT
+#define PMD_INDEX_SIZE	0
+#define PUD_INDEX_SIZE	0
+#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
+
+#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
+
+#ifndef __ASSEMBLY__
+#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
+#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
+#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
+#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
+#endif	/* __ASSEMBLY__ */
+
+#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
+#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
+
+/* With 4k base page size, hugepage PTEs go at the PMD level */
+#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
+
 /*
  * The normal case is that PTEs are 32-bits and we have a 1-page
  * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
@@ -19,14 +39,10 @@
  * -Matt
  */
 /* PGDIR_SHIFT determines what a top-level page table entry can map */
-#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
+#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
 #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
-#define PTRS_PER_PTE	(1 << PTE_SHIFT)
-#define PTRS_PER_PMD	1
-#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
-
 #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
 /*
  * This is the bottom of the PKMAP area with HIGHMEM or an arbitrary
@@ -82,12 +98,8 @@
 
 extern unsigned long ioremap_bot;
 
-/*
- * entries per page directory level: our page-table tree is two-level, so
- * we don't really have any PMD directory.
- */
-#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
-#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
+/* Bits to mask out from a PGD to get to the PUD page */
+#define PGD_MASKED_BITS		0
 
 #define pte_ERROR(e) \
 	pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \
@@ -283,15 +295,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
 
-#ifndef CONFIG_PPC_4K_PAGES
-void pgtable_cache_init(void);
-#else
-/*
- * No page table caches to initialise
- */
-#define pgtable_cache_init()	do { } while (0)
-#endif
-
 extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
 		      pmd_t **pmdp);
 
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9fd77f8..0a46a5f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -789,9 +789,6 @@ extern struct page *pgd_page(pgd_t pgd);
 #define pgd_ERROR(e) \
 	pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
 
-void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
-void pgtable_cache_init(void);
-
 static inline int map_kernel_page(unsigned long ea, unsigned long pa,
 				  unsigned long flags)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 76d6b9e..6331392 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -2,14 +2,42 @@
 #define _ASM_POWERPC_PGALLOC_32_H
 
 #include <linux/threads.h>
+#include <linux/slab.h>
 
-/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
-#define MAX_PGTABLE_INDEX_SIZE	0
+/*
+ * Functions that deal with pagetables that could be at any level of
+ * the table need to be passed an "index_size" so they know how to
+ * handle allocation.  For PTE pages (which are linked to a struct
+ * page for now, and drawn from the main get_free_pages() pool), the
+ * allocation size will be (2^index_size * sizeof(pointer)) and
+ * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
+ *
+ * The maximum index size needs to be big enough to allow any
+ * pagetable sizes we need, but small enough to fit in the low bits of
+ * any page table pointer.  In other words all pagetables, even tiny
+ * ones, must be aligned to allow at least enough low 0 bits to
+ * contain this value.  This value is also used as a mask, so it must
+ * be one less than a power of two.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
 
 extern void __bad_pte(pmd_t *pmd);
 
-extern pgd_t *pgd_alloc(struct mm_struct *mm);
-extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
+extern struct kmem_cache *pgtable_cache[];
+#define PGT_CACHE(shift) ({				\
+			BUG_ON(!(shift));		\
+			pgtable_cache[(shift) - 1];	\
+		})
+
+static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
+}
+
+static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
+{
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
+}
 
 /*
  * We don't have any real pmd's, and this code never triggers because
@@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 
 static inline void pgtable_free(void *table, unsigned index_size)
 {
-	BUG_ON(index_size); /* 32-bit doesn't use this */
-	free_page((unsigned long)table);
+	if (!index_size) {
+		free_page((unsigned long)table);
+	} else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
 #define check_pgt_cache()	do { } while (0)
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index c219ef7..8cbe222 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -16,6 +16,26 @@ extern int icache_44x_need_flush;
 
 #endif /* __ASSEMBLY__ */
 
+#define PTE_INDEX_SIZE	PTE_SHIFT
+#define PMD_INDEX_SIZE	0
+#define PUD_INDEX_SIZE	0
+#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
+
+#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
+
+#ifndef __ASSEMBLY__
+#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
+#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
+#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
+#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
+#endif	/* __ASSEMBLY__ */
+
+#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
+#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
+
+/* With 4k base page size, hugepage PTEs go at the PMD level */
+#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
+
 /*
  * The normal case is that PTEs are 32-bits and we have a 1-page
  * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
@@ -27,22 +47,12 @@ extern int icache_44x_need_flush;
  * -Matt
  */
 /* PGDIR_SHIFT determines what a top-level page table entry can map */
-#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
+#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
 #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
-/*
- * entries per page directory level: our page-table tree is two-level, so
- * we don't really have any PMD directory.
- */
-#ifndef __ASSEMBLY__
-#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
-#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
-#endif	/* __ASSEMBLY__ */
-
-#define PTRS_PER_PTE	(1 << PTE_SHIFT)
-#define PTRS_PER_PMD	1
-#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
+/* Bits to mask out from a PGD to get to the PUD page */
+#define PGD_MASKED_BITS		0
 
 #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
 #define FIRST_USER_ADDRESS	0UL
@@ -328,15 +338,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
 
-#ifndef CONFIG_PPC_4K_PAGES
-void pgtable_cache_init(void);
-#else
-/*
- * No page table caches to initialise
- */
-#define pgtable_cache_init()	do { } while (0)
-#endif
-
 extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
 		      pmd_t **pmdp);
 
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 653a183..619018a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -358,8 +358,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __swp_entry_to_pte(x)		__pte((x).val)
 
-void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
-void pgtable_cache_init(void);
 extern int map_kernel_page(unsigned long ea, unsigned long pa,
 			   unsigned long flags);
 extern int __meminit vmemmap_create_mapping(unsigned long start,
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9bd87f2..dd01212 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -78,6 +78,8 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 
 unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
+void pgtable_cache_init(void);
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 1a4e570..e8a86d2 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -7,7 +7,8 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 ccflags-$(CONFIG_PPC64)	:= $(NO_MINIMAL_TOC)
 
 obj-y				:= fault.o mem.o pgtable.o mmap.o \
-				   init_$(BITS).o pgtable_$(BITS).o
+				   init_$(BITS).o pgtable_$(BITS).o \
+				   init-common.o
 obj-$(CONFIG_PPC_MMU_NOHASH)	+= mmu_context_nohash.o tlb_nohash.o \
 				   tlb_nohash_low.o
 obj-$(CONFIG_PPC_BOOK3E)	+= tlb_low_$(BITS)e.o
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
new file mode 100644
index 0000000..ab2b947
--- /dev/null
+++ b/arch/powerpc/mm/init-common.c
@@ -0,0 +1,147 @@
+/*
+ *  PowerPC version
+ *    Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org)
+ *
+ *  Modifications by Paul Mackerras (PowerMac) (paulus@cs.anu.edu.au)
+ *  and Cort Dougan (PReP) (cort@cs.nmt.edu)
+ *    Copyright (C) 1996 Paul Mackerras
+ *
+ *  Derived from "arch/i386/mm/init.c"
+ *    Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
+ *
+ *  Dave Engebretsen <engebret@us.ibm.com>
+ *      Rework for PPC64 port.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ */
+
+#undef DEBUG
+
+#include <linux/signal.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/stddef.h>
+#include <linux/vmalloc.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/highmem.h>
+#include <linux/idr.h>
+#include <linux/nodemask.h>
+#include <linux/module.h>
+#include <linux/poison.h>
+#include <linux/memblock.h>
+#include <linux/hugetlb.h>
+#include <linux/slab.h>
+
+#include <asm/pgalloc.h>
+#include <asm/page.h>
+#include <asm/prom.h>
+#include <asm/rtas.h>
+#include <asm/io.h>
+#include <asm/mmu_context.h>
+#include <asm/pgtable.h>
+#include <asm/mmu.h>
+#include <asm/uaccess.h>
+#include <asm/smp.h>
+#include <asm/machdep.h>
+#include <asm/tlb.h>
+#include <asm/eeh.h>
+#include <asm/processor.h>
+#include <asm/mmzone.h>
+#include <asm/cputable.h>
+#include <asm/sections.h>
+#include <asm/iommu.h>
+#include <asm/vdso.h>
+
+#include "mmu_decl.h"
+
+static void pgd_ctor(void *addr)
+{
+	memset(addr, 0, PGD_TABLE_SIZE);
+}
+
+static void pud_ctor(void *addr)
+{
+	memset(addr, 0, PUD_TABLE_SIZE);
+}
+
+static void pmd_ctor(void *addr)
+{
+	memset(addr, 0, PMD_TABLE_SIZE);
+}
+
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+/*
+ * Create a kmem_cache() for pagetables.  This is not used for PTE
+ * pages - they're linked to struct page, come from the normal free
+ * pages pool and have a different entry size (see real_pte_t) to
+ * everything else.  Caches created by this function are used for all
+ * the higher level pagetables, and for hugepage pagetables.
+ */
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it.
+	 *
+	 * Likewise, hugeapge pagetable pointers contain a (different)
+	 * shift value in the low bits.  All tables must be aligned so
+	 * as to leave enough 0 bits in the address to contain it. */
+	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
+				     HUGEPD_SHIFT_MASK + 1);
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	kfree(name);
+	pgtable_cache[shift - 1] = new;
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
+
+void pgtable_cache_init(void)
+{
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+
+	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
+		pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
+	/*
+	 * In all current configs, when the PUD index exists it's the
+	 * same size as either the pgd or pmd index except with THP enabled
+	 * on book3s 64
+	 */
+	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
+		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
+
+	if (!PGT_CACHE(PGD_INDEX_SIZE))
+		panic("Couldn't allocate pgd cache");
+	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pmd pgtable caches");
+	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
+		panic("Couldn't allocate pud pgtable caches");
+}
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 16ada1e..a000c35 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -80,83 +80,6 @@ EXPORT_SYMBOL_GPL(memstart_addr);
 phys_addr_t kernstart_addr;
 EXPORT_SYMBOL_GPL(kernstart_addr);
 
-static void pgd_ctor(void *addr)
-{
-	memset(addr, 0, PGD_TABLE_SIZE);
-}
-
-static void pud_ctor(void *addr)
-{
-	memset(addr, 0, PUD_TABLE_SIZE);
-}
-
-static void pmd_ctor(void *addr)
-{
-	memset(addr, 0, PMD_TABLE_SIZE);
-}
-
-struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
-
-/*
- * Create a kmem_cache() for pagetables.  This is not used for PTE
- * pages - they're linked to struct page, come from the normal free
- * pages pool and have a different entry size (see real_pte_t) to
- * everything else.  Caches created by this function are used for all
- * the higher level pagetables, and for hugepage pagetables.
- */
-void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
-{
-	char *name;
-	unsigned long table_size = sizeof(void *) << shift;
-	unsigned long align = table_size;
-
-	/* When batching pgtable pointers for RCU freeing, we store
-	 * the index size in the low bits.  Table alignment must be
-	 * big enough to fit it.
-	 *
-	 * Likewise, hugeapge pagetable pointers contain a (different)
-	 * shift value in the low bits.  All tables must be aligned so
-	 * as to leave enough 0 bits in the address to contain it. */
-	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
-				     HUGEPD_SHIFT_MASK + 1);
-	struct kmem_cache *new;
-
-	/* It would be nice if this was a BUILD_BUG_ON(), but at the
-	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
-	 * constant expression, so so much for that. */
-	BUG_ON(!is_power_of_2(minalign));
-	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
-
-	if (PGT_CACHE(shift))
-		return; /* Already have a cache of this size */
-
-	align = max_t(unsigned long, align, minalign);
-	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
-	new = kmem_cache_create(name, table_size, align, 0, ctor);
-	kfree(name);
-	pgtable_cache[shift - 1] = new;
-	pr_debug("Allocated pgtable cache for order %d\n", shift);
-}
-
-
-void pgtable_cache_init(void)
-{
-	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
-	pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
-	/*
-	 * In all current configs, when the PUD index exists it's the
-	 * same size as either the pgd or pmd index except with THP enabled
-	 * on book3s 64
-	 */
-	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
-		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
-
-	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
-		panic("Couldn't allocate pgtable caches");
-	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
-		panic("Couldn't allocate pud pgtable caches");
-}
-
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Given an address within the vmemmap, determine the pfn of the page that
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 0ae0572..a65c0b4 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -42,43 +42,6 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
 
 extern char etext[], _stext[], _sinittext[], _einittext[];
 
-#define PGDIR_ORDER	(32 + PGD_T_LOG2 - PGDIR_SHIFT)
-
-#ifndef CONFIG_PPC_4K_PAGES
-static struct kmem_cache *pgtable_cache;
-
-void pgtable_cache_init(void)
-{
-	pgtable_cache = kmem_cache_create("PGDIR cache", 1 << PGDIR_ORDER,
-					  1 << PGDIR_ORDER, 0, NULL);
-	if (pgtable_cache == NULL)
-		panic("Couldn't allocate pgtable caches");
-}
-#endif
-
-pgd_t *pgd_alloc(struct mm_struct *mm)
-{
-	pgd_t *ret;
-
-	/* pgdir take page or two with 4K pages and a page fraction otherwise */
-#ifndef CONFIG_PPC_4K_PAGES
-	ret = kmem_cache_alloc(pgtable_cache, GFP_KERNEL | __GFP_ZERO);
-#else
-	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO,
-			PGDIR_ORDER - PAGE_SHIFT);
-#endif
-	return ret;
-}
-
-void pgd_free(struct mm_struct *mm, pgd_t *pgd)
-{
-#ifndef CONFIG_PPC_4K_PAGES
-	kmem_cache_free(pgtable_cache, (void *)pgd);
-#else
-	free_pages((unsigned long)pgd, PGDIR_ORDER - PAGE_SHIFT);
-#endif
-}
-
 __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
 	pte_t *pte;
-- 
2.1.0


* [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-16  7:40 [PATCH v2 0/3] powerpc: implementation of huge pages for 8xx Christophe Leroy
  2016-09-16  7:40 ` [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits Christophe Leroy
@ 2016-09-16  7:40 ` Christophe Leroy
  2016-09-19  5:45   ` Aneesh Kumar K.V
  2016-09-19  5:50   ` Aneesh Kumar K.V
  2016-09-16  7:40 ` [PATCH v2 3/3] powerpc/8xx: Implement support of hugepages Christophe Leroy
  2 siblings, 2 replies; 15+ messages in thread
From: Christophe Leroy @ 2016-09-16  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

Today there are two implementations of hugetlbpages which are managed
by exclusive #ifdefs:
* FSL_BOOKE: several directory entries point to the same single hugepage
* BOOK3S: one upper level directory entry points to a table of hugepages

In preparation for implementing hugepage support on the 8xx, we need
a mix of the two above solutions, because the 8xx needs both cases
depending on the page size:
* In 4k page size mode, each PGD entry covers a 4M area. This means
that 2 PGD entries are necessary to cover an 8M hugepage, while a
single PGD entry covers 8x 512k hugepages.
* In 16k page size mode, each PGD entry covers a 64M area. This means
that 8x 8M hugepages are covered by one PGD entry and 64x 512k
hugepages are covered by one PGD entry.
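
As a sketch of where this converges (mirroring the __hugepte_alloc()
hunk below, with illustrative pdshift/pshift values for 4k pages
mode):

	if (pshift >= pdshift) {
		/* e.g. 8M hugepage (pshift = 23) under 4M PGD entries
		 * (pdshift = 22): 1 << (23 - 22) = 2 PGD entries point
		 * to the same hugepte table from hugepte_cache.
		 */
		cachep = hugepte_cache;
		num_hugepd = 1 << (pshift - pdshift);
	} else {
		/* e.g. 512k hugepage (pshift = 19) under 4M PGD entries
		 * (pdshift = 22): one PGD entry points to a
		 * PGT_CACHE(22 - 19) table of 8 hugeptes.
		 */
		cachep = PGT_CACHE(pdshift - pshift);
		num_hugepd = 1;
	}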

This patch:
* removes #ifdefs in favor of if/else based on the range sizes
* merges the two huge_pte_alloc() functions as they are pretty similar
* merges the two hugetlbpage_init() functions as they are pretty similar

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v2: This part is new and results from splitting the last patch of the
v1 series into two parts

 arch/powerpc/mm/hugetlbpage.c | 189 +++++++++++++++++-------------------------
 1 file changed, 77 insertions(+), 112 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8a512b1..2119f00 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -64,14 +64,16 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
-
-#ifdef CONFIG_PPC_FSL_BOOK3E
 	int i;
-	int num_hugepd = 1 << (pshift - pdshift);
-	cachep = hugepte_cache;
-#else
-	cachep = PGT_CACHE(pdshift - pshift);
-#endif
+	int num_hugepd;
+
+	if (pshift >= pdshift) {
+		cachep = hugepte_cache;
+		num_hugepd = 1 << (pshift - pdshift);
+	} else {
+		cachep = PGT_CACHE(pdshift - pshift);
+		num_hugepd = 1;
+	}
 
 	new = kmem_cache_zalloc(cachep, GFP_KERNEL);
 
@@ -89,7 +91,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	smp_wmb();
 
 	spin_lock(&mm->page_table_lock);
-#ifdef CONFIG_PPC_FSL_BOOK3E
+
 	/*
 	 * We have multiple higher-level entries that point to the same
 	 * actual pte location.  Fill in each as we go and backtrack on error.
@@ -100,8 +102,13 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 		if (unlikely(!hugepd_none(*hpdp)))
 			break;
 		else
+#ifdef CONFIG_PPC_BOOK3S_64
+			hpdp->pd = __pa(new) |
+				   (shift_to_mmu_psize(pshift) << 2);
+#else
 			/* We use the old format for PPC_FSL_BOOK3E */
 			hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
+#endif
 	}
 	/* If we bailed from the for loop early, an error occurred, clean up */
 	if (i < num_hugepd) {
@@ -109,17 +116,6 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			hpdp->pd = 0;
 		kmem_cache_free(cachep, new);
 	}
-#else
-	if (!hugepd_none(*hpdp))
-		kmem_cache_free(cachep, new);
-	else {
-#ifdef CONFIG_PPC_BOOK3S_64
-		hpdp->pd = __pa(new) | (shift_to_mmu_psize(pshift) << 2);
-#else
-		hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
-#endif
-	}
-#endif
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -136,7 +132,6 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
 /*
  * At this point we do the placement change only for BOOK3S 64. This would
  * possibly work on other subarchs.
@@ -153,6 +148,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	addr &= ~(sz-1);
 	pg = pgd_offset(mm, addr);
 
+#ifdef CONFIG_PPC_BOOK3S_64
 	if (pshift == PGDIR_SHIFT)
 		/* 16GB huge page */
 		return (pte_t *) pg;
@@ -178,32 +174,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 				hpdp = (hugepd_t *)pm;
 		}
 	}
-	if (!hpdp)
-		return NULL;
-
-	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
-		return NULL;
-
-	return hugepte_offset(*hpdp, addr, pdshift);
-}
-
 #else
-
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	unsigned pshift = __ffs(sz);
-	unsigned pdshift = PGDIR_SHIFT;
-
-	addr &= ~(sz-1);
-
-	pg = pgd_offset(mm, addr);
-
 	if (pshift >= HUGEPD_PGD_SHIFT) {
 		hpdp = (hugepd_t *)pg;
 	} else {
@@ -217,7 +188,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 			hpdp = (hugepd_t *)pm;
 		}
 	}
-
+#endif
 	if (!hpdp)
 		return NULL;
 
@@ -228,7 +199,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 
 	return hugepte_offset(*hpdp, addr, pdshift);
 }
-#endif
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -310,7 +280,11 @@ static int __init do_gpage_early_setup(char *param, char *val,
 				npages = 0;
 			if (npages > MAX_NUMBER_GPAGES) {
 				pr_warn("MMU: %lu pages requested for page "
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
 					"size %llu KB, limiting to "
+#else
+					"size %u KB, limiting to "
+#endif
 					__stringify(MAX_NUMBER_GPAGES) "\n",
 					npages, size / 1024);
 				npages = MAX_NUMBER_GPAGES;
@@ -442,6 +416,12 @@ static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
 	}
 	put_cpu_var(hugepd_freelist_cur);
 }
+#else
+static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
+{
+	BUG();
+}
+
 #endif
 
 static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
@@ -453,13 +433,11 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 
 	unsigned long pdmask = ~((1UL << pdshift) - 1);
 	unsigned int num_hugepd = 1;
+	unsigned int shift = hugepd_shift(*hpdp);
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
 	/* Note: On fsl the hpdp may be the first of several */
-	num_hugepd = (1 << (hugepd_shift(*hpdp) - pdshift));
-#else
-	unsigned int shift = hugepd_shift(*hpdp);
-#endif
+	if (shift > pdshift)
+		num_hugepd = 1 << (shift - pdshift);
 
 	start &= pdmask;
 	if (start < floor)
@@ -475,11 +453,10 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 	for (i = 0; i < num_hugepd; i++, hpdp++)
 		hpdp->pd = 0;
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
-	hugepd_free(tlb, hugepte);
-#else
-	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
-#endif
+	if (shift >= pdshift)
+		hugepd_free(tlb, hugepte);
+	else
+		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -492,6 +469,8 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 
 	start = addr;
 	do {
+		unsigned long more;
+
 		pmd = pmd_offset(pud, addr);
 		next = pmd_addr_end(addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
@@ -502,15 +481,16 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 			WARN_ON(!pmd_none_or_clear_bad(pmd));
 			continue;
 		}
-#ifdef CONFIG_PPC_FSL_BOOK3E
 		/*
 		 * Increment next by the size of the huge mapping since
 		 * there may be more than one entry at this level for a
 		 * single hugepage, but all of them point to
 		 * the same kmem cache that holds the hugepte.
 		 */
-		next = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
-#endif
+		more = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
+		if (more > next)
+			next = more;
+
 		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
 				  addr, next, floor, ceiling);
 	} while (addr = next, addr != end);
@@ -550,15 +530,17 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
 					       ceiling);
 		} else {
-#ifdef CONFIG_PPC_FSL_BOOK3E
+			unsigned long more;
 			/*
 			 * Increment next by the size of the huge mapping since
 			 * there may be more than one entry at this level for a
 			 * single hugepage, but all of them point to
 			 * the same kmem cache that holds the hugepte.
 			 */
-			next = addr + (1 << hugepd_shift(*(hugepd_t *)pud));
-#endif
+			more = addr + (1 << hugepd_shift(*(hugepd_t *)pud));
+			if (more > next)
+				next = more;
+
 			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
 					  addr, next, floor, ceiling);
 		}
@@ -615,15 +597,17 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-#ifdef CONFIG_PPC_FSL_BOOK3E
+			unsigned long more;
 			/*
 			 * Increment next by the size of the huge mapping since
 			 * there may be more than one entry at the pgd level
 			 * for a single hugepage, but all of them point to the
 			 * same kmem cache that holds the hugepte.
 			 */
-			next = addr + (1 << hugepd_shift(*(hugepd_t *)pgd));
-#endif
+			more = addr + (1 << hugepd_shift(*(hugepd_t *)pgd));
+			if (more > next)
+				next = more;
+
 			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
 					  addr, next, floor, ceiling);
 		}
@@ -753,12 +737,13 @@ static int __init add_huge_page_size(unsigned long long size)
 
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable and slice limits. */
+	if (size <= PAGE_SIZE)
+		return -EINVAL;
 #ifdef CONFIG_PPC_FSL_BOOK3E
-	if ((size < PAGE_SIZE) || !is_power_of_4(size))
+	if (!is_power_of_4(size))
 		return -EINVAL;
 #else
-	if (!is_power_of_2(size)
-	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+	if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT))
 		return -EINVAL;
 #endif
 
@@ -791,53 +776,15 @@ static int __init hugepage_setup_sz(char *str)
 }
 __setup("hugepagesz=", hugepage_setup_sz);
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
 struct kmem_cache *hugepte_cache;
 static int __init hugetlbpage_init(void)
 {
 	int psize;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		unsigned shift;
-
-		if (!mmu_psize_defs[psize].shift)
-			continue;
-
-		shift = mmu_psize_to_shift(psize);
-
-		/* Don't treat normal page sizes as huge... */
-		if (shift != PAGE_SHIFT)
-			if (add_huge_page_size(1ULL << shift) < 0)
-				continue;
-	}
-
-	/*
-	 * Create a kmem cache for hugeptes.  The bottom bits in the pte have
-	 * size information encoded in them, so align them to allow this
-	 */
-	hugepte_cache =  kmem_cache_create("hugepte-cache", sizeof(pte_t),
-					   HUGEPD_SHIFT_MASK + 1, 0, NULL);
-	if (hugepte_cache == NULL)
-		panic("%s: Unable to create kmem cache for hugeptes\n",
-		      __func__);
-
-	/* Default hpage size = 4M */
-	if (mmu_psize_defs[MMU_PAGE_4M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
-	else
-		panic("%s: Unable to set default huge page size\n", __func__);
-
-
-	return 0;
-}
-#else
-static int __init hugetlbpage_init(void)
-{
-	int psize;
-
+#if !defined(CONFIG_PPC_FSL_BOOK3E)
 	if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
 		return -ENODEV;
-
+#endif
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		unsigned shift;
 		unsigned pdshift;
@@ -860,16 +807,31 @@ static int __init hugetlbpage_init(void)
 		 * if we have pdshift and shift value same, we don't
 		 * use pgt cache for hugepd.
 		 */
-		if (pdshift != shift) {
+		if (pdshift > shift) {
 			pgtable_cache_add(pdshift - shift, NULL);
 			if (!PGT_CACHE(pdshift - shift))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n", shift);
+		} else if (!hugepte_cache) {
+			/*
+			 * Create a kmem cache for hugeptes.  The bottom bits in
+			 * the pte have size information encoded in them, so
+			 * align them to allow this
+			 */
+			hugepte_cache = kmem_cache_create("hugepte-cache",
+							  sizeof(pte_t),
+							  HUGEPD_SHIFT_MASK + 1,
+							  0, NULL);
+			if (hugepte_cache == NULL)
+				panic("%s: Unable to create kmem cache "
+				      "for hugeptes\n", __func__);
+
 		}
 	}
 
 	/* Set default large page size. Currently, we pick 16M or 1M
 	 * depending on what is available
+	 * We select 4M on other ones.
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
@@ -877,11 +839,14 @@ static int __init hugetlbpage_init(void)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
 	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
-
+	else if (mmu_psize_defs[MMU_PAGE_4M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
+	else
+		panic("%s: Unable to set default huge page size\n", __func__);
 
 	return 0;
 }
-#endif
+
 arch_initcall(hugetlbpage_init);
 
 void flush_dcache_icache_hugepage(struct page *page)
-- 
2.1.0


* [PATCH v2 3/3] powerpc/8xx: Implement support of hugepages
  2016-09-16  7:40 [PATCH v2 0/3] powerpc: implementation of huge pages for 8xx Christophe Leroy
  2016-09-16  7:40 ` [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits Christophe Leroy
  2016-09-16  7:40 ` [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic Christophe Leroy
@ 2016-09-16  7:40 ` Christophe Leroy
  2 siblings, 0 replies; 15+ messages in thread
From: Christophe Leroy @ 2016-09-16  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

The 8xx uses a two-level page table and supports two different Linux
page sizes (4k and 16k). The 8xx also supports two different hugepage
sizes, 512k and 8M. In order to support them in Linux we define two
different page table layouts.

The page size is encoded in the PGD entry, using the PS field (bits 28-29):
00 : Small pages (4k or 16k)
01 : 512k pages
10 : reserved
11 : 8M pages

For the 512k hugepage size, a pgd entry has the following format:
[<hugepte address>0101]. The hugepte table allocated will contain 8
entries pointing to 512k huge ptes in 4k pages mode and 64 entries in
16k pages mode.

For 8M in 16k mode, a pgd entry has the following format:
[<hugepte address>1101]. The hugepte table allocated will contain 8
entries pointing to 8M huge ptes.

For 8M in 4k mode, multiple pgd entries point to the same hugepte
address and each pgd entry has the following format:
[<hugepte address>1101]. The hugepte table allocated will only have
one entry.
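
To illustrate how this encoding is read back (a sketch mirroring the
asm/hugetlb.h hunk below, with _PMD_PAGE_512K = 0x0004 and
_PMD_PAGE_8M = 0x000c; the worked values are only an illustration):

	static inline unsigned int hugepd_shift(hugepd_t hpd)
	{
		/* 512k entry: (0x4 >> 1) + 17 = 19 (2^19 = 512k)
		 * 8M entry:   (0xc >> 1) + 17 = 23 (2^23 = 8M)
		 */
		return ((hpd.pd & _PMD_PAGE_MASK) >> 1) + 17;
	}

	static inline pte_t *hugepd_page(hugepd_t hpd)
	{
		/* strip the page-size and present bits to recover the
		 * hugepte table address
		 */
		return (pte_t *)__va(hpd.pd & ~(_PMD_PAGE_MASK |
						_PMD_PRESENT_MASK));
	}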

For the time being, we do not support the CPU15 erratum workaround
when HUGETLB is selected.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v2: The v1 patch was split into two parts. This part focuses on adding
the support on the 8xx. It also fixes an error in the TLB miss handlers
in the case of 8M hugepages in 16k pages mode.

 arch/powerpc/include/asm/hugetlb.h           |  19 ++++-
 arch/powerpc/include/asm/mmu-8xx.h           |  35 ++++++++
 arch/powerpc/include/asm/mmu.h               |  23 +++---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   1 +
 arch/powerpc/include/asm/nohash/pgtable.h    |   4 +
 arch/powerpc/include/asm/reg_8xx.h           |   2 +-
 arch/powerpc/kernel/head_8xx.S               | 119 +++++++++++++++++++++++++--
 arch/powerpc/mm/hugetlbpage.c                |  25 ++++--
 arch/powerpc/mm/tlb_nohash.c                 |  21 ++++-
 arch/powerpc/platforms/8xx/Kconfig           |   1 +
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 11 files changed, 223 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index c5517f4..3facdd4 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -51,12 +51,20 @@ static inline void __local_flush_hugetlb_page(struct vm_area_struct *vma,
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
 	BUG_ON(!hugepd_ok(hpd));
+#ifdef CONFIG_PPC_8xx
+	return (pte_t *)__va(hpd.pd & ~(_PMD_PAGE_MASK | _PMD_PRESENT_MASK));
+#else
 	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | PD_HUGE);
+#endif
 }
 
 static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
+#ifdef CONFIG_PPC_8xx
+	return ((hpd.pd & _PMD_PAGE_MASK) >> 1) + 17;
+#else
 	return hpd.pd & HUGEPD_SHIFT_MASK;
+#endif
 }
 
 #endif /* CONFIG_PPC_BOOK3S_64 */
@@ -99,7 +107,15 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 
 void book3e_hugetlb_preload(struct vm_area_struct *vma, unsigned long ea,
 			    pte_t pte);
+#ifdef CONFIG_PPC_8xx
+static inline void flush_hugetlb_page(struct vm_area_struct *vma,
+				      unsigned long vmaddr)
+{
+	flush_tlb_page(vma, vmaddr);
+}
+#else
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
+#endif
 
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 			    unsigned long end, unsigned long floor,
@@ -205,7 +221,8 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
  * are reserved early in the boot process by memblock instead of via
  * the .dts as on IBM platforms.
  */
-#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_PPC_FSL_BOOK3E)
+#if defined(CONFIG_HUGETLB_PAGE) && (defined(CONFIG_PPC_FSL_BOOK3E) || \
+    defined(CONFIG_PPC_8xx))
 extern void __init reserve_hugetlb_gpages(void);
 #else
 static inline void reserve_hugetlb_gpages(void)
diff --git a/arch/powerpc/include/asm/mmu-8xx.h b/arch/powerpc/include/asm/mmu-8xx.h
index 3e0e492..798b5bf 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -172,6 +172,41 @@ typedef struct {
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff80000)
 #define VIRT_IMMR_BASE (__fix_to_virt(FIX_IMMR_BASE))
+
+/* Page size definitions, common between 32 and 64-bit
+ *
+ *    shift : is the "PAGE_SHIFT" value for that page size
+ *    penc  : is the pte encoding mask
+ *
+ */
+struct mmu_psize_def {
+	unsigned int	shift;	/* number of bits */
+	unsigned int	enc;	/* PTE encoding */
+	unsigned int    ind;    /* Corresponding indirect page size shift */
+	unsigned int	flags;
+#define MMU_PAGE_SIZE_DIRECT	0x1	/* Supported as a direct size */
+#define MMU_PAGE_SIZE_INDIRECT	0x2	/* Supported as an indirect size */
+};
+
+extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
+
+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+	int psize;
+
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize)
+		if (mmu_psize_defs[psize].shift == shift)
+			return psize;
+	return -1;
+}
+
+static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
+{
+	if (mmu_psize_defs[mmu_psize].shift)
+		return mmu_psize_defs[mmu_psize].shift;
+	BUG();
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #if defined(CONFIG_PPC_4K_PAGES)
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index b78e8d3..c81aafc 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -260,19 +260,20 @@ static inline bool early_radix_enabled(void)
 #define MMU_PAGE_64K	2
 #define MMU_PAGE_64K_AP	3	/* "Admixed pages" (hash64 only) */
 #define MMU_PAGE_256K	4
-#define MMU_PAGE_1M	5
-#define MMU_PAGE_2M	6
-#define MMU_PAGE_4M	7
-#define MMU_PAGE_8M	8
-#define MMU_PAGE_16M	9
-#define MMU_PAGE_64M	10
-#define MMU_PAGE_256M	11
-#define MMU_PAGE_1G	12
-#define MMU_PAGE_16G	13
-#define MMU_PAGE_64G	14
+#define MMU_PAGE_512K	5
+#define MMU_PAGE_1M	6
+#define MMU_PAGE_2M	7
+#define MMU_PAGE_4M	8
+#define MMU_PAGE_8M	9
+#define MMU_PAGE_16M	10
+#define MMU_PAGE_64M	11
+#define MMU_PAGE_256M	12
+#define MMU_PAGE_1G	13
+#define MMU_PAGE_16G	14
+#define MMU_PAGE_64G	15
 
 /* N.B. we need to change the type of hpte_page_sizes if this gets to be > 16 */
-#define MMU_PAGE_COUNT	15
+#define MMU_PAGE_COUNT	16
 
 #ifdef CONFIG_PPC_BOOK3S_64
 #include <asm/book3s/64/mmu.h>
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 3742b19..b4df273 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -49,6 +49,7 @@
 #define _PMD_BAD	0x0ff0
 #define _PMD_PAGE_MASK	0x000c
 #define _PMD_PAGE_8M	0x000c
+#define _PMD_PAGE_512K	0x0004
 
 /* Until my rework is finished, 8xx still needs atomic PTE updates */
 #define PTE_ATOMIC_UPDATES	1
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index 1263c22..1728497 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -226,7 +226,11 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 #ifdef CONFIG_HUGETLB_PAGE
 static inline int hugepd_ok(hugepd_t hpd)
 {
+#ifdef CONFIG_PPC_8xx
+	return ((hpd.pd & 0x4) != 0);
+#else
 	return (hpd.pd > 0);
+#endif
 }
 
 static inline int pmd_huge(pmd_t pmd)
diff --git a/arch/powerpc/include/asm/reg_8xx.h b/arch/powerpc/include/asm/reg_8xx.h
index 94d01f8..feaf641 100644
--- a/arch/powerpc/include/asm/reg_8xx.h
+++ b/arch/powerpc/include/asm/reg_8xx.h
@@ -4,7 +4,7 @@
 #ifndef _ASM_POWERPC_REG_8xx_H
 #define _ASM_POWERPC_REG_8xx_H
 
-#include <asm/mmu-8xx.h>
+#include <asm/mmu.h>
 
 /* Cache control on the MPC8xx is provided through some additional
  * special purpose registers.
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index bfe4907..82d43dd 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -73,6 +73,9 @@
 #define RPN_PATTERN	0x00f0
 #endif
 
+#define PAGE_SHIFT_512K		19
+#define PAGE_SHIFT_8M		23
+
 	__HEAD
 _ENTRY(_stext);
 _ENTRY(_start);
@@ -323,7 +326,7 @@ SystemCall:
 #endif
 
 InstructionTLBMiss:
-#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
+#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE)
 	mtspr	SPRN_SPRG_SCRATCH2, r3
 #endif
 	EXCEPTION_PROLOG_0
@@ -333,10 +336,12 @@ InstructionTLBMiss:
 	 */
 	mfspr	r10, SPRN_SRR0	/* Get effective address of fault */
 	INVALIDATE_ADJACENT_PAGES_CPU15(r11, r10)
-#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
 	/* Only modules will cause ITLB Misses as we always
 	 * pin the first 8MB of kernel memory */
+#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE)
 	mfcr	r3
+#endif
+#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
 	IS_KERNEL(r11, r10)
 #endif
 	mfspr	r11, SPRN_M_TW	/* Get level 1 table */
@@ -344,7 +349,6 @@ InstructionTLBMiss:
 	BRANCH_UNLESS_KERNEL(3f)
 	lis	r11, (swapper_pg_dir-PAGE_OFFSET)@ha
 3:
-	mtcr	r3
 #endif
 	/* Insert level 1 index */
 	rlwimi	r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 29
@@ -352,14 +356,25 @@ InstructionTLBMiss:
 
 	/* Extract level 2 index */
 	rlwinm	r10, r10, 32 - (PAGE_SHIFT - 2), 32 - PAGE_SHIFT, 29
+#ifdef CONFIG_HUGETLB_PAGE
+	mtcr	r11
+	bt-	28, 10f		/* bit 28 = Large page (8M) */
+	bt-	29, 20f		/* bit 29 = Large page (8M or 512k) */
+#endif
 	rlwimi	r10, r11, 0, 0, 32 - PAGE_SHIFT - 1	/* Add level 2 base */
 	lwz	r10, 0(r10)	/* Get the pte */
-
+4:
+#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE)
+	mtcr	r3
+#endif
 	/* Insert the APG into the TWC from the Linux PTE. */
 	rlwimi	r11, r10, 0, 25, 26
 	/* Load the MI_TWC with the attributes for this "segment." */
 	MTSPR_CPU6(SPRN_MI_TWC, r11, r3)	/* Set segment attributes */
 
+#if defined (CONFIG_HUGETLB_PAGE) && defined (CONFIG_PPC_4K_PAGES)
+	rlwimi	r10, r11, 1, MI_SPS16K
+#endif
 #ifdef CONFIG_SWAP
 	rlwinm	r11, r10, 32-5, _PAGE_PRESENT
 	and	r11, r11, r10
@@ -372,16 +387,45 @@ InstructionTLBMiss:
 	 * set.  All other Linux PTE bits control the behavior
 	 * of the MMU.
 	 */
+#if defined (CONFIG_HUGETLB_PAGE) && defined (CONFIG_PPC_4K_PAGES)
+	rlwimi	r10, r11, 0, 0x0ff0	/* Set 24-27, clear 20-23 */
+#else
 	rlwimi	r10, r11, 0, 0x0ff8	/* Set 24-27, clear 20-23,28 */
+#endif
 	MTSPR_CPU6(SPRN_MI_RPN, r10, r3)	/* Update TLB entry */
 
 	/* Restore registers */
-#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC)
+#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE)
 	mfspr	r3, SPRN_SPRG_SCRATCH2
 #endif
 	EXCEPTION_EPILOG_0
 	rfi
 
+#ifdef CONFIG_HUGETLB_PAGE
+10:	/* 8M pages */
+#ifdef CONFIG_PPC_16K_PAGES
+	/* Extract level 2 index */
+	rlwinm	r10, r10, 32 - (PAGE_SHIFT_8M - PAGE_SHIFT), 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1), 29
+	/* Add level 2 base */
+	rlwimi	r10, r11, 0, 0, 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1) - 1
+#else
+	/* Level 2 base */
+	rlwinm	r10, r11, 0, ~HUGEPD_SHIFT_MASK
+#endif
+	lwz	r10, 0(r10)	/* Get the pte */
+	rlwinm	r11, r11, 0, 0xf
+	b	4b
+
+20:	/* 512k pages */
+	/* Extract level 2 index */
+	rlwinm	r10, r10, 32 - (PAGE_SHIFT_512K - PAGE_SHIFT), 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1), 29
+	/* Add level 2 base */
+	rlwimi	r10, r11, 0, 0, 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1) - 1
+	lwz	r10, 0(r10)	/* Get the pte */
+	rlwinm	r11, r11, 0, 0xf
+	b	4b
+#endif
+
 	. = 0x1200
 DataStoreTLBMiss:
 	mtspr	SPRN_SPRG_SCRATCH2, r3
@@ -408,7 +452,6 @@ _ENTRY(DTLBMiss_jmp)
 #endif
 	blt	cr7, DTLBMissLinear
 3:
-	mtcr	r3
 	mfspr	r10, SPRN_MD_EPN
 
 	/* Insert level 1 index */
@@ -419,8 +462,15 @@ _ENTRY(DTLBMiss_jmp)
 	 */
 	/* Extract level 2 index */
 	rlwinm	r10, r10, 32 - (PAGE_SHIFT - 2), 32 - PAGE_SHIFT, 29
+#ifdef CONFIG_HUGETLB_PAGE
+	mtcr	r11
+	bt-	28, 10f		/* bit 28 = Large page (8M) */
+	bt-	29, 20f		/* bit 29 = Large page (8M or 512k) */
+#endif
 	rlwimi	r10, r11, 0, 0, 32 - PAGE_SHIFT - 1	/* Add level 2 base */
 	lwz	r10, 0(r10)	/* Get the pte */
+4:
+	mtcr	r3
 
 	/* Insert the Guarded flag and APG into the TWC from the Linux PTE.
 	 * It is bit 26-27 of both the Linux PTE and the TWC (at least
@@ -435,6 +485,11 @@ _ENTRY(DTLBMiss_jmp)
 	rlwimi	r11, r10, 32-5, 30, 30
 	MTSPR_CPU6(SPRN_MD_TWC, r11, r3)
 
+	/* In 4k pages mode, SPS (bit 28) in RPN must match PS[1] (bit 29)
+	 * In 16k pages mode, SPS is always 1 */
+#if defined (CONFIG_HUGETLB_PAGE) && defined (CONFIG_PPC_4K_PAGES)
+	rlwimi	r10, r11, 1, MD_SPS16K
+#endif
 	/* Both _PAGE_ACCESSED and _PAGE_PRESENT has to be set.
 	 * We also need to know if the insn is a load/store, so:
 	 * Clear _PAGE_PRESENT and load that which will
@@ -456,7 +511,11 @@ _ENTRY(DTLBMiss_jmp)
 	 * of the MMU.
 	 */
 	li	r11, RPN_PATTERN
+#if defined (CONFIG_HUGETLB_PAGE) && defined (CONFIG_PPC_4K_PAGES)
+	rlwimi	r10, r11, 0, 24, 27	/* Set 24-27 */
+#else
 	rlwimi	r10, r11, 0, 24, 28	/* Set 24-27, clear 28 */
+#endif
 	rlwimi	r10, r11, 0, 20, 20	/* clear 20 */
 	MTSPR_CPU6(SPRN_MD_RPN, r10, r3)	/* Update TLB entry */
 
@@ -466,6 +525,30 @@ _ENTRY(DTLBMiss_jmp)
 	EXCEPTION_EPILOG_0
 	rfi
 
+#ifdef CONFIG_HUGETLB_PAGE
+10:	/* 8M pages */
+	/* Extract level 2 index */
+#ifdef CONFIG_PPC_16K_PAGES
+	rlwinm	r10, r10, 32 - (PAGE_SHIFT_8M - PAGE_SHIFT), 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1), 29
+	/* Add level 2 base */
+	rlwimi	r10, r11, 0, 0, 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1) - 1
+#else
+	/* Level 2 base */
+	rlwinm	r10, r11, 0, ~HUGEPD_SHIFT_MASK
+#endif
+	lwz	r10, 0(r10)	/* Get the pte */
+	rlwinm	r11, r11, 0, 0xf
+	b	4b
+
+20:	/* 512k pages */
+	/* Extract level 2 index */
+	rlwinm	r10, r10, 32 - (PAGE_SHIFT_512K - PAGE_SHIFT), 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1), 29
+	/* Add level 2 base */
+	rlwimi	r10, r11, 0, 0, 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1) - 1
+	lwz	r10, 0(r10)	/* Get the pte */
+	rlwinm	r11, r11, 0, 0xf
+	b	4b
+#endif
 
 /* This is an instruction TLB error on the MPC8xx.  This could be due
  * to many reasons, such as executing guarded memory or illegal instruction
@@ -587,6 +670,9 @@ _ENTRY(FixupDAR_cmp)
 	/* Insert level 1 index */
 3:	rlwimi	r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 29
 	lwz	r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11)	/* Get the level 1 entry */
+	mtcr	r11
+	bt	28,200f		/* bit 28 = Large page (8M) */
+	bt	29,202f		/* bit 29 = Large page (8M or 512K) */
 	rlwinm	r11, r11,0,0,19	/* Extract page descriptor page address */
 	/* Insert level 2 index */
 	rlwimi	r11, r10, 32 - (PAGE_SHIFT - 2), 32 - PAGE_SHIFT, 29
@@ -612,6 +698,27 @@ _ENTRY(FixupDAR_cmp)
 141:	mfspr	r10,SPRN_SPRG_SCRATCH2
 	b	DARFixed	/* Nope, go back to normal TLB processing */
 
+	/* concat physical page address(r11) and page offset(r10) */
+200:
+#ifdef CONFIG_PPC_16K_PAGES
+	rlwinm	r11, r11, 0, 0, 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1) - 1
+	rlwimi	r11, r10, 32 - (PAGE_SHIFT_8M - 2), 32 + PAGE_SHIFT_8M - (PAGE_SHIFT << 1), 29
+#else
+	rlwinm	r11, r10, 0, ~HUGEPD_SHIFT_MASK
+#endif
+	lwz	r11, 0(r11)	/* Get the pte */
+	/* concat physical page address(r11) and page offset(r10) */
+	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT_8M, 31
+	b	201b
+
+202:
+	rlwinm	r11, r11, 0, 0, 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1) - 1
+	rlwimi	r11, r10, 32 - (PAGE_SHIFT_512K - 2), 32 + PAGE_SHIFT_512K - (PAGE_SHIFT << 1), 29
+	lwz	r11, 0(r11)	/* Get the pte */
+	/* concat physical page address(r11) and page offset(r10) */
+	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT_512K, 31
+	b	201b
+
 144:	mfspr	r10, SPRN_DSISR
 	rlwinm	r10, r10,0,7,5	/* Clear store bit for buggy dcbst insn */
 	mtspr	SPRN_DSISR, r10
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 2119f00..8001821 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -26,6 +26,8 @@
 #ifdef CONFIG_HUGETLB_PAGE
 
 #define PAGE_SHIFT_64K	16
+#define PAGE_SHIFT_512K	19
+#define PAGE_SHIFT_8M	23
 #define PAGE_SHIFT_16M	24
 #define PAGE_SHIFT_16G	34
 
@@ -38,7 +40,7 @@ unsigned int HPAGE_SHIFT;
  * implementations may have more than one gpage size, so we need multiple
  * arrays
  */
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
 #define MAX_NUMBER_GPAGES	128
 struct psize_gpages {
 	u64 gpage_list[MAX_NUMBER_GPAGES];
@@ -105,6 +107,11 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #ifdef CONFIG_PPC_BOOK3S_64
 			hpdp->pd = __pa(new) |
 				   (shift_to_mmu_psize(pshift) << 2);
+#elif defined(CONFIG_PPC_8xx)
+			hpdp->pd = __pa(new) |
+				   (pshift == PAGE_SHIFT_8M ? _PMD_PAGE_8M :
+							      _PMD_PAGE_512K) |
+				   _PMD_PRESENT;
 #else
 			/* We use the old format for PPC_FSL_BOOK3E */
 			hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
@@ -124,7 +131,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
  * These macros define how to determine which level of the page table holds
  * the hpdp.
  */
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
 #define HUGEPD_PGD_SHIFT PGDIR_SHIFT
 #define HUGEPD_PUD_SHIFT PUD_SHIFT
 #else
@@ -200,7 +207,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	return hugepte_offset(*hpdp, addr, pdshift);
 }
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy allocator is setup.
  */
@@ -366,7 +373,7 @@ int alloc_bootmem_huge_page(struct hstate *hstate)
 }
 #endif
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
 #define HUGEPD_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
 
@@ -739,10 +746,10 @@ static int __init add_huge_page_size(unsigned long long size)
 	 * that it fits within pagetable and slice limits. */
 	if (size <= PAGE_SIZE)
 		return -EINVAL;
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E)
 	if (!is_power_of_4(size))
 		return -EINVAL;
-#else
+#elif !defined(CONFIG_PPC_8xx)
 	if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT))
 		return -EINVAL;
 #endif
@@ -781,7 +788,7 @@ static int __init hugetlbpage_init(void)
 {
 	int psize;
 
-#if !defined(CONFIG_PPC_FSL_BOOK3E)
+#if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx)
 	if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
 		return -ENODEV;
 #endif
@@ -830,7 +837,7 @@ static int __init hugetlbpage_init(void)
 	}
 
 	/* Set default large page size. Currently, we pick 16M or 1M
-	 * depending on what is available
+	 * depending on what is available. On PPC_8xx we select 512K.
 	 * We select 4M on other ones.
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
@@ -841,6 +848,8 @@ static int __init hugetlbpage_init(void)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
 	else if (mmu_psize_defs[MMU_PAGE_4M].shift)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
+	else if (mmu_psize_defs[MMU_PAGE_512K].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_512K].shift;
 	else
 		panic("%s: Unable to set default huge page size\n", __func__);
 
diff --git a/arch/powerpc/mm/tlb_nohash.c b/arch/powerpc/mm/tlb_nohash.c
index 050badc..ba28fcb 100644
--- a/arch/powerpc/mm/tlb_nohash.c
+++ b/arch/powerpc/mm/tlb_nohash.c
@@ -53,7 +53,7 @@
  * other sizes not listed here.   The .ind field is only used on MMUs that have
  * indirect page table entries.
  */
-#ifdef CONFIG_PPC_BOOK3E_MMU
+#if defined(CONFIG_PPC_BOOK3E_MMU) || defined(CONFIG_PPC_8xx)
 #ifdef CONFIG_PPC_FSL_BOOK3E
 struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_4K] = {
@@ -85,6 +85,25 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 		.enc	= BOOK3E_PAGESZ_1GB,
 	},
 };
+#elif defined(CONFIG_PPC_8xx)
+struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
+	/* we only manage 4k and 16k pages as normal pages */
+#ifdef CONFIG_PPC_4K_PAGES
+	[MMU_PAGE_4K] = {
+		.shift	= 12,
+	},
+#else
+	[MMU_PAGE_16K] = {
+		.shift	= 14,
+	},
+#endif
+	[MMU_PAGE_512K] = {
+		.shift	= 19,
+	},
+	[MMU_PAGE_8M] = {
+		.shift	= 23,
+	},
+};
 #else
 struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_4K] = {
diff --git a/arch/powerpc/platforms/8xx/Kconfig b/arch/powerpc/platforms/8xx/Kconfig
index 564d99b..80cbcb0 100644
--- a/arch/powerpc/platforms/8xx/Kconfig
+++ b/arch/powerpc/platforms/8xx/Kconfig
@@ -130,6 +130,7 @@ config 8xx_CPU6
 
 config 8xx_CPU15
 	bool "CPU15 Silicon Errata"
+	depends on !HUGETLB_PAGE
 	default y
 	help
 	  This enables a workaround for erratum CPU15 on MPC8xx chips.
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index f32edec..59887ad 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -34,6 +34,7 @@ config PPC_8xx
 	select FSL_SOC
 	select 8xx
 	select PPC_LIB_RHEAP
+	select SYS_SUPPORTS_HUGETLBFS
 
 config 40x
 	bool "AMCC 40x"
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits
  2016-09-16  7:40 ` [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits Christophe Leroy
@ 2016-09-19  5:22   ` Aneesh Kumar K.V
  2016-09-19 18:46     ` christophe leroy
  0 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2016-09-19  5:22 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

Christophe Leroy <christophe.leroy@c-s.fr> writes:

> Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
> standard pages when using 4k pages and a single pgtable_cache
> if using other size pages.
>
> In preparation of implementing huge pages on the 8xx, this patch
> replaces the specific powerpc32 handling by the 64 bits approach.
>
> This is done by:
> * moving 64 bits pgtable_cache_add() and pgtable_cache_init()
> in a new file called init-common.c
> * modifying pgtable_cache_init() to also handle the case
> without PMD
> * removing the 32 bits version of pgtable_cache_add() and
> pgtable_cache_init()
> * copying related header contents from 64 bits into both the
> book3s/32 and nohash/32 header files
>
> On the 8xx, the following cache sizes will be used:
> * 4k pages mode:
> - PGT_CACHE(10) for PGD
> - PGT_CACHE(3) for 512k hugepage tables
> * 16k pages mode:
> - PGT_CACHE(6) for PGD
> - PGT_CACHE(7) for 512k hugepage tables
> - PGT_CACHE(3) for 8M hugepage tables
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> v2: in v1, hugepte_cache was wrongly replaced by PGT_CACHE(1).
> This modification has been removed from v2.
>
>  arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 ++++++--
>  arch/powerpc/include/asm/book3s/32/pgtable.h |  43 ++++----
>  arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
>  arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 ++++++--
>  arch/powerpc/include/asm/nohash/32/pgtable.h |  45 ++++----
>  arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
>  arch/powerpc/include/asm/pgtable.h           |   2 +
>  arch/powerpc/mm/Makefile                     |   3 +-
>  arch/powerpc/mm/init-common.c                | 147 +++++++++++++++++++++++++++
>  arch/powerpc/mm/init_64.c                    |  77 --------------
>  arch/powerpc/mm/pgtable_32.c                 |  37 -------
>  11 files changed, 273 insertions(+), 174 deletions(-)
>  create mode 100644 arch/powerpc/mm/init-common.c
>
> diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> index 8e21bb4..d310546 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
> @@ -2,14 +2,42 @@
>  #define _ASM_POWERPC_BOOK3S_32_PGALLOC_H
>  
>  #include <linux/threads.h>
> +#include <linux/slab.h>
>  
> -/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
> -#define MAX_PGTABLE_INDEX_SIZE	0
> +/*
> + * Functions that deal with pagetables that could be at any level of
> + * the table need to be passed an "index_size" so they know how to
> + * handle allocation.  For PTE pages (which are linked to a struct
> + * page for now, and drawn from the main get_free_pages() pool), the
> + * allocation size will be (2^index_size * sizeof(pointer)) and
> + * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
> + *
> + * The maximum index size needs to be big enough to allow any
> + * pagetable sizes we need, but small enough to fit in the low bits of
> + * any page table pointer.  In other words all pagetables, even tiny
> + * ones, must be aligned to allow at least enough low 0 bits to
> + * contain this value.  This value is also used as a mask, so it must
> + * be one less than a power of two.
> + */
> +#define MAX_PGTABLE_INDEX_SIZE	0xf
>  
>  extern void __bad_pte(pmd_t *pmd);
>  
> -extern pgd_t *pgd_alloc(struct mm_struct *mm);
> -extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
> +extern struct kmem_cache *pgtable_cache[];
> +#define PGT_CACHE(shift) ({				\
> +			BUG_ON(!(shift));		\
> +			pgtable_cache[(shift) - 1];	\
> +		})
> +
> +static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> +{
> +	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
> +}
> +
> +static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
> +{
> +	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
> +}
>  
>  /*
>   * We don't have any real pmd's, and this code never triggers because
> @@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>  
>  static inline void pgtable_free(void *table, unsigned index_size)
>  {
> -	BUG_ON(index_size); /* 32-bit doesn't use this */
> -	free_page((unsigned long)table);
> +	if (!index_size) {
> +		free_page((unsigned long)table);
> +	} else {
> +		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
> +		kmem_cache_free(PGT_CACHE(index_size), table);
> +	}
>  }
>  
>  #define check_pgt_cache()	do { } while (0)
> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
> index 6b8b2d5..f887499 100644
> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
> @@ -8,6 +8,26 @@
>  /* And here we include common definitions */
>  #include <asm/pte-common.h>
>  
> +#define PTE_INDEX_SIZE	PTE_SHIFT
> +#define PMD_INDEX_SIZE	0
> +#define PUD_INDEX_SIZE	0
> +#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
> +
> +#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
> +
> +#ifndef __ASSEMBLY__
> +#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
> +#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
> +#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
> +#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
> +#endif	/* __ASSEMBLY__ */

Are these table sizes correct? IIUC, we will have only PGD and PTE
tables, right?


> +
> +#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
> +#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
> +
> +/* With 4k base page size, hugepage PTEs go at the PMD level */
> +#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
> +

What does that comment mean? I guess it came from a copy-paste from
other headers. I am not sure what it means there either, other than in the
64k hash config, where we place hugepage PTEs at the PMD level (i.e., no hugepd).


>  /*
>   * The normal case is that PTEs are 32-bits and we have a 1-page
>   * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
> @@ -19,14 +39,10 @@
>   * -Matt
>   */
>  /* PGDIR_SHIFT determines what a top-level page table entry can map */
> -#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
> +#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
>  #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>  
> -#define PTRS_PER_PTE	(1 << PTE_SHIFT)
> -#define PTRS_PER_PMD	1
> -#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
> -
>  #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
>  /*
>   * This is the bottom of the PKMAP area with HIGHMEM or an arbitrary
> @@ -82,12 +98,8 @@
>  
>  extern unsigned long ioremap_bot;
>  
> -/*
> - * entries per page directory level: our page-table tree is two-level, so
> - * we don't really have any PMD directory.
> - */
> -#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
> -#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
> +/* Bits to mask out from a PGD to get to the PUD page */
> +#define PGD_MASKED_BITS		0
>  
>  #define pte_ERROR(e) \
>  	pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \
> @@ -283,15 +295,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
>  #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
>  
> -#ifndef CONFIG_PPC_4K_PAGES
> -void pgtable_cache_init(void);
> -#else
> -/*
> - * No page table caches to initialise
> - */
> -#define pgtable_cache_init()	do { } while (0)
> -#endif
> -
>  extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
>  		      pmd_t **pmdp);
>  
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 9fd77f8..0a46a5f 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -789,9 +789,6 @@ extern struct page *pgd_page(pgd_t pgd);
>  #define pgd_ERROR(e) \
>  	pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
>  
> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
> -void pgtable_cache_init(void);
> -
>  static inline int map_kernel_page(unsigned long ea, unsigned long pa,
>  				  unsigned long flags)
>  {
> diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> index 76d6b9e..6331392 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
> @@ -2,14 +2,42 @@
>  #define _ASM_POWERPC_PGALLOC_32_H
>  
>  #include <linux/threads.h>
> +#include <linux/slab.h>
>  
> -/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
> -#define MAX_PGTABLE_INDEX_SIZE	0
> +/*
> + * Functions that deal with pagetables that could be at any level of
> + * the table need to be passed an "index_size" so they know how to
> + * handle allocation.  For PTE pages (which are linked to a struct
> + * page for now, and drawn from the main get_free_pages() pool), the
> + * allocation size will be (2^index_size * sizeof(pointer)) and
> + * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
> + *
> + * The maximum index size needs to be big enough to allow any
> + * pagetable sizes we need, but small enough to fit in the low bits of
> + * any page table pointer.  In other words all pagetables, even tiny
> + * ones, must be aligned to allow at least enough low 0 bits to
> + * contain this value.  This value is also used as a mask, so it must
> + * be one less than a power of two.
> + */
> +#define MAX_PGTABLE_INDEX_SIZE	0xf
>  
>  extern void __bad_pte(pmd_t *pmd);
>  
> -extern pgd_t *pgd_alloc(struct mm_struct *mm);
> -extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
> +extern struct kmem_cache *pgtable_cache[];
> +#define PGT_CACHE(shift) ({				\
> +			BUG_ON(!(shift));		\
> +			pgtable_cache[(shift) - 1];	\
> +		})
> +
> +static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> +{
> +	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
> +}
> +
> +static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
> +{
> +	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
> +}
>  
>  /*
>   * We don't have any real pmd's, and this code never triggers because
> @@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>  
>  static inline void pgtable_free(void *table, unsigned index_size)
>  {
> -	BUG_ON(index_size); /* 32-bit doesn't use this */
> -	free_page((unsigned long)table);
> +	if (!index_size) {
> +		free_page((unsigned long)table);
> +	} else {
> +		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
> +		kmem_cache_free(PGT_CACHE(index_size), table);
> +	}
>  }
>  
>  #define check_pgt_cache()	do { } while (0)
> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
> index c219ef7..8cbe222 100644
> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
> @@ -16,6 +16,26 @@ extern int icache_44x_need_flush;
>  
>  #endif /* __ASSEMBLY__ */
>  
> +#define PTE_INDEX_SIZE	PTE_SHIFT
> +#define PMD_INDEX_SIZE	0
> +#define PUD_INDEX_SIZE	0
> +#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
> +
> +#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
> +
> +#ifndef __ASSEMBLY__
> +#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
> +#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
> +#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
> +#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
> +#endif	/* __ASSEMBLY__ */
> +

Same as above, please comment on why those TABLE sizes are defined this way.


> +#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
> +#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
> +
> +/* With 4k base page size, hugepage PTEs go at the PMD level */
> +#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
> +
>  /*
>   * The normal case is that PTEs are 32-bits and we have a 1-page
>   * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
> @@ -27,22 +47,12 @@ extern int icache_44x_need_flush;
>   * -Matt
>   */
>  /* PGDIR_SHIFT determines what a top-level page table entry can map */
> -#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
> +#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
>  #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>  
> -/*
> - * entries per page directory level: our page-table tree is two-level, so
> - * we don't really have any PMD directory.
> - */
> -#ifndef __ASSEMBLY__
> -#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
> -#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
> -#endif	/* __ASSEMBLY__ */
> -
> -#define PTRS_PER_PTE	(1 << PTE_SHIFT)
> -#define PTRS_PER_PMD	1
> -#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
> +/* Bits to mask out from a PGD to get to the PUD page */
> +#define PGD_MASKED_BITS		0
>  
>  #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
>  #define FIRST_USER_ADDRESS	0UL
> @@ -328,15 +338,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
>  #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
>  
> -#ifndef CONFIG_PPC_4K_PAGES
> -void pgtable_cache_init(void);
> -#else
> -/*
> - * No page table caches to initialise
> - */
> -#define pgtable_cache_init()	do { } while (0)
> -#endif
> -
>  extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
>  		      pmd_t **pmdp);
>  
> diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
> index 653a183..619018a 100644
> --- a/arch/powerpc/include/asm/nohash/64/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
> @@ -358,8 +358,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
>  #define __swp_entry_to_pte(x)		__pte((x).val)
>  
> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
> -void pgtable_cache_init(void);
>  extern int map_kernel_page(unsigned long ea, unsigned long pa,
>  			   unsigned long flags);
>  extern int __meminit vmemmap_create_mapping(unsigned long start,
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 9bd87f2..dd01212 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -78,6 +78,8 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>  
>  unsigned long vmalloc_to_phys(void *vmalloc_addr);
>  
> +void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
> +void pgtable_cache_init(void);
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_POWERPC_PGTABLE_H */
> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 1a4e570..e8a86d2 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -7,7 +7,8 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
>  ccflags-$(CONFIG_PPC64)	:= $(NO_MINIMAL_TOC)
>  
>  obj-y				:= fault.o mem.o pgtable.o mmap.o \
> -				   init_$(BITS).o pgtable_$(BITS).o
> +				   init_$(BITS).o pgtable_$(BITS).o \
> +				   init-common.o
>  obj-$(CONFIG_PPC_MMU_NOHASH)	+= mmu_context_nohash.o tlb_nohash.o \
>  				   tlb_nohash_low.o
>  obj-$(CONFIG_PPC_BOOK3E)	+= tlb_low_$(BITS)e.o
> diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
> new file mode 100644
> index 0000000..ab2b947
> --- /dev/null
> +++ b/arch/powerpc/mm/init-common.c
> @@ -0,0 +1,147 @@
> +/*
> + *  PowerPC version
> + *    Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org)
> + *
> + *  Modifications by Paul Mackerras (PowerMac) (paulus@cs.anu.edu.au)
> + *  and Cort Dougan (PReP) (cort@cs.nmt.edu)
> + *    Copyright (C) 1996 Paul Mackerras
> + *
> + *  Derived from "arch/i386/mm/init.c"
> + *    Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
> + *
> + *  Dave Engebretsen <engebret@us.ibm.com>
> + *      Rework for PPC64 port.
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License
> + *  as published by the Free Software Foundation; either version
> + *  2 of the License, or (at your option) any later version.
> + *
> + */
> +
> +#undef DEBUG
> +
> +#include <linux/signal.h>
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/string.h>
> +#include <linux/types.h>
> +#include <linux/mman.h>
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/stddef.h>
> +#include <linux/vmalloc.h>
> +#include <linux/init.h>
> +#include <linux/delay.h>
> +#include <linux/highmem.h>
> +#include <linux/idr.h>
> +#include <linux/nodemask.h>
> +#include <linux/module.h>
> +#include <linux/poison.h>
> +#include <linux/memblock.h>
> +#include <linux/hugetlb.h>
> +#include <linux/slab.h>
> +
> +#include <asm/pgalloc.h>
> +#include <asm/page.h>
> +#include <asm/prom.h>
> +#include <asm/rtas.h>
> +#include <asm/io.h>
> +#include <asm/mmu_context.h>
> +#include <asm/pgtable.h>
> +#include <asm/mmu.h>
> +#include <asm/uaccess.h>
> +#include <asm/smp.h>
> +#include <asm/machdep.h>
> +#include <asm/tlb.h>
> +#include <asm/eeh.h>
> +#include <asm/processor.h>
> +#include <asm/mmzone.h>
> +#include <asm/cputable.h>
> +#include <asm/sections.h>
> +#include <asm/iommu.h>
> +#include <asm/vdso.h>
> +
> +#include "mmu_decl.h"


Do you need all these headers to get it compiled?


> +
> +static void pgd_ctor(void *addr)
> +{
> +	memset(addr, 0, PGD_TABLE_SIZE);
> +}
> +
> +static void pud_ctor(void *addr)
> +{
> +	memset(addr, 0, PUD_TABLE_SIZE);
> +}
> +
> +static void pmd_ctor(void *addr)
> +{
> +	memset(addr, 0, PMD_TABLE_SIZE);
> +}
> +
> +struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
> +
> +/*
> + * Create a kmem_cache() for pagetables.  This is not used for PTE
> + * pages - they're linked to struct page, come from the normal free
> + * pages pool and have a different entry size (see real_pte_t) to
> + * everything else.  Caches created by this function are used for all
> + * the higher level pagetables, and for hugepage pagetables.
> + */
> +void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
> +{
> +	char *name;
> +	unsigned long table_size = sizeof(void *) << shift;
> +	unsigned long align = table_size;
> +
> +	/* When batching pgtable pointers for RCU freeing, we store
> +	 * the index size in the low bits.  Table alignment must be
> +	 * big enough to fit it.
> +	 *
> +	 * Likewise, hugeapge pagetable pointers contain a (different)
> +	 * shift value in the low bits.  All tables must be aligned so
> +	 * as to leave enough 0 bits in the address to contain it. */
> +	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
> +				     HUGEPD_SHIFT_MASK + 1);
> +	struct kmem_cache *new;
> +
> +	/* It would be nice if this was a BUILD_BUG_ON(), but at the
> +	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
> +	 * constant expression, so so much for that. */
> +	BUG_ON(!is_power_of_2(minalign));
> +	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
> +
> +	if (PGT_CACHE(shift))
> +		return; /* Already have a cache of this size */
> +
> +	align = max_t(unsigned long, align, minalign);
> +	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
> +	new = kmem_cache_create(name, table_size, align, 0, ctor);
> +	kfree(name);
> +	pgtable_cache[shift - 1] = new;
> +	pr_debug("Allocated pgtable cache for order %d\n", shift);
> +}
> +
> +
> +void pgtable_cache_init(void)
> +{
> +	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
> +
> +	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
> +		pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
> +	/*
> +	 * In all current configs, when the PUD index exists it's the
> +	 * same size as either the pgd or pmd index except with THP enabled
> +	 * on book3s 64
> +	 */
> +	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
> +		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
> +
> +	if (!PGT_CACHE(PGD_INDEX_SIZE))
> +		panic("Couldn't allocate pgd cache");
> +	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
> +		panic("Couldn't allocate pmd pgtable caches");
> +	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
> +		panic("Couldn't allocate pud pgtable caches");
> +}
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 16ada1e..a000c35 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -80,83 +80,6 @@ EXPORT_SYMBOL_GPL(memstart_addr);
>  phys_addr_t kernstart_addr;
>  EXPORT_SYMBOL_GPL(kernstart_addr);
>  
> -static void pgd_ctor(void *addr)
> -{
> -	memset(addr, 0, PGD_TABLE_SIZE);
> -}
> -
> -static void pud_ctor(void *addr)
> -{
> -	memset(addr, 0, PUD_TABLE_SIZE);
> -}
> -
> -static void pmd_ctor(void *addr)
> -{
> -	memset(addr, 0, PMD_TABLE_SIZE);
> -}
> -
> -struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
> -
> -/*
> - * Create a kmem_cache() for pagetables.  This is not used for PTE
> - * pages - they're linked to struct page, come from the normal free
> - * pages pool and have a different entry size (see real_pte_t) to
> - * everything else.  Caches created by this function are used for all
> - * the higher level pagetables, and for hugepage pagetables.
> - */
> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
> -{
> -	char *name;
> -	unsigned long table_size = sizeof(void *) << shift;
> -	unsigned long align = table_size;
> -
> -	/* When batching pgtable pointers for RCU freeing, we store
> -	 * the index size in the low bits.  Table alignment must be
> -	 * big enough to fit it.
> -	 *
> -	 * Likewise, hugeapge pagetable pointers contain a (different)
> -	 * shift value in the low bits.  All tables must be aligned so
> -	 * as to leave enough 0 bits in the address to contain it. */
> -	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
> -				     HUGEPD_SHIFT_MASK + 1);
> -	struct kmem_cache *new;
> -
> -	/* It would be nice if this was a BUILD_BUG_ON(), but at the
> -	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
> -	 * constant expression, so so much for that. */
> -	BUG_ON(!is_power_of_2(minalign));
> -	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
> -
> -	if (PGT_CACHE(shift))
> -		return; /* Already have a cache of this size */
> -
> -	align = max_t(unsigned long, align, minalign);
> -	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
> -	new = kmem_cache_create(name, table_size, align, 0, ctor);
> -	kfree(name);
> -	pgtable_cache[shift - 1] = new;
> -	pr_debug("Allocated pgtable cache for order %d\n", shift);
> -}
> -
> -
> -void pgtable_cache_init(void)
> -{
> -	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
> -	pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
> -	/*
> -	 * In all current configs, when the PUD index exists it's the
> -	 * same size as either the pgd or pmd index except with THP enabled
> -	 * on book3s 64
> -	 */
> -	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
> -		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
> -
> -	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
> -		panic("Couldn't allocate pgtable caches");
> -	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
> -		panic("Couldn't allocate pud pgtable caches");
> -}
> -
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  /*
>   * Given an address within the vmemmap, determine the pfn of the page that
> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
> index 0ae0572..a65c0b4 100644
> --- a/arch/powerpc/mm/pgtable_32.c
> +++ b/arch/powerpc/mm/pgtable_32.c
> @@ -42,43 +42,6 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
>  
>  extern char etext[], _stext[], _sinittext[], _einittext[];
>  
> -#define PGDIR_ORDER	(32 + PGD_T_LOG2 - PGDIR_SHIFT)
> -
> -#ifndef CONFIG_PPC_4K_PAGES
> -static struct kmem_cache *pgtable_cache;
> -
> -void pgtable_cache_init(void)
> -{
> -	pgtable_cache = kmem_cache_create("PGDIR cache", 1 << PGDIR_ORDER,
> -					  1 << PGDIR_ORDER, 0, NULL);
> -	if (pgtable_cache == NULL)
> -		panic("Couldn't allocate pgtable caches");
> -}
> -#endif
> -
> -pgd_t *pgd_alloc(struct mm_struct *mm)
> -{
> -	pgd_t *ret;
> -
> -	/* pgdir take page or two with 4K pages and a page fraction otherwise */
> -#ifndef CONFIG_PPC_4K_PAGES
> -	ret = kmem_cache_alloc(pgtable_cache, GFP_KERNEL | __GFP_ZERO);
> -#else
> -	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO,
> -			PGDIR_ORDER - PAGE_SHIFT);
> -#endif
> -	return ret;
> -}
> -
> -void pgd_free(struct mm_struct *mm, pgd_t *pgd)
> -{
> -#ifndef CONFIG_PPC_4K_PAGES
> -	kmem_cache_free(pgtable_cache, (void *)pgd);
> -#else
> -	free_pages((unsigned long)pgd, PGDIR_ORDER - PAGE_SHIFT);
> -#endif
> -}
> -
>  __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
>  {
>  	pte_t *pte;
> -- 
> 2.1.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-16  7:40 ` [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic Christophe Leroy
@ 2016-09-19  5:45   ` Aneesh Kumar K.V
  2016-09-19 18:32     ` christophe leroy
  2016-09-21  6:13     ` Christophe Leroy
  2016-09-19  5:50   ` Aneesh Kumar K.V
  1 sibling, 2 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2016-09-19  5:45 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

Christophe Leroy <christophe.leroy@c-s.fr> writes:

> Today there are two implementations of hugetlbpages which are managed
> by exclusive #ifdefs:
> * FSL_BOOKE: several directory entries point to the same single hugepage
> * BOOK3S: one upper level directory entry points to a table of hugepages
>
> In preparation of implementation of hugepage support on the 8xx, we
> need a mix of the two above solutions, because the 8xx needs both cases
> depending on the size of pages:
> * In 4k page size mode, each PGD entry covers a 4M bytes area. It means
> that 2 PGD entries will be necessary to cover an 8M hugepage while a
> single PGD entry will cover 8x 512k hugepages.
> * In 16k page size mode, each PGD entry covers a 64M bytes area. It means
> that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
> hugepages will be covered by one PGD entry.
>
> This patch:
> * removes #ifdefs in favor of if/else based on the range sizes
> * merges the two huge_pte_alloc() functions as they are pretty similar
> * merges the two hugetlbpage_init() functions as they are pretty similar
>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---
> v2: This part is new and results from a split of the last patch of the v1 series in
> two parts
>
>  arch/powerpc/mm/hugetlbpage.c | 189 +++++++++++++++++-------------------------
>  1 file changed, 77 insertions(+), 112 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 8a512b1..2119f00 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -64,14 +64,16 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  {
>  	struct kmem_cache *cachep;
>  	pte_t *new;
> -
> -#ifdef CONFIG_PPC_FSL_BOOK3E
>  	int i;
> -	int num_hugepd = 1 << (pshift - pdshift);
> -	cachep = hugepte_cache;
> -#else
> -	cachep = PGT_CACHE(pdshift - pshift);
> -#endif
> +	int num_hugepd;
> +
> +	if (pshift >= pdshift) {
> +		cachep = hugepte_cache;
> +		num_hugepd = 1 << (pshift - pdshift);
> +	} else {
> +		cachep = PGT_CACHE(pdshift - pshift);
> +		num_hugepd = 1;
> +	}

Is there a way to hint the likely/unlikely branch based on the page size
selected at build time?
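
(For illustration only, this is not part of the patch: one hypothetical
way to let the compiler fold the branch on configurations that can only
take one side would be to key it on a Kconfig symbol, e.g.

	/* sketch: FSL_BOOK3E always took the hugepte_cache path before */
	if (IS_ENABLED(CONFIG_PPC_FSL_BOOK3E) || pshift >= pdshift) {
		cachep = hugepte_cache;
		num_hugepd = 1 << (pshift - pdshift);
	} else {
		cachep = PGT_CACHE(pdshift - pshift);
		num_hugepd = 1;
	}

On 8xx both sides are still needed at run time, so this would only be a
build-time hint for the FSL_BOOK3E case.)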



>  
>  	new = kmem_cache_zalloc(cachep, GFP_KERNEL);
>  
> @@ -89,7 +91,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  	smp_wmb();
>  
>  	spin_lock(&mm->page_table_lock);
> -#ifdef CONFIG_PPC_FSL_BOOK3E
> +
>  	/*
>  	 * We have multiple higher-level entries that point to the same
>  	 * actual pte location.  Fill in each as we go and backtrack on error.
> @@ -100,8 +102,13 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>  		if (unlikely(!hugepd_none(*hpdp)))
>  			break;
>  		else

....

> -#ifdef CONFIG_PPC_FSL_BOOK3E
>  struct kmem_cache *hugepte_cache;
>  static int __init hugetlbpage_init(void)
>  {
>  	int psize;
>  
> -	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
> -		unsigned shift;
> -
> -		if (!mmu_psize_defs[psize].shift)
> -			continue;
> -
> -		shift = mmu_psize_to_shift(psize);
> -
> -		/* Don't treat normal page sizes as huge... */
> -		if (shift != PAGE_SHIFT)
> -			if (add_huge_page_size(1ULL << shift) < 0)
> -				continue;
> -	}
> -
> -	/*
> -	 * Create a kmem cache for hugeptes.  The bottom bits in the pte have
> -	 * size information encoded in them, so align them to allow this
> -	 */
> -	hugepte_cache =  kmem_cache_create("hugepte-cache", sizeof(pte_t),
> -					   HUGEPD_SHIFT_MASK + 1, 0, NULL);
> -	if (hugepte_cache == NULL)
> -		panic("%s: Unable to create kmem cache for hugeptes\n",
> -		      __func__);
> -
> -	/* Default hpage size = 4M */
> -	if (mmu_psize_defs[MMU_PAGE_4M].shift)
> -		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
> -	else
> -		panic("%s: Unable to set default huge page size\n", __func__);
> -
> -
> -	return 0;
> -}
> -#else
> -static int __init hugetlbpage_init(void)
> -{
> -	int psize;
> -
> +#if !defined(CONFIG_PPC_FSL_BOOK3E)
>  	if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
>  		return -ENODEV;
> -
> +#endif

Do we need that #if? radix_enabled() should become 0 and that if
condition should be removed at compile time, shouldn't it? Or are you
finding errors with that?


>  	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
>  		unsigned shift;
>  		unsigned pdshift;
> @@ -860,16 +807,31 @@ static int __init hugetlbpage_init(void)
>  		 * if we have pdshift and shift value same, we don't
>  		 * use pgt cache for hugepd.
>  		 */
> -		if (pdshift != shift) {
> +		if (pdshift > shift) {
>  			pgtable_cache_add(pdshift - shift, NULL);
>  			if (!PGT_CACHE(pdshift - shift))
>  				panic("hugetlbpage_init(): could not create "
>  				      "pgtable cache for %d bit pagesize\n", shift);
> +		} else if (!hugepte_cache) {
> +			/*
> +			 * Create a kmem cache for hugeptes.  The bottom bits in
> +			 * the pte have size information encoded in them, so
> +			 * align them to allow this
> +			 */
> +			hugepte_cache = kmem_cache_create("hugepte-cache",
> +							  sizeof(pte_t),
> +							  HUGEPD_SHIFT_MASK + 1,
> +							  0, NULL);
> +			if (hugepte_cache == NULL)
> +				panic("%s: Unable to create kmem cache "
> +				      "for hugeptes\n", __func__);
> +


We don't need hugepte_cache for book3s 64K. I guess we will end up
creating one here?

>  		}
>  	}
>  
>  	/* Set default large page size. Currently, we pick 16M or 1M
>  	 * depending on what is available
> +	 * We select 4M on other ones.
>  	 */
>  	if (mmu_psize_defs[MMU_PAGE_16M].shift)
>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
> @@ -877,11 +839,14 @@ static int __init hugetlbpage_init(void)
>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
>  	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
> -
> +	else if (mmu_psize_defs[MMU_PAGE_4M].shift)
> +		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
> +	else
> +		panic("%s: Unable to set default huge page size\n", __func__);
>  
>  	return 0;
>  }
> -#endif
> +
>  arch_initcall(hugetlbpage_init);
>  
>  void flush_dcache_icache_hugepage(struct page *page)
> -- 
> 2.1.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-16  7:40 ` [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic Christophe Leroy
  2016-09-19  5:45   ` Aneesh Kumar K.V
@ 2016-09-19  5:50   ` Aneesh Kumar K.V
  2016-09-19 18:36     ` christophe leroy
  1 sibling, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2016-09-19  5:50 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev


Christophe Leroy <christophe.leroy@c-s.fr> writes:
> +#else
> +static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
> +{
> +	BUG();
> +}
> +
>  #endif


I was expecting that BUG() would get removed in the next patch, but I don't
see it there. Considering:

@@ -475,11 +453,10 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
        for (i = 0; i < num_hugepd; i++, hpdp++)
                hpdp->pd = 0;

-#ifdef CONFIG_PPC_FSL_BOOK3E
-	hugepd_free(tlb, hugepte);
-#else
-	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
-#endif
+	if (shift >= pdshift)
+		hugepd_free(tlb, hugepte);
+	else
+		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }

What is it that I am missing?

-aneesh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-19  5:45   ` Aneesh Kumar K.V
@ 2016-09-19 18:32     ` christophe leroy
  2016-09-20  2:45       ` Aneesh Kumar K.V
  2016-09-21  6:13     ` Christophe Leroy
  1 sibling, 1 reply; 15+ messages in thread
From: christophe leroy @ 2016-09-19 18:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev



On 19/09/2016 at 07:45, Aneesh Kumar K.V wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>
>> Today there are two implementations of hugetlbpages which are managed
>> by exclusive #ifdefs:
>> * FSL_BOOKE: several directory entries point to the same single hugepage
>> * BOOK3S: one upper level directory entry points to a table of hugepages
>>
>> In preparation of implementation of hugepage support on the 8xx, we
>> need a mix of the two above solutions, because the 8xx needs both cases
>> depending on the size of pages:
>> * In 4k page size mode, each PGD entry covers a 4M bytes area. It means
>> that 2 PGD entries will be necessary to cover an 8M hugepage while a
>> single PGD entry will cover 8x 512k hugepages.
>> * In 16k page size mode, each PGD entry covers a 64M bytes area. It means
>> that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
>> hugepages will be covered by one PGD entry.
>>
>> This patch:
>> * removes #ifdefs in favor of if/else based on the range sizes
>> * merges the two huge_pte_alloc() functions as they are pretty similar
>> * merges the two hugetlbpage_init() functions as they are pretty similar
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> v2: This part is new and results from a split of the last patch of the v1 series in
>> two parts
>>
>>  arch/powerpc/mm/hugetlbpage.c | 189 +++++++++++++++++-------------------------
>>  1 file changed, 77 insertions(+), 112 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>> index 8a512b1..2119f00 100644
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c

[...]

>
>> -#ifdef CONFIG_PPC_FSL_BOOK3E
>>  struct kmem_cache *hugepte_cache;
>>  static int __init hugetlbpage_init(void)
>>  {
>>  	int psize;
>>
>> -	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
>> -		unsigned shift;
>> -
>> -		if (!mmu_psize_defs[psize].shift)
>> -			continue;
>> -
>> -		shift = mmu_psize_to_shift(psize);
>> -
>> -		/* Don't treat normal page sizes as huge... */
>> -		if (shift != PAGE_SHIFT)
>> -			if (add_huge_page_size(1ULL << shift) < 0)
>> -				continue;
>> -	}
>> -
>> -	/*
>> -	 * Create a kmem cache for hugeptes.  The bottom bits in the pte have
>> -	 * size information encoded in them, so align them to allow this
>> -	 */
>> -	hugepte_cache =  kmem_cache_create("hugepte-cache", sizeof(pte_t),
>> -					   HUGEPD_SHIFT_MASK + 1, 0, NULL);
>> -	if (hugepte_cache == NULL)
>> -		panic("%s: Unable to create kmem cache for hugeptes\n",
>> -		      __func__);
>> -
>> -	/* Default hpage size = 4M */
>> -	if (mmu_psize_defs[MMU_PAGE_4M].shift)
>> -		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
>> -	else
>> -		panic("%s: Unable to set default huge page size\n", __func__);
>> -
>> -
>> -	return 0;
>> -}
>> -#else
>> -static int __init hugetlbpage_init(void)
>> -{
>> -	int psize;
>> -
>> +#if !defined(CONFIG_PPC_FSL_BOOK3E)
>>  	if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
>>  		return -ENODEV;
>> -
>> +#endif
>
> Do we need that #if? radix_enabled() should become 0 and that if
> condition should be removed at compile time, shouldn't it? Or are you
> finding errors with that?

With radix_enabled() being 0, it becomes:

if (!mmu_has_feature(MMU_FTR_16M_PAGE))
	return -ENODEV;

which means hugepages would only be handled by CPUs having 16M pages.
That's the issue.

>
>
>>  	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
>>  		unsigned shift;
>>  		unsigned pdshift;
>> @@ -860,16 +807,31 @@ static int __init hugetlbpage_init(void)
>>  		 * if we have pdshift and shift value same, we don't
>>  		 * use pgt cache for hugepd.
>>  		 */
>> -		if (pdshift != shift) {
>> +		if (pdshift > shift) {
>>  			pgtable_cache_add(pdshift - shift, NULL);
>>  			if (!PGT_CACHE(pdshift - shift))
>>  				panic("hugetlbpage_init(): could not create "
>>  				      "pgtable cache for %d bit pagesize\n", shift);
>> +		} else if (!hugepte_cache) {
>> +			/*
>> +			 * Create a kmem cache for hugeptes.  The bottom bits in
>> +			 * the pte have size information encoded in them, so
>> +			 * align them to allow this
>> +			 */
>> +			hugepte_cache = kmem_cache_create("hugepte-cache",
>> +							  sizeof(pte_t),
>> +							  HUGEPD_SHIFT_MASK + 1,
>> +							  0, NULL);
>> +			if (hugepte_cache == NULL)
>> +				panic("%s: Unable to create kmem cache "
>> +				      "for hugeptes\n", __func__);
>> +
>
>
> We don't need hugepte_cache for book3s 64K. I guess we will end up
> creating one here?

It should not, because on book3s 64k we will have pdshift > shift,
won't we?

>
>>  		}
>>  	}
>>
>>  	/* Set default large page size. Currently, we pick 16M or 1M
>>  	 * depending on what is available
>> +	 * We select 4M on other ones.
>>  	 */
>>  	if (mmu_psize_defs[MMU_PAGE_16M].shift)
>>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
>> @@ -877,11 +839,14 @@ static int __init hugetlbpage_init(void)
>>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
>>  	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
>>  		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
>> -
>> +	else if (mmu_psize_defs[MMU_PAGE_4M].shift)
>> +		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
>> +	else
>> +		panic("%s: Unable to set default huge page size\n", __func__);
>>
>>  	return 0;
>>  }
>> -#endif
>> +
>>  arch_initcall(hugetlbpage_init);
>>
>>  void flush_dcache_icache_hugepage(struct page *page)
>> --
>> 2.1.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-19  5:50   ` Aneesh Kumar K.V
@ 2016-09-19 18:36     ` christophe leroy
  2016-09-20  2:28         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 15+ messages in thread
From: christophe leroy @ 2016-09-19 18:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev



On 19/09/2016 at 07:50, Aneesh Kumar K.V wrote:
>
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>> +#else
>> +static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
>> +{
>> +	BUG();
>> +}
>> +
>>  #endif
>
>
> I was expecting that BUG() would get removed in the next patch, but I don't
> see it there. Considering:
>
> @@ -475,11 +453,10 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
>         for (i = 0; i < num_hugepd; i++, hpdp++)
>                 hpdp->pd = 0;
>
> -#ifdef CONFIG_PPC_FSL_BOOK3E
> -	hugepd_free(tlb, hugepte);
> -#else
> -	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
> -#endif
> +	if (shift >= pdshift)
> +		hugepd_free(tlb, hugepte);
> +	else
> +		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
>  }
>
> What is it that I am missing?
>

Previously, the call to hugepd_free() was compiled only under #ifdef
CONFIG_PPC_FSL_BOOK3E.
Now it is compiled in all cases, but it should never be called without
CONFIG_PPC_FSL_BOOK3E because we always have shift < pdshift there.
So the function needs to be defined anyway, even though it should never be
called. Should I just define it as an empty static inline?
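
Something like this, perhaps (just a sketch of the empty stub I have in
mind, assuming that case is really unreachable):

	static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte)
	{
		/* never called: this config always has shift < pdshift */
	}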

Christophe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits
  2016-09-19  5:22   ` Aneesh Kumar K.V
@ 2016-09-19 18:46     ` christophe leroy
  0 siblings, 0 replies; 15+ messages in thread
From: christophe leroy @ 2016-09-19 18:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev



On 19/09/2016 at 07:22, Aneesh Kumar K.V wrote:
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>
>> Today powerpc64 uses a set of pgtable_caches while powerpc32 uses
>> standard pages when using 4k pages and a single pgtable_cache
>> if using other size pages.
>>
>> In preparation of implementing huge pages on the 8xx, this patch
>> replaces the specific powerpc32 handling by the 64 bits approach.
>>
>> This is done by:
>> * moving 64 bits pgtable_cache_add() and pgtable_cache_init()
>> in a new file called init-common.c
>> * modifying pgtable_cache_init() to also handle the case
>> without PMD
>> * removing the 32 bits version of pgtable_cache_add() and
>> pgtable_cache_init()
>> * copying related header contents from 64 bits into both the
>> book3s/32 and nohash/32 header files
>>
>> On the 8xx, the following cache sizes will be used:
>> * 4k pages mode:
>> - PGT_CACHE(10) for PGD
>> - PGT_CACHE(3) for 512k hugepage tables
>> * 16k pages mode:
>> - PGT_CACHE(6) for PGD
>> - PGT_CACHE(7) for 512k hugepage tables
>> - PGT_CACHE(3) for 8M hugepage tables
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> v2: in v1, hugepte_cache was wrongly replaced by PGT_CACHE(1).
>> This modification has been removed from v2.
>>
>>  arch/powerpc/include/asm/book3s/32/pgalloc.h |  44 ++++++--
>>  arch/powerpc/include/asm/book3s/32/pgtable.h |  43 ++++----
>>  arch/powerpc/include/asm/book3s/64/pgtable.h |   3 -
>>  arch/powerpc/include/asm/nohash/32/pgalloc.h |  44 ++++++--
>>  arch/powerpc/include/asm/nohash/32/pgtable.h |  45 ++++----
>>  arch/powerpc/include/asm/nohash/64/pgtable.h |   2 -
>>  arch/powerpc/include/asm/pgtable.h           |   2 +
>>  arch/powerpc/mm/Makefile                     |   3 +-
>>  arch/powerpc/mm/init-common.c                | 147 +++++++++++++++++++++++++++
>>  arch/powerpc/mm/init_64.c                    |  77 --------------
>>  arch/powerpc/mm/pgtable_32.c                 |  37 -------
>>  11 files changed, 273 insertions(+), 174 deletions(-)
>>  create mode 100644 arch/powerpc/mm/init-common.c
>>
>> diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> index 8e21bb4..d310546 100644
>> --- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> +++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
>> @@ -2,14 +2,42 @@
>>  #define _ASM_POWERPC_BOOK3S_32_PGALLOC_H
>>
>>  #include <linux/threads.h>
>> +#include <linux/slab.h>
>>
>> -/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
>> -#define MAX_PGTABLE_INDEX_SIZE	0
>> +/*
>> + * Functions that deal with pagetables that could be at any level of
>> + * the table need to be passed an "index_size" so they know how to
>> + * handle allocation.  For PTE pages (which are linked to a struct
>> + * page for now, and drawn from the main get_free_pages() pool), the
>> + * allocation size will be (2^index_size * sizeof(pointer)) and
>> + * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
>> + *
>> + * The maximum index size needs to be big enough to allow any
>> + * pagetable sizes we need, but small enough to fit in the low bits of
>> + * any page table pointer.  In other words all pagetables, even tiny
>> + * ones, must be aligned to allow at least enough low 0 bits to
>> + * contain this value.  This value is also used as a mask, so it must
>> + * be one less than a power of two.
>> + */
>> +#define MAX_PGTABLE_INDEX_SIZE	0xf
>>
>>  extern void __bad_pte(pmd_t *pmd);
>>
>> -extern pgd_t *pgd_alloc(struct mm_struct *mm);
>> -extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
>> +extern struct kmem_cache *pgtable_cache[];
>> +#define PGT_CACHE(shift) ({				\
>> +			BUG_ON(!(shift));		\
>> +			pgtable_cache[(shift) - 1];	\
>> +		})
>> +
>> +static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>> +{
>> +	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
>> +}
>> +
>> +static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>> +{
>> +	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
>> +}
>>
>>  /*
>>   * We don't have any real pmd's, and this code never triggers because
>> @@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>>
>>  static inline void pgtable_free(void *table, unsigned index_size)
>>  {
>> -	BUG_ON(index_size); /* 32-bit doesn't use this */
>> -	free_page((unsigned long)table);
>> +	if (!index_size) {
>> +		free_page((unsigned long)table);
>> +	} else {
>> +		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
>> +		kmem_cache_free(PGT_CACHE(index_size), table);
>> +	}
>>  }
>>
>>  #define check_pgt_cache()	do { } while (0)
>> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
>> index 6b8b2d5..f887499 100644
>> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
>> @@ -8,6 +8,26 @@
>>  /* And here we include common definitions */
>>  #include <asm/pte-common.h>
>>
>> +#define PTE_INDEX_SIZE	PTE_SHIFT
>> +#define PMD_INDEX_SIZE	0
>> +#define PUD_INDEX_SIZE	0
>> +#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
>> +
>> +#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
>> +
>> +#ifndef __ASSEMBLY__
>> +#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
>> +#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
>> +#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
>> +#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
>> +#endif	/* __ASSEMBLY__ */
>
> Are these table sizes correct? IIUC, we will have only PGD and PTE
> tables, right?

Oops, copy/paste error.
We won't have those tables; they won't be created since PMD_INDEX_SIZE and
PUD_INDEX_SIZE are 0. But they still need to be defined to avoid compilation
errors in pmd_ctor() and pud_ctor().
Then what should I do? Define them as 0, or just keep the standard
definitions which follow?
#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PMD_INDEX_SIZE)
#define PUD_TABLE_SIZE	(sizeof(pud_t) << PUD_INDEX_SIZE)

>
>
>> +
>> +#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
>> +#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
>> +
>> +/* With 4k base page size, hugepage PTEs go at the PMD level */
>> +#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
>> +
>
> What does that comment mean? I guess it came from a copy-paste from
> other headers. I am not sure what it means there either, other than in the
> 64k hash config, where we place hugepage PTEs at the PMD level (i.e., no hugepd).

Well, I'll remove the comment from here.

>
>
>>  /*
>>   * The normal case is that PTEs are 32-bits and we have a 1-page
>>   * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
>> @@ -19,14 +39,10 @@
>>   * -Matt
>>   */
>>  /* PGDIR_SHIFT determines what a top-level page table entry can map */
>> -#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
>> +#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
>>  #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
>>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>>
>> -#define PTRS_PER_PTE	(1 << PTE_SHIFT)
>> -#define PTRS_PER_PMD	1
>> -#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
>> -
>>  #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
>>  /*
>>   * This is the bottom of the PKMAP area with HIGHMEM or an arbitrary
>> @@ -82,12 +98,8 @@
>>
>>  extern unsigned long ioremap_bot;
>>
>> -/*
>> - * entries per page directory level: our page-table tree is two-level, so
>> - * we don't really have any PMD directory.
>> - */
>> -#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
>> -#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
>> +/* Bits to mask out from a PGD to get to the PUD page */
>> +#define PGD_MASKED_BITS		0
>>
>>  #define pte_ERROR(e) \
>>  	pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \
>> @@ -283,15 +295,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
>>  #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
>>
>> -#ifndef CONFIG_PPC_4K_PAGES
>> -void pgtable_cache_init(void);
>> -#else
>> -/*
>> - * No page table caches to initialise
>> - */
>> -#define pgtable_cache_init()	do { } while (0)
>> -#endif
>> -
>>  extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
>>  		      pmd_t **pmdp);
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 9fd77f8..0a46a5f 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -789,9 +789,6 @@ extern struct page *pgd_page(pgd_t pgd);
>>  #define pgd_ERROR(e) \
>>  	pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
>>
>> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
>> -void pgtable_cache_init(void);
>> -
>>  static inline int map_kernel_page(unsigned long ea, unsigned long pa,
>>  				  unsigned long flags)
>>  {
>> diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> index 76d6b9e..6331392 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
>> @@ -2,14 +2,42 @@
>>  #define _ASM_POWERPC_PGALLOC_32_H
>>
>>  #include <linux/threads.h>
>> +#include <linux/slab.h>
>>
>> -/* For 32-bit, all levels of page tables are just drawn from get_free_page() */
>> -#define MAX_PGTABLE_INDEX_SIZE	0
>> +/*
>> + * Functions that deal with pagetables that could be at any level of
>> + * the table need to be passed an "index_size" so they know how to
>> + * handle allocation.  For PTE pages (which are linked to a struct
>> + * page for now, and drawn from the main get_free_pages() pool), the
>> + * allocation size will be (2^index_size * sizeof(pointer)) and
>> + * allocations are drawn from the kmem_cache in PGT_CACHE(index_size).
>> + *
>> + * The maximum index size needs to be big enough to allow any
>> + * pagetable sizes we need, but small enough to fit in the low bits of
>> + * any page table pointer.  In other words all pagetables, even tiny
>> + * ones, must be aligned to allow at least enough low 0 bits to
>> + * contain this value.  This value is also used as a mask, so it must
>> + * be one less than a power of two.
>> + */
>> +#define MAX_PGTABLE_INDEX_SIZE	0xf
>>
>>  extern void __bad_pte(pmd_t *pmd);
>>
>> -extern pgd_t *pgd_alloc(struct mm_struct *mm);
>> -extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
>> +extern struct kmem_cache *pgtable_cache[];
>> +#define PGT_CACHE(shift) ({				\
>> +			BUG_ON(!(shift));		\
>> +			pgtable_cache[(shift) - 1];	\
>> +		})
>> +
>> +static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>> +{
>> +	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
>> +}
>> +
>> +static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>> +{
>> +	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
>> +}
>>
>>  /*
>>   * We don't have any real pmd's, and this code never triggers because
>> @@ -68,8 +96,12 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
>>
>>  static inline void pgtable_free(void *table, unsigned index_size)
>>  {
>> -	BUG_ON(index_size); /* 32-bit doesn't use this */
>> -	free_page((unsigned long)table);
>> +	if (!index_size) {
>> +		free_page((unsigned long)table);
>> +	} else {
>> +		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
>> +		kmem_cache_free(PGT_CACHE(index_size), table);
>> +	}
>>  }
>>
>>  #define check_pgt_cache()	do { } while (0)
>> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> index c219ef7..8cbe222 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
>> @@ -16,6 +16,26 @@ extern int icache_44x_need_flush;
>>
>>  #endif /* __ASSEMBLY__ */
>>
>> +#define PTE_INDEX_SIZE	PTE_SHIFT
>> +#define PMD_INDEX_SIZE	0
>> +#define PUD_INDEX_SIZE	0
>> +#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
>> +
>> +#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
>> +
>> +#ifndef __ASSEMBLY__
>> +#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_INDEX_SIZE)
>> +#define PMD_TABLE_SIZE	(sizeof(pmd_t) << PTE_INDEX_SIZE)
>> +#define PUD_TABLE_SIZE	(sizeof(pud_t) << PTE_INDEX_SIZE)
>> +#define PGD_TABLE_SIZE	(sizeof(pgd_t) << PGD_INDEX_SIZE)
>> +#endif	/* __ASSEMBLY__ */
>> +
>
> Same, please comment on why those TABLE sizes ?
>

Yes, same error; I will fix it the same way.

>
>> +#define PTRS_PER_PTE	(1 << PTE_INDEX_SIZE)
>> +#define PTRS_PER_PGD	(1 << PGD_INDEX_SIZE)
>> +
>> +/* With 4k base page size, hugepage PTEs go at the PMD level */
>> +#define MIN_HUGEPTE_SHIFT	PMD_SHIFT
>> +
>>  /*
>>   * The normal case is that PTEs are 32-bits and we have a 1-page
>>   * 1024-entry pgdir pointing to 1-page 1024-entry PTE pages.  -- paulus
>> @@ -27,22 +47,12 @@ extern int icache_44x_need_flush;
>>   * -Matt
>>   */
>>  /* PGDIR_SHIFT determines what a top-level page table entry can map */
>> -#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_SHIFT)
>> +#define PGDIR_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
>>  #define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
>>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>>
>> -/*
>> - * entries per page directory level: our page-table tree is two-level, so
>> - * we don't really have any PMD directory.
>> - */
>> -#ifndef __ASSEMBLY__
>> -#define PTE_TABLE_SIZE	(sizeof(pte_t) << PTE_SHIFT)
>> -#define PGD_TABLE_SIZE	(sizeof(pgd_t) << (32 - PGDIR_SHIFT))
>> -#endif	/* __ASSEMBLY__ */
>> -
>> -#define PTRS_PER_PTE	(1 << PTE_SHIFT)
>> -#define PTRS_PER_PMD	1
>> -#define PTRS_PER_PGD	(1 << (32 - PGDIR_SHIFT))
>> +/* Bits to mask out from a PGD to get to the PUD page */
>> +#define PGD_MASKED_BITS		0
>>
>>  #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
>>  #define FIRST_USER_ADDRESS	0UL
>> @@ -328,15 +338,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val(pte) >> 3 })
>>  #define __swp_entry_to_pte(x)		((pte_t) { (x).val << 3 })
>>
>> -#ifndef CONFIG_PPC_4K_PAGES
>> -void pgtable_cache_init(void);
>> -#else
>> -/*
>> - * No page table caches to initialise
>> - */
>> -#define pgtable_cache_init()	do { } while (0)
>> -#endif
>> -
>>  extern int get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep,
>>  		      pmd_t **pmdp);
>>
>> diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
>> index 653a183..619018a 100644
>> --- a/arch/powerpc/include/asm/nohash/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
>> @@ -358,8 +358,6 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
>>  #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
>>  #define __swp_entry_to_pte(x)		__pte((x).val)
>>
>> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
>> -void pgtable_cache_init(void);
>>  extern int map_kernel_page(unsigned long ea, unsigned long pa,
>>  			   unsigned long flags);
>>  extern int __meminit vmemmap_create_mapping(unsigned long start,
>> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
>> index 9bd87f2..dd01212 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -78,6 +78,8 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>>
>>  unsigned long vmalloc_to_phys(void *vmalloc_addr);
>>
>> +void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
>> +void pgtable_cache_init(void);
>>  #endif /* __ASSEMBLY__ */
>>
>>  #endif /* _ASM_POWERPC_PGTABLE_H */
>> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
>> index 1a4e570..e8a86d2 100644
>> --- a/arch/powerpc/mm/Makefile
>> +++ b/arch/powerpc/mm/Makefile
>> @@ -7,7 +7,8 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
>>  ccflags-$(CONFIG_PPC64)	:= $(NO_MINIMAL_TOC)
>>
>>  obj-y				:= fault.o mem.o pgtable.o mmap.o \
>> -				   init_$(BITS).o pgtable_$(BITS).o
>> +				   init_$(BITS).o pgtable_$(BITS).o \
>> +				   init-common.o
>>  obj-$(CONFIG_PPC_MMU_NOHASH)	+= mmu_context_nohash.o tlb_nohash.o \
>>  				   tlb_nohash_low.o
>>  obj-$(CONFIG_PPC_BOOK3E)	+= tlb_low_$(BITS)e.o
>> diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
>> new file mode 100644
>> index 0000000..ab2b947
>> --- /dev/null
>> +++ b/arch/powerpc/mm/init-common.c
>> @@ -0,0 +1,147 @@
>> +/*
>> + *  PowerPC version
>> + *    Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org)
>> + *
>> + *  Modifications by Paul Mackerras (PowerMac) (paulus@cs.anu.edu.au)
>> + *  and Cort Dougan (PReP) (cort@cs.nmt.edu)
>> + *    Copyright (C) 1996 Paul Mackerras
>> + *
>> + *  Derived from "arch/i386/mm/init.c"
>> + *    Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
>> + *
>> + *  Dave Engebretsen <engebret@us.ibm.com>
>> + *      Rework for PPC64 port.
>> + *
>> + *  This program is free software; you can redistribute it and/or
>> + *  modify it under the terms of the GNU General Public License
>> + *  as published by the Free Software Foundation; either version
>> + *  2 of the License, or (at your option) any later version.
>> + *
>> + */
>> +
>> +#undef DEBUG
>> +
>> +#include <linux/signal.h>
>> +#include <linux/sched.h>
>> +#include <linux/kernel.h>
>> +#include <linux/errno.h>
>> +#include <linux/string.h>
>> +#include <linux/types.h>
>> +#include <linux/mman.h>
>> +#include <linux/mm.h>
>> +#include <linux/swap.h>
>> +#include <linux/stddef.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/init.h>
>> +#include <linux/delay.h>
>> +#include <linux/highmem.h>
>> +#include <linux/idr.h>
>> +#include <linux/nodemask.h>
>> +#include <linux/module.h>
>> +#include <linux/poison.h>
>> +#include <linux/memblock.h>
>> +#include <linux/hugetlb.h>
>> +#include <linux/slab.h>
>> +
>> +#include <asm/pgalloc.h>
>> +#include <asm/page.h>
>> +#include <asm/prom.h>
>> +#include <asm/rtas.h>
>> +#include <asm/io.h>
>> +#include <asm/mmu_context.h>
>> +#include <asm/pgtable.h>
>> +#include <asm/mmu.h>
>> +#include <asm/uaccess.h>
>> +#include <asm/smp.h>
>> +#include <asm/machdep.h>
>> +#include <asm/tlb.h>
>> +#include <asm/eeh.h>
>> +#include <asm/processor.h>
>> +#include <asm/mmzone.h>
>> +#include <asm/cputable.h>
>> +#include <asm/sections.h>
>> +#include <asm/iommu.h>
>> +#include <asm/vdso.h>
>> +
>> +#include "mmu_decl.h"
>
>
> Do you need all these headers to get it compiled ?
>

Probably not; I just copied the ones from init_64.c.
I will clean it up.
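
At first glance, something like the following should be enough for
init-common.c (untested guess):

#include <linux/kernel.h>	/* panic(), kasprintf(), max_t() */
#include <linux/string.h>	/* memset() */
#include <linux/slab.h>		/* kmem_cache_create(), kfree() */
#include <linux/log2.h>		/* is_power_of_2() */
#include <asm/pgalloc.h>	/* PGT_CACHE() and pgtable_cache[] */
#include <asm/pgtable.h>	/* PGD/PMD/PUD_TABLE_SIZE and index sizes */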

>
>> +
>> +static void pgd_ctor(void *addr)
>> +{
>> +	memset(addr, 0, PGD_TABLE_SIZE);
>> +}
>> +
>> +static void pud_ctor(void *addr)
>> +{
>> +	memset(addr, 0, PUD_TABLE_SIZE);
>> +}
>> +
>> +static void pmd_ctor(void *addr)
>> +{
>> +	memset(addr, 0, PMD_TABLE_SIZE);
>> +}
>> +
>> +struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
>> +
>> +/*
>> + * Create a kmem_cache() for pagetables.  This is not used for PTE
>> + * pages - they're linked to struct page, come from the normal free
>> + * pages pool and have a different entry size (see real_pte_t) to
>> + * everything else.  Caches created by this function are used for all
>> + * the higher level pagetables, and for hugepage pagetables.
>> + */
>> +void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
>> +{
>> +	char *name;
>> +	unsigned long table_size = sizeof(void *) << shift;
>> +	unsigned long align = table_size;
>> +
>> +	/* When batching pgtable pointers for RCU freeing, we store
>> +	 * the index size in the low bits.  Table alignment must be
>> +	 * big enough to fit it.
>> +	 *
>> +	 * Likewise, hugeapge pagetable pointers contain a (different)
>> +	 * shift value in the low bits.  All tables must be aligned so
>> +	 * as to leave enough 0 bits in the address to contain it. */
>> +	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
>> +				     HUGEPD_SHIFT_MASK + 1);
>> +	struct kmem_cache *new;
>> +
>> +	/* It would be nice if this was a BUILD_BUG_ON(), but at the
>> +	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
>> +	 * constant expression, so so much for that. */
>> +	BUG_ON(!is_power_of_2(minalign));
>> +	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
>> +
>> +	if (PGT_CACHE(shift))
>> +		return; /* Already have a cache of this size */
>> +
>> +	align = max_t(unsigned long, align, minalign);
>> +	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
>> +	new = kmem_cache_create(name, table_size, align, 0, ctor);
>> +	kfree(name);
>> +	pgtable_cache[shift - 1] = new;
>> +	pr_debug("Allocated pgtable cache for order %d\n", shift);
>> +}
>> +
>> +
>> +void pgtable_cache_init(void)
>> +{
>> +	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
>> +
>> +	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
>> +		pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
>> +	/*
>> +	 * In all current configs, when the PUD index exists it's the
>> +	 * same size as either the pgd or pmd index except with THP enabled
>> +	 * on book3s 64
>> +	 */
>> +	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
>> +		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
>> +
>> +	if (!PGT_CACHE(PGD_INDEX_SIZE))
>> +		panic("Couldn't allocate pgd cache");
>> +	if (PMD_INDEX_SIZE && !PGT_CACHE(PMD_INDEX_SIZE))
>> +		panic("Couldn't allocate pmd pgtable caches");
>> +	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
>> +		panic("Couldn't allocate pud pgtable caches");
>> +}
>> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
>> index 16ada1e..a000c35 100644
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
>> @@ -80,83 +80,6 @@ EXPORT_SYMBOL_GPL(memstart_addr);
>>  phys_addr_t kernstart_addr;
>>  EXPORT_SYMBOL_GPL(kernstart_addr);
>>
>> -static void pgd_ctor(void *addr)
>> -{
>> -	memset(addr, 0, PGD_TABLE_SIZE);
>> -}
>> -
>> -static void pud_ctor(void *addr)
>> -{
>> -	memset(addr, 0, PUD_TABLE_SIZE);
>> -}
>> -
>> -static void pmd_ctor(void *addr)
>> -{
>> -	memset(addr, 0, PMD_TABLE_SIZE);
>> -}
>> -
>> -struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
>> -
>> -/*
>> - * Create a kmem_cache() for pagetables.  This is not used for PTE
>> - * pages - they're linked to struct page, come from the normal free
>> - * pages pool and have a different entry size (see real_pte_t) to
>> - * everything else.  Caches created by this function are used for all
>> - * the higher level pagetables, and for hugepage pagetables.
>> - */
>> -void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
>> -{
>> -	char *name;
>> -	unsigned long table_size = sizeof(void *) << shift;
>> -	unsigned long align = table_size;
>> -
>> -	/* When batching pgtable pointers for RCU freeing, we store
>> -	 * the index size in the low bits.  Table alignment must be
>> -	 * big enough to fit it.
>> -	 *
>> -	 * Likewise, hugeapge pagetable pointers contain a (different)
>> -	 * shift value in the low bits.  All tables must be aligned so
>> -	 * as to leave enough 0 bits in the address to contain it. */
>> -	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
>> -				     HUGEPD_SHIFT_MASK + 1);
>> -	struct kmem_cache *new;
>> -
>> -	/* It would be nice if this was a BUILD_BUG_ON(), but at the
>> -	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
>> -	 * constant expression, so so much for that. */
>> -	BUG_ON(!is_power_of_2(minalign));
>> -	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
>> -
>> -	if (PGT_CACHE(shift))
>> -		return; /* Already have a cache of this size */
>> -
>> -	align = max_t(unsigned long, align, minalign);
>> -	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
>> -	new = kmem_cache_create(name, table_size, align, 0, ctor);
>> -	kfree(name);
>> -	pgtable_cache[shift - 1] = new;
>> -	pr_debug("Allocated pgtable cache for order %d\n", shift);
>> -}
>> -
>> -
>> -void pgtable_cache_init(void)
>> -{
>> -	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
>> -	pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
>> -	/*
>> -	 * In all current configs, when the PUD index exists it's the
>> -	 * same size as either the pgd or pmd index except with THP enabled
>> -	 * on book3s 64
>> -	 */
>> -	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
>> -		pgtable_cache_add(PUD_INDEX_SIZE, pud_ctor);
>> -
>> -	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
>> -		panic("Couldn't allocate pgtable caches");
>> -	if (PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE))
>> -		panic("Couldn't allocate pud pgtable caches");
>> -}
>> -
>>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>>  /*
>>   * Given an address within the vmemmap, determine the pfn of the page that
>> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
>> index 0ae0572..a65c0b4 100644
>> --- a/arch/powerpc/mm/pgtable_32.c
>> +++ b/arch/powerpc/mm/pgtable_32.c
>> @@ -42,43 +42,6 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
>>
>>  extern char etext[], _stext[], _sinittext[], _einittext[];
>>
>> -#define PGDIR_ORDER	(32 + PGD_T_LOG2 - PGDIR_SHIFT)
>> -
>> -#ifndef CONFIG_PPC_4K_PAGES
>> -static struct kmem_cache *pgtable_cache;
>> -
>> -void pgtable_cache_init(void)
>> -{
>> -	pgtable_cache = kmem_cache_create("PGDIR cache", 1 << PGDIR_ORDER,
>> -					  1 << PGDIR_ORDER, 0, NULL);
>> -	if (pgtable_cache == NULL)
>> -		panic("Couldn't allocate pgtable caches");
>> -}
>> -#endif
>> -
>> -pgd_t *pgd_alloc(struct mm_struct *mm)
>> -{
>> -	pgd_t *ret;
>> -
>> -	/* pgdir take page or two with 4K pages and a page fraction otherwise */
>> -#ifndef CONFIG_PPC_4K_PAGES
>> -	ret = kmem_cache_alloc(pgtable_cache, GFP_KERNEL | __GFP_ZERO);
>> -#else
>> -	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO,
>> -			PGDIR_ORDER - PAGE_SHIFT);
>> -#endif
>> -	return ret;
>> -}
>> -
>> -void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>> -{
>> -#ifndef CONFIG_PPC_4K_PAGES
>> -	kmem_cache_free(pgtable_cache, (void *)pgd);
>> -#else
>> -	free_pages((unsigned long)pgd, PGDIR_ORDER - PAGE_SHIFT);
>> -#endif
>> -}
>> -
>>  __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
>>  {
>>  	pte_t *pte;
>> --
>> 2.1.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-19 18:36     ` christophe leroy
@ 2016-09-20  2:28         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2016-09-20  2:28 UTC (permalink / raw)
  To: christophe leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

christophe leroy <christophe.leroy@c-s.fr> writes:

> Le 19/09/2016 à 07:50, Aneesh Kumar K.V a écrit :
>>
>> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>>> +#else
>>> +static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
>>> +{
>>> +	BUG();
>>> +}
>>> +
>>>  #endif
>>
>>
>> I was expecting that BUG will get removed in the next patch. But I don't
>> see it in the next patch. Considering
>>
>> @@ -475,11 +453,10 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
>>         for (i = 0; i < num_hugepd; i++, hpdp++)
>>                 hpdp->pd = 0;
>>
>> -#ifdef CONFIG_PPC_FSL_BOOK3E
>> -	hugepd_free(tlb, hugepte);
>> -#else
>> -	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
>> -#endif
>> +	if (shift >= pdshift)
>> +		hugepd_free(tlb, hugepte);
>> +	else
>> +		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
>>  }
>>
>> What is that I am missing ?
>>
>
> Previously, the call to hugepd_free() was only compiled when #ifdef 
> CONFIG_PPC_FSL_BOOK3E was set.
> Now it is compiled all the time, but it should never be called outside 
> CONFIG_PPC_FSL_BOOK3E because we always have shift < pdshift in that case.
> So the function needs to be defined anyway but should never be called. 
> Should I just define it static inline {} ?
>

For 8M with 4K mode, we have shift >= pdshift right ?
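
(If I have the numbers right: with 4k pages PGDIR_SHIFT = PAGE_SHIFT +
PTE_INDEX_SIZE = 12 + 10 = 22, so pdshift is 22 at the PGD level, while
an 8M page has shift = 23, hence shift >= pdshift.)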

-aneesh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-19 18:32     ` christophe leroy
@ 2016-09-20  2:45       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2016-09-20  2:45 UTC (permalink / raw)
  To: christophe leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev

christophe leroy <christophe.leroy@c-s.fr> writes:

>>
>>
>>>  	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
>>>  		unsigned shift;
>>>  		unsigned pdshift;
>>> @@ -860,16 +807,31 @@ static int __init hugetlbpage_init(void)
>>>  		 * if we have pdshift and shift value same, we don't
>>>  		 * use pgt cache for hugepd.
>>>  		 */
>>> -		if (pdshift != shift) {
>>> +		if (pdshift > shift) {
>>>  			pgtable_cache_add(pdshift - shift, NULL);
>>>  			if (!PGT_CACHE(pdshift - shift))
>>>  				panic("hugetlbpage_init(): could not create "
>>>  				      "pgtable cache for %d bit pagesize\n", shift);
>>> +		} else if (!hugepte_cache) {
>>> +			/*
>>> +			 * Create a kmem cache for hugeptes.  The bottom bits in
>>> +			 * the pte have size information encoded in them, so
>>> +			 * align them to allow this
>>> +			 */
>>> +			hugepte_cache = kmem_cache_create("hugepte-cache",
>>> +							  sizeof(pte_t),
>>> +							  HUGEPD_SHIFT_MASK + 1,
>>> +							  0, NULL);
>>> +			if (hugepte_cache == NULL)
>>> +				panic("%s: Unable to create kmem cache "
>>> +				      "for hugeptes\n", __func__);
>>> +
>>
>>
>> We don't need hugepte_cache for book3s 64K. I guess we will end up
>> creating one here ?
>
> It should not, because on book3s 64k we will have pdshift > shift,
> won't we ?
>

On 64k book3s we have pdshift == shift, and we don't need to create a
hugepd cache on book3s 64k.
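
One way to make sure it is never created there could be to keep that
branch under the platforms that really need it, along these lines
(rough sketch, untested; the CONFIG_PPC_8xx half only becomes relevant
once patch 3 is in):

		if (pdshift > shift) {
			pgtable_cache_add(pdshift - shift, NULL);
			if (!PGT_CACHE(pdshift - shift))
				panic("hugetlbpage_init(): could not create "
				      "pgtable cache for %d bit pagesize\n", shift);
		}
#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
		else if (!hugepte_cache) {
			/*
			 * Create a kmem cache for hugeptes.  The bottom bits
			 * in the pte have size information encoded in them,
			 * so align them to allow this.  book3s 64k never
			 * reaches this branch.
			 */
			hugepte_cache = kmem_cache_create("hugepte-cache",
							  sizeof(pte_t),
							  HUGEPD_SHIFT_MASK + 1,
							  0, NULL);
			if (hugepte_cache == NULL)
				panic("%s: Unable to create kmem cache "
				      "for hugeptes\n", __func__);
		}
#endif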

-aneesh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-20  2:28         ` Aneesh Kumar K.V
@ 2016-09-20  5:22         ` Christophe Leroy
  -1 siblings, 0 replies; 15+ messages in thread
From: Christophe Leroy @ 2016-09-20  5:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev



Le 20/09/2016 à 04:28, Aneesh Kumar K.V a écrit :
> christophe leroy <christophe.leroy@c-s.fr> writes:
>
>> Le 19/09/2016 à 07:50, Aneesh Kumar K.V a écrit :
>>>
>>> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>>>> +#else
>>>> +static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
>>>> +{
>>>> +	BUG();
>>>> +}
>>>> +
>>>>  #endif
>>>
>>>
>>> I was expecting that BUG will get removed in the next patch. But I don't
>>> see it in the next patch. Considering
>>>
>>> @@ -475,11 +453,10 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
>>>         for (i = 0; i < num_hugepd; i++, hpdp++)
>>>                 hpdp->pd = 0;
>>>
>>> -#ifdef CONFIG_PPC_FSL_BOOK3E
>>> -	hugepd_free(tlb, hugepte);
>>> -#else
>>> -	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
>>> -#endif
>>> +	if (shift >= pdshift)
>>> +		hugepd_free(tlb, hugepte);
>>> +	else
>>> +		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
>>>  }
>>>
>>> What is that I am missing ?
>>>
>>
>> Previously, the call to hugepd_free() was only compiled when #ifdef
>> CONFIG_PPC_FSL_BOOK3E was set.
>> Now it is compiled all the time, but it should never be called outside
>> CONFIG_PPC_FSL_BOOK3E because we always have shift < pdshift in that case.
>> So the function needs to be defined anyway but should never be called.
>> Should I just define it static inline {} ?
>>
>
> For 8M with 4K mode, we have shift >= pdshift right ?
>

Yes, that's why the following patch makes the change below, so that we
also get a real hugepd_free() for the 8xx:

@@ -366,7 +373,7 @@ int alloc_bootmem_huge_page(struct hstate *hstate)
  }
  #endif

-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
  #define HUGEPD_FREELIST_SIZE \
  	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
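
And on the other platforms, where that path can never be taken, I guess
the stub could simply become an empty static inline instead of the
BUG() version, something like (sketch only):

#else
/* Never reached: on these platforms we always have shift < pdshift */
static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte)
{
}
#endif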



Christophe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic
  2016-09-19  5:45   ` Aneesh Kumar K.V
  2016-09-19 18:32     ` christophe leroy
@ 2016-09-21  6:13     ` Christophe Leroy
  1 sibling, 0 replies; 15+ messages in thread
From: Christophe Leroy @ 2016-09-21  6:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Scott Wood
  Cc: linux-kernel, linuxppc-dev



Le 19/09/2016 à 07:45, Aneesh Kumar K.V a écrit :
> Christophe Leroy <christophe.leroy@c-s.fr> writes:
>
>> Today there are two implementations of hugetlbpages which are managed
>> by exclusive #ifdefs:
>> * FSL_BOOKE: several directory entries points to the same single hugepage
>> * BOOK3S: one upper level directory entry points to a table of hugepages
>>
>> In preparation of implementation of hugepage support on the 8xx, we
>> need a mix of the two above solutions, because the 8xx needs both cases
>> depending on the size of pages:
>> * In 4k page size mode, each PGD entry covers a 4M bytes area. It means
>> that 2 PGD entries will be necessary to cover an 8M hugepage while a
>> single PGD entry will cover 8x 512k hugepages.
>> * In 16 page size mode, each PGD entry covers a 64M bytes area. It means
>> that 8x 8M hugepages will be covered by one PGD entry and 64x 512k
>> hugepages will be covers by one PGD entry.
>>
>> This patch:
>> * removes #ifdefs in favor of if/else based on the range sizes
>> * merges the two huge_pte_alloc() functions as they are pretty similar
>> * merges the two hugetlbpage_init() functions as they are pretty similar
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> ---
>> v2: This part is new and results from a split of last patch of v1 serie in
>> two parts
>>
>>  arch/powerpc/mm/hugetlbpage.c | 189 +++++++++++++++++-------------------------
>>  1 file changed, 77 insertions(+), 112 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>> index 8a512b1..2119f00 100644
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c
>> @@ -64,14 +64,16 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
>>  {
>>  	struct kmem_cache *cachep;
>>  	pte_t *new;
>> -
>> -#ifdef CONFIG_PPC_FSL_BOOK3E
>>  	int i;
>> -	int num_hugepd = 1 << (pshift - pdshift);
>> -	cachep = hugepte_cache;
>> -#else
>> -	cachep = PGT_CACHE(pdshift - pshift);
>> -#endif
>> +	int num_hugepd;
>> +
>> +	if (pshift >= pdshift) {
>> +		cachep = hugepte_cache;
>> +		num_hugepd = 1 << (pshift - pdshift);
>> +	} else {
>> +		cachep = PGT_CACHE(pdshift - pshift);
>> +		num_hugepd = 1;
>> +	}
>
> Is there a way to hint likely/unlikely branch based on the page size
> selected at build time ?

Is that really worth it ? Won't it be negligible compared to the other
actions in that function (like for instance kmem_cache_zalloc()) ?
Can't we just trust GCC on that one ?
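
For what it's worth, if we really wanted such a hint, I suppose it would
have to look something like the sketch below (untested, and which side
deserves the 'likely' would itself be a guess):

	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) ?
	    likely(pshift >= pdshift) : unlikely(pshift >= pdshift)) {
		cachep = hugepte_cache;
		num_hugepd = 1 << (pshift - pdshift);
	} else {
		cachep = PGT_CACHE(pdshift - pshift);
		num_hugepd = 1;
	}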

Christophe

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-09-21  6:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-16  7:40 [PATCH v2 0/3] powerpc: implementation of huge pages for 8xx Christophe Leroy
2016-09-16  7:40 ` [PATCH v2 1/3] powerpc: port 64 bits pgtable_cache to 32 bits Christophe Leroy
2016-09-19  5:22   ` Aneesh Kumar K.V
2016-09-19 18:46     ` christophe leroy
2016-09-16  7:40 ` [PATCH v2 2/3] powerpc: get hugetlbpage handling more generic Christophe Leroy
2016-09-19  5:45   ` Aneesh Kumar K.V
2016-09-19 18:32     ` christophe leroy
2016-09-20  2:45       ` Aneesh Kumar K.V
2016-09-21  6:13     ` Christophe Leroy
2016-09-19  5:50   ` Aneesh Kumar K.V
2016-09-19 18:36     ` christophe leroy
2016-09-20  2:28       ` Aneesh Kumar K.V
2016-09-20  5:22         ` Christophe Leroy
2016-09-16  7:40 ` [PATCH v2 3/3] powerpc/8xx: Implement support of hugepages Christophe Leroy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.