linux-riscv.lists.infradead.org archive mirror
* [RFC PATCH 0/7] Introduce sv48 support
@ 2020-03-22 11:00 Alexandre Ghiti
  2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

This patchset implements sv48 support at runtime. The kernel will try to
boot with a 4-level page table and will fall back to a 3-level one if the
HW does not support it.

The biggest advantage is that we only have one kernel for 64-bit, which
is far easier to maintain.

Folding the 4th level into a 3-level page table has almost no cost at
runtime.

At the moment, there is no way to enforce 3-level paging if the HW
supports 4-level page tables: early parameters are parsed after the point
where the choice must be made.

It is based on my relocatable patchset v3, which I have not posted yet;
you can try the sv48 support by using the branch
int/alex/riscv_sv48_runtime_v1 here:

https://github.com/AlexGhiti/riscv-linux

Any feedback appreciated,

Thanks,

Alexandre Ghiti (7):
  riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  riscv: Allow to dynamically define VA_BITS
  riscv: Simplify MAXPHYSMEM config
  riscv: Implement sv48 support
  riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  dt-bindings: riscv: Remove "riscv,svXX" property from device-tree
  riscv: Explicit comment about user virtual address space size

 .../devicetree/bindings/riscv/cpus.yaml       |  13 --
 arch/riscv/Kconfig                            |  34 ++---
 arch/riscv/boot/dts/sifive/fu540-c000.dtsi    |   4 -
 arch/riscv/include/asm/csr.h                  |   3 +-
 arch/riscv/include/asm/fixmap.h               |   1 +
 arch/riscv/include/asm/page.h                 |  15 +-
 arch/riscv/include/asm/pgalloc.h              |  36 +++++
 arch/riscv/include/asm/pgtable-64.h           |  98 +++++++++++-
 arch/riscv/include/asm/pgtable.h              |  24 ++-
 arch/riscv/include/asm/sparsemem.h            |   2 +-
 arch/riscv/kernel/cpu.c                       |  24 +--
 arch/riscv/kernel/head.S                      |  37 ++++-
 arch/riscv/mm/context.c                       |   4 +-
 arch/riscv/mm/init.c                          | 142 +++++++++++++++---
 14 files changed, 341 insertions(+), 96 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  6:10   ` Anup Patel
  2020-04-03 15:17   ` Palmer Dabbelt
  2020-03-22 11:00 ` [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS Alexandre Ghiti
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

There is no need to compare MAX_EARLY_MAPPING_SIZE with PGDIR_SIZE at
compile time: MAX_EARLY_MAPPING_SIZE is set to 128MB, which is less than
PGDIR_SIZE (equal to 1GB), so the early_pmd definition can be
simplified.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/mm/init.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 238bd0033c3f..18bbb426848e 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -247,13 +247,7 @@ static void __init create_pte_mapping(pte_t *ptep,
 
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
-
-#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
-#define NUM_EARLY_PMDS		1UL
-#else
-#define NUM_EARLY_PMDS		(1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
-#endif
-pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
+pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 {
@@ -267,14 +261,12 @@ static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 
 static phys_addr_t __init alloc_pmd(uintptr_t va)
 {
-	uintptr_t pmd_num;
-
 	if (mmu_enabled)
 		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
-	pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
-	BUG_ON(pmd_num >= NUM_EARLY_PMDS);
-	return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
+	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
+
+	return (uintptr_t)early_pmd;
 }
 
 static void __init create_pmd_mapping(pmd_t *pmdp,
-- 
2.20.1




* [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
  2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  6:12   ` Anup Patel
  2020-04-03 15:17   ` Palmer Dabbelt
  2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

Since the 4-level page table can be folded at runtime, the size of the
virtual address space is not known at compile time, so VA_BITS must be
set dynamically in order for sparsemem to reserve the right amount of
memory for struct pages.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/Kconfig                 | 10 ----------
 arch/riscv/include/asm/pgtable.h   | 10 +++++++++-
 arch/riscv/include/asm/sparsemem.h |  2 +-
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index f5f3d474504d..8e4b1cbcf2c2 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -99,16 +99,6 @@ config ZONE_DMA32
 	bool
 	default y if 64BIT
 
-config VA_BITS
-	int
-	default 32 if 32BIT
-	default 39 if 64BIT
-
-config PA_BITS
-	int
-	default 34 if 32BIT
-	default 56 if 64BIT
-
 config PAGE_OFFSET
 	hex
 	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 185ffe3723ec..dce401eed1d3 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -26,6 +26,14 @@
 #endif /* CONFIG_64BIT */
 
 #ifdef CONFIG_MMU
+#ifdef CONFIG_64BIT
+#define VA_BITS		39
+#define PA_BITS		56
+#else
+#define VA_BITS		32
+#define PA_BITS		34
+#endif
+
 /* Number of entries in the page global directory */
 #define PTRS_PER_PGD    (PAGE_SIZE / sizeof(pgd_t))
 /* Number of entries in the page table */
@@ -108,7 +116,7 @@ extern pgd_t swapper_pg_dir[];
  * position vmemmap directly below the VMALLOC region.
  */
 #define VMEMMAP_SHIFT \
-	(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
+	(VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
 #define VMEMMAP_SIZE	BIT(VMEMMAP_SHIFT)
 #define VMEMMAP_END	(VMALLOC_START - 1)
 #define VMEMMAP_START	(VMALLOC_START - VMEMMAP_SIZE)
diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
index 45a7018a8118..f08d72155bc8 100644
--- a/arch/riscv/include/asm/sparsemem.h
+++ b/arch/riscv/include/asm/sparsemem.h
@@ -4,7 +4,7 @@
 #define _ASM_RISCV_SPARSEMEM_H
 
 #ifdef CONFIG_SPARSEMEM
-#define MAX_PHYSMEM_BITS	CONFIG_PA_BITS
+#define MAX_PHYSMEM_BITS	PA_BITS
 #define SECTION_SIZE_BITS	27
 #endif /* CONFIG_SPARSEMEM */
 
-- 
2.20.1




* [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
  2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
  2020-03-22 11:00 ` [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  6:22   ` Anup Patel
                     ` (2 more replies)
  2020-03-22 11:00 ` [RFC PATCH 4/7] riscv: Implement sv48 support Alexandre Ghiti
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

Either the user specifies a maximum physical memory size of 2GB, or the
user lives with the system constraint, which is currently 128GB on 64BIT.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/Kconfig | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8e4b1cbcf2c2..a475c78e66bc 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -104,7 +104,7 @@ config PAGE_OFFSET
 	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
 	default 0x80000000 if 64BIT && !MMU
 	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
-	default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
+	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
 	def_bool y
@@ -216,19 +216,11 @@ config MODULE_SECTIONS
 	bool
 	select HAVE_MOD_ARCH_SPECIFIC
 
-choice
-	prompt "Maximum Physical Memory"
-	default MAXPHYSMEM_2GB if 32BIT
-	default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
-	default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
-
-	config MAXPHYSMEM_2GB
-		bool "2GiB"
-	config MAXPHYSMEM_128GB
-		depends on 64BIT && CMODEL_MEDANY
-		bool "128GiB"
-endchoice
-
+config MAXPHYSMEM_2GB
+	bool "Maximum Physical Memory 2GiB"
+	default y if 32BIT
+	default y if 64BIT && CMODEL_MEDLOW
+	default n
 
 config SMP
 	bool "Symmetric Multi-Processing"
-- 
2.20.1




* [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
                   ` (2 preceding siblings ...)
  2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  7:00   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2020-03-22 11:00 ` [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo Alexandre Ghiti
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

Adding a new 4th page-table level gives a 64-bit kernel the ability to
address 2^48 bytes of virtual address space: in practice, that offers
roughly ~160TB of virtual address space to userspace and allows up to
64TB of physical memory.

By default, the kernel will try to boot with a 4-level page table. If the
underlying hardware does not support it, we automatically fall back to a
standard 3-level page table by folding the new PUD level into the PGDIR
level.

Early page table preparation happens too early in the boot process to use
any device-tree entry, so in order to detect HW capabilities at runtime we
rely on the SATP behaviour of ignoring writes with an unsupported mode.
The mode currently in use by the kernel is then made available through
cpuinfo.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/Kconfig                  |   6 +-
 arch/riscv/include/asm/csr.h        |   3 +-
 arch/riscv/include/asm/fixmap.h     |   1 +
 arch/riscv/include/asm/page.h       |  15 +++-
 arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
 arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
 arch/riscv/include/asm/pgtable.h    |   5 +-
 arch/riscv/kernel/head.S            |  37 ++++++--
 arch/riscv/mm/context.c             |   4 +-
 arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
 10 files changed, 302 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a475c78e66bc..79560e94cc7c 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -66,6 +66,7 @@ config RISCV
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select HAVE_COPY_THREAD_TLS
 	select HAVE_ARCH_KASAN if MMU && 64BIT
+	select RELOCATABLE if 64BIT
 
 config ARCH_MMAP_RND_BITS_MIN
 	default 18 if 64BIT
@@ -104,7 +105,7 @@ config PAGE_OFFSET
 	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
 	default 0x80000000 if 64BIT && !MMU
 	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
-	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
+	default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
 	def_bool y
@@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
 config FIX_EARLYCON_MEM
 	def_bool MMU
 
+# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
+# on a 3-level page table when sv48 is not supported.
 config PGTABLE_LEVELS
 	int
+	default 4 if 64BIT && RELOCATABLE
 	default 3 if 64BIT
 	default 2
 
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 435b65532e29..3828d55af85e 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,11 +40,10 @@
 #ifndef CONFIG_64BIT
 #define SATP_PPN	_AC(0x003FFFFF, UL)
 #define SATP_MODE_32	_AC(0x80000000, UL)
-#define SATP_MODE	SATP_MODE_32
 #else
 #define SATP_PPN	_AC(0x00000FFFFFFFFFFF, UL)
 #define SATP_MODE_39	_AC(0x8000000000000000, UL)
-#define SATP_MODE	SATP_MODE_39
+#define SATP_MODE_48	_AC(0x9000000000000000, UL)
 #endif
 
 /* Exception cause high bit - is an interrupt if set */
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 42d2c42f3cc9..26e7799c5675 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -27,6 +27,7 @@ enum fixed_addresses {
 	FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
 	FIX_PTE,
 	FIX_PMD,
+	FIX_PUD,
 	FIX_EARLYCON_MEM_BASE,
 	__end_of_fixed_addresses
 };
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 691f2f9ded2f..f1a26a0690ef 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -32,11 +32,19 @@
  * physical memory (aligned on a page boundary).
  */
 #ifdef CONFIG_RELOCATABLE
-extern unsigned long kernel_virt_addr;
 #define PAGE_OFFSET		kernel_virt_addr
+
+#ifdef CONFIG_64BIT
+/*
+ * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
+ * define the PAGE_OFFSET value for SV39.
+ */
+#define PAGE_OFFSET_L3		0xffffffe000000000
+#define PAGE_OFFSET_L4		_AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_64BIT */
 #else
 #define PAGE_OFFSET		_AC(CONFIG_PAGE_OFFSET, UL)
-#endif
+#endif /* CONFIG_RELOCATABLE */
 
 #define KERN_VIRT_SIZE		-PAGE_OFFSET
 
@@ -104,6 +112,9 @@ extern unsigned long pfn_base;
 
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long kernel_virt_addr;
+#endif
 
 #define __pa_to_va_nodebug(x)	((void *)((unsigned long) (x) + va_pa_offset))
 #define __va_to_pa_nodebug(x)	((unsigned long)(x) - va_pa_offset)
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 3f601ee8233f..540eaa5a8658 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 
 	set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
 }
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+	if (pgtable_l4_enabled) {
+		unsigned long pfn = virt_to_pfn(pud);
+
+		set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+	}
+}
+
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
+				     pud_t *pud)
+{
+	if (pgtable_l4_enabled) {
+		unsigned long pfn = virt_to_pfn(pud);
+
+		set_p4d_safe(p4d,
+			     __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+	}
+}
+
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+	if (pgtable_l4_enabled)
+		return (pud_t *)__get_free_page(
+				GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return NULL;
+}
+
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+	if (pgtable_l4_enabled)
+		free_page((unsigned long)pud);
+}
+
+#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
 #define pmd_pgtable(pmd)	pmd_page(pmd)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index b15f70a1fdfa..cc4ffbe778f3 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -8,16 +8,32 @@
 
 #include <linux/const.h>
 
-#define PGDIR_SHIFT     30
+extern bool pgtable_l4_enabled;
+
+#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
 /* Size of region mapped by a page global directory */
 #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
 
+/* pud is folded into pgd in case of 3-level page table */
+#define PUD_SHIFT	30
+#define PUD_SIZE	(_AC(1, UL) << PUD_SHIFT)
+#define PUD_MASK	(~(PUD_SIZE - 1))
+
 #define PMD_SHIFT       21
 /* Size of region mapped by a page middle directory */
 #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK        (~(PMD_SIZE - 1))
 
+/* Page Upper Directory entry */
+typedef struct {
+	unsigned long pud;
+} pud_t;
+
+#define pud_val(x)      ((x).pud)
+#define __pud(x)        ((pud_t) { (x) })
+#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
+
 /* Page Middle Directory entry */
 typedef struct {
 	unsigned long pmd;
@@ -25,7 +41,6 @@ typedef struct {
 
 #define pmd_val(x)      ((x).pmd)
 #define __pmd(x)        ((pmd_t) { (x) })
-
 #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
 
 static inline int pud_present(pud_t pud)
@@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
 	set_pud(pudp, __pud(0));
 }
 
+static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
+{
+	return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+}
+
+static inline unsigned long _pud_pfn(pud_t pud)
+{
+	return pud_val(pud) >> _PAGE_PFN_SHIFT;
+}
+
 static inline unsigned long pud_page_vaddr(pud_t pud)
 {
 	return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
@@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
 	return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
 }
 
+#define mm_pud_folded	mm_pud_folded
+static inline bool mm_pud_folded(struct mm_struct *mm)
+{
+	if (pgtable_l4_enabled)
+		return false;
+
+	return true;
+}
+
 #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
 
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
@@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
 #define pmd_ERROR(e) \
 	pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
 
+#define pud_ERROR(e)	\
+	pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		*p4dp = p4d;
+	else
+		set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
+}
+
+static inline int p4d_none(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (p4d_val(p4d) == 0);
+
+	return 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (p4d_val(p4d) & _PAGE_PRESENT);
+
+	return 1;
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return !p4d_present(p4d);
+
+	return 0;
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+	if (pgtable_l4_enabled)
+		set_p4d(p4d, __p4d(0));
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+	if (pgtable_l4_enabled)
+		return (unsigned long)pfn_to_virt(
+				p4d_val(p4d) >> _PAGE_PFN_SHIFT);
+
+	return pud_page_vaddr((pud_t) { p4d_val(p4d) });
+}
+
+#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+	if (pgtable_l4_enabled)
+		return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+
+	return (pud_t *)p4d;
+}
+
 #endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index dce401eed1d3..06361db3f486 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -13,8 +13,7 @@
 
 #ifndef __ASSEMBLY__
 
-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable-nop4d.h>
 #include <asm/page.h>
 #include <asm/tlbflush.h>
 #include <linux/mm_types.h>
@@ -27,7 +26,7 @@
 
 #ifdef CONFIG_MMU
 #ifdef CONFIG_64BIT
-#define VA_BITS		39
+#define VA_BITS		(pgtable_l4_enabled ? 48 : 39)
 #define PA_BITS		56
 #else
 #define VA_BITS		32
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 1c2fbefb8786..22617bd7477f 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -113,6 +113,8 @@ clear_bss_done:
 	call setup_vm
 #ifdef CONFIG_MMU
 	la a0, early_pg_dir
+	la a1, satp_mode
+	REG_L a1, (a1)
 	call relocate
 #endif /* CONFIG_MMU */
 
@@ -131,24 +133,28 @@ clear_bss_done:
 #ifdef CONFIG_MMU
 relocate:
 #ifdef CONFIG_RELOCATABLE
-	/* Relocate return address */
-	la a1, kernel_virt_addr
-	REG_L a1, 0(a1)
+	/*
+	 * Relocate return address but save it in case 4-level page table is
+	 * not supported.
+	 */
+	mv s1, ra
+	la a3, kernel_virt_addr
+	REG_L a3, 0(a3)
 #else
-	li a1, PAGE_OFFSET
+	li a3, PAGE_OFFSET
 #endif
 	la a2, _start
-	sub a1, a1, a2
-	add ra, ra, a1
+	sub a3, a3, a2
+	add ra, ra, a3
 
 	/* Point stvec to virtual address of intruction after satp write */
 	la a2, 1f
-	add a2, a2, a1
+	add a2, a2, a3
 	csrw CSR_TVEC, a2
 
+	/* First try with a 4-level page table */
 	/* Compute satp for kernel page tables, but don't load it yet */
 	srl a2, a0, PAGE_SHIFT
-	li a1, SATP_MODE
 	or a2, a2, a1
 
 	/*
@@ -162,6 +168,19 @@ relocate:
 	or a0, a0, a1
 	sfence.vma
 	csrw CSR_SATP, a0
+#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
+	/*
+	 * If we fall through here, that means the HW does not support SV48.
+	 * We need a 3-level page table then simply fold pud into pgd level
+	 * and finally jump back to relocate with 3-level parameters.
+	 */
+	call setup_vm_fold_pud
+
+	la a0, early_pg_dir
+	li a1, SATP_MODE_39
+	mv ra, s1
+	tail relocate
+#endif
 .align 2
 1:
 	/* Set trap vector to spin forever to help debug */
@@ -213,6 +232,8 @@ relocate:
 #ifdef CONFIG_MMU
 	/* Enable virtual memory and relocate to virtual address */
 	la a0, swapper_pg_dir
+	la a1, satp_mode
+	REG_L a1, (a1)
 	call relocate
 #endif
 
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 613ec81a8979..152b423c02ea 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -9,6 +9,8 @@
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 
+extern uint64_t satp_mode;
+
 /*
  * When necessary, performs a deferred icache flush for the given MM context,
  * on the local CPU.  RISC-V has no direct mechanism for instruction cache
@@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	cpumask_set_cpu(cpu, mm_cpumask(next));
 
 #ifdef CONFIG_MMU
-	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
+	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
 	local_flush_tlb_all();
 #endif
 
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 18bbb426848e..ad96667d2ab6 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -24,6 +24,17 @@
 
 #include "../kernel/head.h"
 
+#ifdef CONFIG_64BIT
+uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
+				SATP_MODE_39 : SATP_MODE_48;
+bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
+#else
+uint64_t satp_mode = SATP_MODE_32;
+bool pgtable_l4_enabled = false;
+#endif
+EXPORT_SYMBOL(pgtable_l4_enabled);
+EXPORT_SYMBOL(satp_mode);
+
 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
 							__page_aligned_bss;
 EXPORT_SYMBOL(empty_zero_page);
@@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
+pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
+pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 {
@@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
 	if (mmu_enabled)
 		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
-	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
+	/* Only one PMD is available for early mapping */
+	BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
 
 	return (uintptr_t)early_pmd;
 }
@@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
 	create_pte_mapping(ptep, va, pa, sz, prot);
 }
 
-#define pgd_next_t		pmd_t
-#define alloc_pgd_next(__va)	alloc_pmd(__va)
-#define get_pgd_next_virt(__pa)	get_pmd_virt(__pa)
+static pud_t *__init get_pud_virt(phys_addr_t pa)
+{
+	if (mmu_enabled) {
+		clear_fixmap(FIX_PUD);
+		return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
+	} else {
+		return (pud_t *)((uintptr_t)pa);
+	}
+}
+
+static phys_addr_t __init alloc_pud(uintptr_t va)
+{
+	if (mmu_enabled)
+		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+
+	/* Only one PUD is available for early mapping */
+	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
+
+	return (uintptr_t)early_pud;
+}
+
+static void __init create_pud_mapping(pud_t *pudp,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot)
+{
+	pmd_t *nextp;
+	phys_addr_t next_phys;
+	uintptr_t pud_index = pud_index(va);
+
+	if (sz == PUD_SIZE) {
+		if (pud_val(pudp[pud_index]) == 0)
+			pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
+		return;
+	}
+
+	if (pud_val(pudp[pud_index]) == 0) {
+		next_phys = alloc_pmd(va);
+		pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
+		nextp = get_pmd_virt(next_phys);
+		memset(nextp, 0, PAGE_SIZE);
+	} else {
+		next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
+		nextp = get_pmd_virt(next_phys);
+	}
+
+	create_pmd_mapping(nextp, va, pa, sz, prot);
+}
+
+#define pgd_next_t		pud_t
+#define alloc_pgd_next(__va)	alloc_pud(__va)
+#define get_pgd_next_virt(__pa)	get_pud_virt(__pa)
 #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
-	create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next		fixmap_pmd
+	create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
+#define fixmap_pgd_next		(pgtable_l4_enabled ?			\
+			(uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
+#define trampoline_pgd_next	(pgtable_l4_enabled ?			\
+			(uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
 #else
 #define pgd_next_t		pte_t
 #define alloc_pgd_next(__va)	alloc_pte(__va)
 #define get_pgd_next_virt(__pa)	get_pte_virt(__pa)
 #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
 	create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next		fixmap_pte
+#define fixmap_pgd_next		((uintptr_t)fixmap_pte)
 #endif
 
 static void __init create_pgd_mapping(pgd_t *pgdp,
@@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
 	phys_addr_t next_phys;
 	uintptr_t pgd_index = pgd_index(va);
 
+#ifndef __PAGETABLE_PMD_FOLDED
+	if (!pgtable_l4_enabled) {
+		create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
+		return;
+	}
+#endif
+
 	if (sz == PGDIR_SIZE) {
 		if (pgd_val(pgdp[pgd_index]) == 0)
 			pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
@@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
 	/* Setup early PGD for fixmap */
 	create_pgd_mapping(early_pg_dir, FIXADDR_START,
-			   (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+			   fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
 
 #ifndef __PAGETABLE_PMD_FOLDED
-	/* Setup fixmap PMD */
+	/* Setup fixmap PUD and PMD */
+	if (pgtable_l4_enabled)
+		create_pud_mapping(fixmap_pud, FIXADDR_START,
+			   (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
 	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
 			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
+
 	/* Setup trampoline PGD and PMD */
 	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
-			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
+			   trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+	if (pgtable_l4_enabled)
+		create_pud_mapping(trampoline_pud, PAGE_OFFSET,
+			   (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
 	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
 			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
 #else
@@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	dtb_early_pa = dtb_pa;
 }
 
+#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
+/*
+ * This function is called only if the current kernel is 64bit and the HW
+ * does not support sv48.
+ */
+asmlinkage __init void setup_vm_fold_pud(void)
+{
+	pgtable_l4_enabled = false;
+	kernel_virt_addr = PAGE_OFFSET_L3;
+	satp_mode = SATP_MODE_39;
+
+	/*
+	 * PTE/PMD levels do not need to be cleared as they are common between
+	 * 3- and 4-level page tables: the 30 least significant bits
+	 * (2 * 9 + 12) are common.
+	 */
+	memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
+	memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
+
+	setup_vm(dtb_early_pa);
+}
+#endif
+
 static void __init setup_vm_final(void)
 {
 	uintptr_t va, map_size;
@@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
 		}
 	}
 
-	/* Clear fixmap PTE and PMD mappings */
+	/* Clear fixmap page table mappings */
 	clear_fixmap(FIX_PTE);
 	clear_fixmap(FIX_PMD);
+	clear_fixmap(FIX_PUD);
 
 	/* Move to swapper page table */
-	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
+	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
 	local_flush_tlb_all();
 }
 #else
-- 
2.20.1




* [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
                   ` (3 preceding siblings ...)
  2020-03-22 11:00 ` [RFC PATCH 4/7] riscv: Implement sv48 support Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  7:01   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2020-03-22 11:00 ` [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree Alexandre Ghiti
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

Now that the mmu type is determined at runtime from the SATP
characteristics, use the global variable pgtable_l4_enabled to output the
mmu type of the processor through /proc/cpuinfo instead of relying on
device-tree information.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/boot/dts/sifive/fu540-c000.dtsi |  4 ----
 arch/riscv/kernel/cpu.c                    | 24 ++++++++++++----------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
index 7db861053483..6138590a2229 100644
--- a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
+++ b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
@@ -50,7 +50,6 @@
 			i-cache-size = <32768>;
 			i-tlb-sets = <1>;
 			i-tlb-size = <32>;
-			mmu-type = "riscv,sv39";
 			reg = <1>;
 			riscv,isa = "rv64imafdc";
 			tlb-split;
@@ -74,7 +73,6 @@
 			i-cache-size = <32768>;
 			i-tlb-sets = <1>;
 			i-tlb-size = <32>;
-			mmu-type = "riscv,sv39";
 			reg = <2>;
 			riscv,isa = "rv64imafdc";
 			tlb-split;
@@ -98,7 +96,6 @@
 			i-cache-size = <32768>;
 			i-tlb-sets = <1>;
 			i-tlb-size = <32>;
-			mmu-type = "riscv,sv39";
 			reg = <3>;
 			riscv,isa = "rv64imafdc";
 			tlb-split;
@@ -122,7 +119,6 @@
 			i-cache-size = <32768>;
 			i-tlb-sets = <1>;
 			i-tlb-size = <32>;
-			mmu-type = "riscv,sv39";
 			reg = <4>;
 			riscv,isa = "rv64imafdc";
 			tlb-split;
diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 40a3c442ac5f..38a699b997a8 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -8,6 +8,8 @@
 #include <linux/of.h>
 #include <asm/smp.h>
 
+extern bool pgtable_l4_enabled;
+
 /*
  * Returns the hart ID of the given device tree node, or -ENODEV if the node
  * isn't an enabled and valid RISC-V hart node.
@@ -54,18 +56,19 @@ static void print_isa(struct seq_file *f, const char *isa)
 	seq_puts(f, "\n");
 }
 
-static void print_mmu(struct seq_file *f, const char *mmu_type)
+static void print_mmu(struct seq_file *f)
 {
+	char sv_type[16];
+
 #if defined(CONFIG_32BIT)
-	if (strcmp(mmu_type, "riscv,sv32") != 0)
-		return;
+	strncpy(sv_type, "sv32", 5);
 #elif defined(CONFIG_64BIT)
-	if (strcmp(mmu_type, "riscv,sv39") != 0 &&
-	    strcmp(mmu_type, "riscv,sv48") != 0)
-		return;
+	if (pgtable_l4_enabled)
+		strncpy(sv_type, "sv48", 5);
+	else
+		strncpy(sv_type, "sv39", 5);
 #endif
-
-	seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
+	seq_printf(f, "mmu\t\t: %s\n", sv_type);
 }
 
 static void *c_start(struct seq_file *m, loff_t *pos)
@@ -90,14 +93,13 @@ static int c_show(struct seq_file *m, void *v)
 {
 	unsigned long cpu_id = (unsigned long)v - 1;
 	struct device_node *node = of_get_cpu_node(cpu_id, NULL);
-	const char *compat, *isa, *mmu;
+	const char *compat, *isa;
 
 	seq_printf(m, "processor\t: %lu\n", cpu_id);
 	seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
 	if (!of_property_read_string(node, "riscv,isa", &isa))
 		print_isa(m, isa);
-	if (!of_property_read_string(node, "mmu-type", &mmu))
-		print_mmu(m, mmu);
+	print_mmu(m);
 	if (!of_property_read_string(node, "compatible", &compat)
 	    && strcmp(compat, "riscv"))
 		seq_printf(m, "uarch\t\t: %s\n", compat);
-- 
2.20.1




* [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
                   ` (4 preceding siblings ...)
  2020-03-22 11:00 ` [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  7:03   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2020-03-22 11:00 ` [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size Alexandre Ghiti
  2020-03-31 19:53 ` [RFC PATCH 0/7] Introduce sv48 support Palmer Dabbelt
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

This property cannot be used before virtual memory is set up, and the
distinction between sv39 and sv48 is now made at runtime using the
SATP CSR: this property is therefore useless, so remove it.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 Documentation/devicetree/bindings/riscv/cpus.yaml | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml b/Documentation/devicetree/bindings/riscv/cpus.yaml
index 04819ad379c2..12baabbac213 100644
--- a/Documentation/devicetree/bindings/riscv/cpus.yaml
+++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
@@ -39,19 +39,6 @@ properties:
       Identifies that the hart uses the RISC-V instruction set
       and identifies the type of the hart.
 
-  mmu-type:
-    allOf:
-      - $ref: "/schemas/types.yaml#/definitions/string"
-      - enum:
-          - riscv,sv32
-          - riscv,sv39
-          - riscv,sv48
-    description:
-      Identifies the MMU address translation mode used on this
-      hart.  These values originate from the RISC-V Privileged
-      Specification document, available from
-      https://riscv.org/specifications/
-
   riscv,isa:
     allOf:
       - $ref: "/schemas/types.yaml#/definitions/string"
-- 
2.20.1




* [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
                   ` (5 preceding siblings ...)
  2020-03-22 11:00 ` [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree Alexandre Ghiti
@ 2020-03-22 11:00 ` Alexandre Ghiti
  2020-03-26  7:05   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2020-03-31 19:53 ` [RFC PATCH 0/7] Introduce sv48 support Palmer Dabbelt
  7 siblings, 2 replies; 35+ messages in thread
From: Alexandre Ghiti @ 2020-03-22 11:00 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Zong Li, Anup Patel,
	Christoph Hellwig, linux-riscv, linux-kernel
  Cc: Alexandre Ghiti

Define precisely the size of the user-accessible virtual address space
for the sv32/39/48 MMU types and explain why the whole virtual address
space is split into two equal halves between kernel and user space.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 arch/riscv/include/asm/pgtable.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 06361db3f486..be117a0b4ea1 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -456,8 +456,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
 
 /*
- * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
- * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
+ * Task size is:
+ * -     0x9fc00000 (~2.5GB) for RV32.
+ * -   0x4000000000 ( 256GB) for RV64 using SV39 mmu
+ * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
+ *
+ * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
+ * Instruction Set Manual Volume II: Privileged Architecture" states that
+ * "load and store effective addresses, which are 64 bits, must have bits
+ * 63–48 all equal to bit 47, or else a page-fault exception will occur."
  */
 #ifdef CONFIG_64BIT
 #define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
-- 
2.20.1




* Re: [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
@ 2020-03-26  6:10   ` Anup Patel
  2020-04-03 15:17   ` Palmer Dabbelt
  1 sibling, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  6:10 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:31 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> There is no need to compare at compile time MAX_EARLY_MAPPING_SIZE value
> with PGDIR_SIZE since MAX_EARLY_MAPPING_SIZE is set to 128MB which is less
> than PGDIR_SIZE that is equal to 1GB: that allows to simplify early_pmd
> definition.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/mm/init.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 238bd0033c3f..18bbb426848e 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -247,13 +247,7 @@ static void __init create_pte_mapping(pte_t *ptep,
>
>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> -
> -#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
> -#define NUM_EARLY_PMDS         1UL
> -#else
> -#define NUM_EARLY_PMDS         (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
> -#endif
> -pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
> +pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>
>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>  {
> @@ -267,14 +261,12 @@ static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>
>  static phys_addr_t __init alloc_pmd(uintptr_t va)
>  {
> -       uintptr_t pmd_num;
> -
>         if (mmu_enabled)
>                 return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -       pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> -       BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> -       return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> +       BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +
> +       return (uintptr_t)early_pmd;
>  }
>
>  static void __init create_pmd_mapping(pmd_t *pmdp,
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup



* Re: [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS
  2020-03-22 11:00 ` [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS Alexandre Ghiti
@ 2020-03-26  6:12   ` Anup Patel
  2020-04-03 15:17   ` Palmer Dabbelt
  1 sibling, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  6:12 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> With 4-level page table folding at runtime, we don't know at compile time
> the size of the virtual address space so we must set VA_BITS dynamically
> so that sparsemem reserves the right amount of memory for struct pages.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig                 | 10 ----------
>  arch/riscv/include/asm/pgtable.h   | 10 +++++++++-
>  arch/riscv/include/asm/sparsemem.h |  2 +-
>  3 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index f5f3d474504d..8e4b1cbcf2c2 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -99,16 +99,6 @@ config ZONE_DMA32
>         bool
>         default y if 64BIT
>
> -config VA_BITS
> -       int
> -       default 32 if 32BIT
> -       default 39 if 64BIT
> -
> -config PA_BITS
> -       int
> -       default 34 if 32BIT
> -       default 56 if 64BIT
> -
>  config PAGE_OFFSET
>         hex
>         default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 185ffe3723ec..dce401eed1d3 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -26,6 +26,14 @@
>  #endif /* CONFIG_64BIT */
>
>  #ifdef CONFIG_MMU
> +#ifdef CONFIG_64BIT
> +#define VA_BITS                39
> +#define PA_BITS                56
> +#else
> +#define VA_BITS                32
> +#define PA_BITS                34
> +#endif
> +
>  /* Number of entries in the page global directory */
>  #define PTRS_PER_PGD    (PAGE_SIZE / sizeof(pgd_t))
>  /* Number of entries in the page table */
> @@ -108,7 +116,7 @@ extern pgd_t swapper_pg_dir[];
>   * position vmemmap directly below the VMALLOC region.
>   */
>  #define VMEMMAP_SHIFT \
> -       (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
> +       (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>  #define VMEMMAP_SIZE   BIT(VMEMMAP_SHIFT)
>  #define VMEMMAP_END    (VMALLOC_START - 1)
>  #define VMEMMAP_START  (VMALLOC_START - VMEMMAP_SIZE)
> diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
> index 45a7018a8118..f08d72155bc8 100644
> --- a/arch/riscv/include/asm/sparsemem.h
> +++ b/arch/riscv/include/asm/sparsemem.h
> @@ -4,7 +4,7 @@
>  #define _ASM_RISCV_SPARSEMEM_H
>
>  #ifdef CONFIG_SPARSEMEM
> -#define MAX_PHYSMEM_BITS       CONFIG_PA_BITS
> +#define MAX_PHYSMEM_BITS       PA_BITS
>  #define SECTION_SIZE_BITS      27
>  #endif /* CONFIG_SPARSEMEM */
>
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup



* Re: [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config
  2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
@ 2020-03-26  6:22   ` Anup Patel
  2020-03-26  6:34   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2 siblings, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  6:22 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:33 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Either the user specifies maximum physical memory size of 2GB or the
> user lives with the system constraint which is 128GB in 64BIT for now.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 8e4b1cbcf2c2..a475c78e66bc 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -104,7 +104,7 @@ config PAGE_OFFSET
>         default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>         default 0x80000000 if 64BIT && !MMU
>         default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> -       default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
> +       default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>
>  config ARCH_FLATMEM_ENABLE
>         def_bool y
> @@ -216,19 +216,11 @@ config MODULE_SECTIONS
>         bool
>         select HAVE_MOD_ARCH_SPECIFIC
>
> -choice
> -       prompt "Maximum Physical Memory"
> -       default MAXPHYSMEM_2GB if 32BIT
> -       default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
> -       default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
> -
> -       config MAXPHYSMEM_2GB
> -               bool "2GiB"
> -       config MAXPHYSMEM_128GB
> -               depends on 64BIT && CMODEL_MEDANY
> -               bool "128GiB"
> -endchoice
> -
> +config MAXPHYSMEM_2GB
> +       bool "Maximum Physical Memory 2GiB"
> +       default y if 32BIT
> +       default y if 64BIT && CMODEL_MEDLOW
> +       default n
>
>  config SMP
>         bool "Symmetric Multi-Processing"
> --
> 2.20.1
>

Currently, we don't have systems with 256GB or more memory,
but I am sure quite a few organizations are working on server-class
RISC-V.

Let's not force the RV64 physical memory limit to 128GB by
removing MAXPHYSMEM_128GB. On the contrary, I suggest adding
more options such as MAXPHYSMEM_256GB, MAXPHYSMEM_512GB,
and MAXPHYSMEM_1TB.

Regards,
Anup



* Re: [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config
  2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
  2020-03-26  6:22   ` Anup Patel
@ 2020-03-26  6:34   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  2 siblings, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  6:34 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:33 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Either the user specifies maximum physical memory size of 2GB or the
> user lives with the system constraint which is 128GB in 64BIT for now.

Ignore my previous comment. I see that you are setting PAGE_OFFSET
to 0xffffc00000000000 in the next patch.

The commit description can be improved a bit as follows:

Either the user specifies maximum physical memory size of 2GB or the
user lives with the current system constraint which is 1/4th of maximum
addressable memory in Sv39 MMU mode (i.e. 128GB) for now.

Other than above, looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup

>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 8e4b1cbcf2c2..a475c78e66bc 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -104,7 +104,7 @@ config PAGE_OFFSET
>         default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>         default 0x80000000 if 64BIT && !MMU
>         default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> -       default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
> +       default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>
>  config ARCH_FLATMEM_ENABLE
>         def_bool y
> @@ -216,19 +216,11 @@ config MODULE_SECTIONS
>         bool
>         select HAVE_MOD_ARCH_SPECIFIC
>
> -choice
> -       prompt "Maximum Physical Memory"
> -       default MAXPHYSMEM_2GB if 32BIT
> -       default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
> -       default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
> -
> -       config MAXPHYSMEM_2GB
> -               bool "2GiB"
> -       config MAXPHYSMEM_128GB
> -               depends on 64BIT && CMODEL_MEDANY
> -               bool "128GiB"
> -endchoice
> -
> +config MAXPHYSMEM_2GB
> +       bool "Maximum Physical Memory 2GiB"
> +       default y if 32BIT
> +       default y if 64BIT && CMODEL_MEDLOW
> +       default n
>
>  config SMP
>         bool "Symmetric Multi-Processing"
> --
> 2.20.1
>



* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-03-22 11:00 ` [RFC PATCH 4/7] riscv: Implement sv48 support Alexandre Ghiti
@ 2020-03-26  7:00   ` Anup Patel
  2020-03-31 16:31     ` Alex Ghiti
  2020-04-03 15:53   ` Palmer Dabbelt
  1 sibling, 1 reply; 35+ messages in thread
From: Anup Patel @ 2020-03-26  7:00 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:34 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> By adding a new 4th level of page table, give the possibility to 64bit
> kernel to address 2^48 bytes of virtual address: in practice, that roughly
> offers ~160TB of virtual address space to userspace and allows up to 64TB
> of physical memory.
>
> By default, the kernel will try to boot with a 4-level page table. If the
> underlying hardware does not support it, we will automatically fallback to
> a standard 3-level page table by folding the new PUD level into PGDIR
> level.
>
> Early page table preparation is too early in the boot process to use any
> device-tree entry, then in order to detect HW capabilities at runtime, we
> use SATP feature that ignores writes with an unsupported mode. The current
> mode used by the kernel is then made available through cpuinfo.

Overall the patch looks fine, but I don't agree with the strategy of
detecting Sv48 mode in relocate() of head.S.

Instead of this, I suggest that we have a separate patch before this
one which just adds the satp_mode global variable.

This patch can then detect and update both the satp_mode and
pgtable_l4_enabled variables in setup_vm() (i.e. in C sources) before
setting up the initial page tables. This way we can avoid adding the
setup_vm_fold_pud() function, and in future we will not require any
changes in head.S when a new page table format shows up to support
64K pages. This means MMU mode detection will always be restricted to
setup_vm() in mm/init.c.

To achieve Sv48 mode detection in C sources, we just need one 4KB
page for the PGD (e.g. trampoline_pg_dir). To detect Sv48, we create
a flat identity mapping directly in the PGD assuming Sv48 is available
and try enabling it. If Sv48 is enabled successfully then reading back
the SATP CSR will show the mode as Sv48; otherwise the mode will not
be set.

As an example, refer __detect_pgtbl_mode() function of Xvisor at
https://github.com/avpatel/xvisor-next/blob/master/arch/riscv/cpu/generic/cpu_mmu_initial_pgtbl.c

Regards,
Anup

>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig                  |   6 +-
>  arch/riscv/include/asm/csr.h        |   3 +-
>  arch/riscv/include/asm/fixmap.h     |   1 +
>  arch/riscv/include/asm/page.h       |  15 +++-
>  arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
>  arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
>  arch/riscv/include/asm/pgtable.h    |   5 +-
>  arch/riscv/kernel/head.S            |  37 ++++++--
>  arch/riscv/mm/context.c             |   4 +-
>  arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
>  10 files changed, 302 insertions(+), 31 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index a475c78e66bc..79560e94cc7c 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -66,6 +66,7 @@ config RISCV
>         select ARCH_HAS_GCOV_PROFILE_ALL
>         select HAVE_COPY_THREAD_TLS
>         select HAVE_ARCH_KASAN if MMU && 64BIT
> +       select RELOCATABLE if 64BIT
>
>  config ARCH_MMAP_RND_BITS_MIN
>         default 18 if 64BIT
> @@ -104,7 +105,7 @@ config PAGE_OFFSET
>         default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>         default 0x80000000 if 64BIT && !MMU
>         default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> -       default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
> +       default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
>
>  config ARCH_FLATMEM_ENABLE
>         def_bool y
> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
>  config FIX_EARLYCON_MEM
>         def_bool MMU
>
> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
> +# on a 3-level page table when sv48 is not supported.
>  config PGTABLE_LEVELS
>         int
> +       default 4 if 64BIT && RELOCATABLE
>         default 3 if 64BIT
>         default 2
>
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 435b65532e29..3828d55af85e 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -40,11 +40,10 @@
>  #ifndef CONFIG_64BIT
>  #define SATP_PPN       _AC(0x003FFFFF, UL)
>  #define SATP_MODE_32   _AC(0x80000000, UL)
> -#define SATP_MODE      SATP_MODE_32
>  #else
>  #define SATP_PPN       _AC(0x00000FFFFFFFFFFF, UL)
>  #define SATP_MODE_39   _AC(0x8000000000000000, UL)
> -#define SATP_MODE      SATP_MODE_39
> +#define SATP_MODE_48   _AC(0x9000000000000000, UL)
>  #endif
>
>  /* Exception cause high bit - is an interrupt if set */
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 42d2c42f3cc9..26e7799c5675 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -27,6 +27,7 @@ enum fixed_addresses {
>         FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
>         FIX_PTE,
>         FIX_PMD,
> +       FIX_PUD,
>         FIX_EARLYCON_MEM_BASE,
>         __end_of_fixed_addresses
>  };
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 691f2f9ded2f..f1a26a0690ef 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -32,11 +32,19 @@
>   * physical memory (aligned on a page boundary).
>   */
>  #ifdef CONFIG_RELOCATABLE
> -extern unsigned long kernel_virt_addr;
>  #define PAGE_OFFSET            kernel_virt_addr
> +
> +#ifdef CONFIG_64BIT
> +/*
> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> + * define the PAGE_OFFSET value for SV39.
> + */
> +#define PAGE_OFFSET_L3         0xffffffe000000000
> +#define PAGE_OFFSET_L4         _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif /* CONFIG_64BIT */
>  #else
>  #define PAGE_OFFSET            _AC(CONFIG_PAGE_OFFSET, UL)
> -#endif
> +#endif /* CONFIG_RELOCATABLE */
>
>  #define KERN_VIRT_SIZE         -PAGE_OFFSET
>
> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
>
>  extern unsigned long max_low_pfn;
>  extern unsigned long min_low_pfn;
> +#ifdef CONFIG_RELOCATABLE
> +extern unsigned long kernel_virt_addr;
> +#endif
>
>  #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>  #define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 3f601ee8233f..540eaa5a8658 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>
>         set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>  }
> +
> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> +{
> +       if (pgtable_l4_enabled) {
> +               unsigned long pfn = virt_to_pfn(pud);
> +
> +               set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> +       }
> +}
> +
> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> +                                    pud_t *pud)
> +{
> +       if (pgtable_l4_enabled) {
> +               unsigned long pfn = virt_to_pfn(pud);
> +
> +               set_p4d_safe(p4d,
> +                            __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> +       }
> +}
> +
> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> +{
> +       if (pgtable_l4_enabled)
> +               return (pud_t *)__get_free_page(
> +                               GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
> +       return NULL;
> +}
> +
> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> +{
> +       if (pgtable_l4_enabled)
> +               free_page((unsigned long)pud);
> +}
> +
> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
>  #endif /* __PAGETABLE_PMD_FOLDED */
>
>  #define pmd_pgtable(pmd)       pmd_page(pmd)
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index b15f70a1fdfa..cc4ffbe778f3 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -8,16 +8,32 @@
>
>  #include <linux/const.h>
>
> -#define PGDIR_SHIFT     30
> +extern bool pgtable_l4_enabled;
> +
> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
>  /* Size of region mapped by a page global directory */
>  #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>  #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>
> +/* pud is folded into pgd in case of 3-level page table */
> +#define PUD_SHIFT      30
> +#define PUD_SIZE       (_AC(1, UL) << PUD_SHIFT)
> +#define PUD_MASK       (~(PUD_SIZE - 1))
> +
>  #define PMD_SHIFT       21
>  /* Size of region mapped by a page middle directory */
>  #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>  #define PMD_MASK        (~(PMD_SIZE - 1))
>
> +/* Page Upper Directory entry */
> +typedef struct {
> +       unsigned long pud;
> +} pud_t;
> +
> +#define pud_val(x)      ((x).pud)
> +#define __pud(x)        ((pud_t) { (x) })
> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
> +
>  /* Page Middle Directory entry */
>  typedef struct {
>         unsigned long pmd;
> @@ -25,7 +41,6 @@ typedef struct {
>
>  #define pmd_val(x)      ((x).pmd)
>  #define __pmd(x)        ((pmd_t) { (x) })
> -
>  #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
>
>  static inline int pud_present(pud_t pud)
> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
>         set_pud(pudp, __pud(0));
>  }
>
> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> +{
> +       return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> +}
> +
> +static inline unsigned long _pud_pfn(pud_t pud)
> +{
> +       return pud_val(pud) >> _PAGE_PFN_SHIFT;
> +}
> +
>  static inline unsigned long pud_page_vaddr(pud_t pud)
>  {
>         return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
>         return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>  }
>
> +#define mm_pud_folded  mm_pud_folded
> +static inline bool mm_pud_folded(struct mm_struct *mm)
> +{
> +       if (pgtable_l4_enabled)
> +               return false;
> +
> +       return true;
> +}
> +
>  #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>
>  static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>  #define pmd_ERROR(e) \
>         pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> +#define pud_ERROR(e)   \
> +       pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> +
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               *p4dp = p4d;
> +       else
> +               set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> +}
> +
> +static inline int p4d_none(p4d_t p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               return (p4d_val(p4d) == 0);
> +
> +       return 0;
> +}
> +
> +static inline int p4d_present(p4d_t p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               return (p4d_val(p4d) & _PAGE_PRESENT);
> +
> +       return 1;
> +}
> +
> +static inline int p4d_bad(p4d_t p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               return !p4d_present(p4d);
> +
> +       return 0;
> +}
> +
> +static inline void p4d_clear(p4d_t *p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               set_p4d(p4d, __p4d(0));
> +}
> +
> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
> +{
> +       if (pgtable_l4_enabled)
> +               return (unsigned long)pfn_to_virt(
> +                               p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +
> +       return pud_page_vaddr((pud_t) { p4d_val(p4d) });
> +}
> +
> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +{
> +       if (pgtable_l4_enabled)
> +               return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> +
> +       return (pud_t *)p4d;
> +}
> +
>  #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index dce401eed1d3..06361db3f486 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -13,8 +13,7 @@
>
>  #ifndef __ASSEMBLY__
>
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> +#include <asm-generic/pgtable-nop4d.h>
>  #include <asm/page.h>
>  #include <asm/tlbflush.h>
>  #include <linux/mm_types.h>
> @@ -27,7 +26,7 @@
>
>  #ifdef CONFIG_MMU
>  #ifdef CONFIG_64BIT
> -#define VA_BITS                39
> +#define VA_BITS                (pgtable_l4_enabled ? 48 : 39)
>  #define PA_BITS                56
>  #else
>  #define VA_BITS                32
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 1c2fbefb8786..22617bd7477f 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -113,6 +113,8 @@ clear_bss_done:
>         call setup_vm
>  #ifdef CONFIG_MMU
>         la a0, early_pg_dir
> +       la a1, satp_mode
> +       REG_L a1, (a1)
>         call relocate
>  #endif /* CONFIG_MMU */
>
> @@ -131,24 +133,28 @@ clear_bss_done:
>  #ifdef CONFIG_MMU
>  relocate:
>  #ifdef CONFIG_RELOCATABLE
> -       /* Relocate return address */
> -       la a1, kernel_virt_addr
> -       REG_L a1, 0(a1)
> +       /*
> +        * Relocate return address but save it in case 4-level page table is
> +        * not supported.
> +        */
> +       mv s1, ra
> +       la a3, kernel_virt_addr
> +       REG_L a3, 0(a3)
>  #else
> -       li a1, PAGE_OFFSET
> +       li a3, PAGE_OFFSET
>  #endif
>         la a2, _start
> -       sub a1, a1, a2
> -       add ra, ra, a1
> +       sub a3, a3, a2
> +       add ra, ra, a3
>
>         /* Point stvec to virtual address of instruction after satp write */
>         la a2, 1f
> -       add a2, a2, a1
> +       add a2, a2, a3
>         csrw CSR_TVEC, a2
>
> +       /* First try with a 4-level page table */
>         /* Compute satp for kernel page tables, but don't load it yet */
>         srl a2, a0, PAGE_SHIFT
> -       li a1, SATP_MODE
>         or a2, a2, a1
>
>         /*
> @@ -162,6 +168,19 @@ relocate:
>         or a0, a0, a1
>         sfence.vma
>         csrw CSR_SATP, a0
> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> +       /*
> +        * If we fall through here, the HW does not support sv48: we need a
> +        * 3-level page table, so simply fold the pud level into the pgd level
> +        * and jump back to relocate with 3-level parameters.
> +        */
> +       call setup_vm_fold_pud
> +
> +       la a0, early_pg_dir
> +       li a1, SATP_MODE_39
> +       mv ra, s1
> +       tail relocate
> +#endif
>  .align 2
>  1:
>         /* Set trap vector to spin forever to help debug */
> @@ -213,6 +232,8 @@ relocate:
>  #ifdef CONFIG_MMU
>         /* Enable virtual memory and relocate to virtual address */
>         la a0, swapper_pg_dir
> +       la a1, satp_mode
> +       REG_L a1, (a1)
>         call relocate
>  #endif
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 613ec81a8979..152b423c02ea 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -9,6 +9,8 @@
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>
> +extern uint64_t satp_mode;
> +
>  /*
>   * When necessary, performs a deferred icache flush for the given MM context,
>   * on the local CPU.  RISC-V has no direct mechanism for instruction cache
> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>         cpumask_set_cpu(cpu, mm_cpumask(next));
>
>  #ifdef CONFIG_MMU
> -       csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
> +       csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
>         local_flush_tlb_all();
>  #endif
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 18bbb426848e..ad96667d2ab6 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -24,6 +24,17 @@
>
>  #include "../kernel/head.h"
>
> +#ifdef CONFIG_64BIT
> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
> +                               SATP_MODE_39 : SATP_MODE_48;
> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
> +#else
> +uint64_t satp_mode = SATP_MODE_32;
> +bool pgtable_l4_enabled = false;
> +#endif
> +EXPORT_SYMBOL(pgtable_l4_enabled);
> +EXPORT_SYMBOL(satp_mode);
> +
>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>                                                         __page_aligned_bss;
>  EXPORT_SYMBOL(empty_zero_page);
> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
>
>  #ifndef __PAGETABLE_PMD_FOLDED
>
> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>  pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>
>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>  {
> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>         if (mmu_enabled)
>                 return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -       BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +       /* Only one PMD is available for early mapping */
> +       BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
>
>         return (uintptr_t)early_pmd;
>  }
> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>         create_pte_mapping(ptep, va, pa, sz, prot);
>  }
>
> -#define pgd_next_t             pmd_t
> -#define alloc_pgd_next(__va)   alloc_pmd(__va)
> -#define get_pgd_next_virt(__pa)        get_pmd_virt(__pa)
> +static pud_t *__init get_pud_virt(phys_addr_t pa)
> +{
> +       if (mmu_enabled) {
> +               clear_fixmap(FIX_PUD);
> +               return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> +       } else {
> +               return (pud_t *)((uintptr_t)pa);
> +       }
> +}
> +
> +static phys_addr_t __init alloc_pud(uintptr_t va)
> +{
> +       if (mmu_enabled)
> +               return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +
> +       /* Only one PUD is available for early mapping */
> +       BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +
> +       return (uintptr_t)early_pud;
> +}
> +
> +static void __init create_pud_mapping(pud_t *pudp,
> +                                     uintptr_t va, phys_addr_t pa,
> +                                     phys_addr_t sz, pgprot_t prot)
> +{
> +       pmd_t *nextp;
> +       phys_addr_t next_phys;
> +       uintptr_t pud_index = pud_index(va);
> +
> +       if (sz == PUD_SIZE) {
> +               if (pud_val(pudp[pud_index]) == 0)
> +                       pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> +               return;
> +       }
> +
> +       if (pud_val(pudp[pud_index]) == 0) {
> +               next_phys = alloc_pmd(va);
> +               pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> +               nextp = get_pmd_virt(next_phys);
> +               memset(nextp, 0, PAGE_SIZE);
> +       } else {
> +               next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> +               nextp = get_pmd_virt(next_phys);
> +       }
> +
> +       create_pmd_mapping(nextp, va, pa, sz, prot);
> +}
> +
> +#define pgd_next_t             pud_t
> +#define alloc_pgd_next(__va)   alloc_pud(__va)
> +#define get_pgd_next_virt(__pa)        get_pud_virt(__pa)
>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)     \
> -       create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next                fixmap_pmd
> +       create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
> +#define fixmap_pgd_next                (pgtable_l4_enabled ?                   \
> +                       (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> +#define trampoline_pgd_next    (pgtable_l4_enabled ?                   \
> +                       (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>  #else
>  #define pgd_next_t             pte_t
>  #define alloc_pgd_next(__va)   alloc_pte(__va)
>  #define get_pgd_next_virt(__pa)        get_pte_virt(__pa)
>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)     \
>         create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next                fixmap_pte
> +#define fixmap_pgd_next                ((uintptr_t)fixmap_pte)
>  #endif
>
>  static void __init create_pgd_mapping(pgd_t *pgdp,
> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>         phys_addr_t next_phys;
>         uintptr_t pgd_index = pgd_index(va);
>
> +#ifndef __PAGETABLE_PMD_FOLDED
> +       if (!pgtable_l4_enabled) {
> +               create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
> +               return;
> +       }
> +#endif
> +
>         if (sz == PGDIR_SIZE) {
>                 if (pgd_val(pgdp[pgd_index]) == 0)
>                         pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
>         /* Setup early PGD for fixmap */
>         create_pgd_mapping(early_pg_dir, FIXADDR_START,
> -                          (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> +                          fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>
>  #ifndef __PAGETABLE_PMD_FOLDED
> -       /* Setup fixmap PMD */
> +       /* Setup fixmap PUD and PMD */
> +       if (pgtable_l4_enabled)
> +               create_pud_mapping(fixmap_pud, FIXADDR_START,
> +                          (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>         create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>                            (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> +
>         /* Setup trampoline PGD and PMD */
>         create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> -                          (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> +                          trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> +       if (pgtable_l4_enabled)
> +               create_pud_mapping(trampoline_pud, PAGE_OFFSET,
> +                          (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>         create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>                            load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>  #else
> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>         dtb_early_pa = dtb_pa;
>  }
>
> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> +/*
> + * This function is called only if the current kernel is 64bit and the HW
> + * does not support sv48.
> + */
> +asmlinkage __init void setup_vm_fold_pud(void)
> +{
> +       pgtable_l4_enabled = false;
> +       kernel_virt_addr = PAGE_OFFSET_L3;
> +       satp_mode = SATP_MODE_39;
> +
> +       /*
> +        * PTE/PMD levels do not need to be cleared as they are common between
> +        * 3- and 4-level page tables: the 30 least significant bits
> +        * (2 * 9 + 12) are common.
> +        */
> +       memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> +       memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> +
> +       setup_vm(dtb_early_pa);
> +}
> +#endif
> +
>  static void __init setup_vm_final(void)
>  {
>         uintptr_t va, map_size;
> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
>                 }
>         }
>
> -       /* Clear fixmap PTE and PMD mappings */
> +       /* Clear fixmap page table mappings */
>         clear_fixmap(FIX_PTE);
>         clear_fixmap(FIX_PMD);
> +       clear_fixmap(FIX_PUD);
>
>         /* Move to swapper page table */
> -       csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> +       csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
>         local_flush_tlb_all();
>  }
>  #else
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  2020-03-22 11:00 ` [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo Alexandre Ghiti
@ 2020-03-26  7:01   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  1 sibling, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  7:01 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:35 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Now that the mmu type is determined at runtime using the SATP
> characteristics, use the global variable pgtable_l4_enabled to output
> the mmu type of the processor through /proc/cpuinfo instead of relying
> on device-tree information.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/boot/dts/sifive/fu540-c000.dtsi |  4 ----
>  arch/riscv/kernel/cpu.c                    | 24 ++++++++++++----------
>  2 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> index 7db861053483..6138590a2229 100644
> --- a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> +++ b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> @@ -50,7 +50,6 @@
>                         i-cache-size = <32768>;
>                         i-tlb-sets = <1>;
>                         i-tlb-size = <32>;
> -                       mmu-type = "riscv,sv39";
>                         reg = <1>;
>                         riscv,isa = "rv64imafdc";
>                         tlb-split;
> @@ -74,7 +73,6 @@
>                         i-cache-size = <32768>;
>                         i-tlb-sets = <1>;
>                         i-tlb-size = <32>;
> -                       mmu-type = "riscv,sv39";
>                         reg = <2>;
>                         riscv,isa = "rv64imafdc";
>                         tlb-split;
> @@ -98,7 +96,6 @@
>                         i-cache-size = <32768>;
>                         i-tlb-sets = <1>;
>                         i-tlb-size = <32>;
> -                       mmu-type = "riscv,sv39";
>                         reg = <3>;
>                         riscv,isa = "rv64imafdc";
>                         tlb-split;
> @@ -122,7 +119,6 @@
>                         i-cache-size = <32768>;
>                         i-tlb-sets = <1>;
>                         i-tlb-size = <32>;
> -                       mmu-type = "riscv,sv39";
>                         reg = <4>;
>                         riscv,isa = "rv64imafdc";
>                         tlb-split;
> diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
> index 40a3c442ac5f..38a699b997a8 100644
> --- a/arch/riscv/kernel/cpu.c
> +++ b/arch/riscv/kernel/cpu.c
> @@ -8,6 +8,8 @@
>  #include <linux/of.h>
>  #include <asm/smp.h>
>
> +extern bool pgtable_l4_enabled;
> +
>  /*
>   * Returns the hart ID of the given device tree node, or -ENODEV if the node
>   * isn't an enabled and valid RISC-V hart node.
> @@ -54,18 +56,19 @@ static void print_isa(struct seq_file *f, const char *isa)
>         seq_puts(f, "\n");
>  }
>
> -static void print_mmu(struct seq_file *f, const char *mmu_type)
> +static void print_mmu(struct seq_file *f)
>  {
> +       char sv_type[16];
> +
>  #if defined(CONFIG_32BIT)
> -       if (strcmp(mmu_type, "riscv,sv32") != 0)
> -               return;
> +       strncpy(sv_type, "sv32", 5);
>  #elif defined(CONFIG_64BIT)
> -       if (strcmp(mmu_type, "riscv,sv39") != 0 &&
> -           strcmp(mmu_type, "riscv,sv48") != 0)
> -               return;
> +       if (pgtable_l4_enabled)
> +               strncpy(sv_type, "sv48", 5);
> +       else
> +               strncpy(sv_type, "sv39", 5);
>  #endif
> -
> -       seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
> +       seq_printf(f, "mmu\t\t: %s\n", sv_type);
>  }
>
>  static void *c_start(struct seq_file *m, loff_t *pos)
> @@ -90,14 +93,13 @@ static int c_show(struct seq_file *m, void *v)
>  {
>         unsigned long cpu_id = (unsigned long)v - 1;
>         struct device_node *node = of_get_cpu_node(cpu_id, NULL);
> -       const char *compat, *isa, *mmu;
> +       const char *compat, *isa;
>
>         seq_printf(m, "processor\t: %lu\n", cpu_id);
>         seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
>         if (!of_property_read_string(node, "riscv,isa", &isa))
>                 print_isa(m, isa);
> -       if (!of_property_read_string(node, "mmu-type", &mmu))
> -               print_mmu(m, mmu);
> +       print_mmu(m);
>         if (!of_property_read_string(node, "compatible", &compat)
>             && strcmp(compat, "riscv"))
>                 seq_printf(m, "uarch\t\t: %s\n", compat);
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup



* Re: [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree
  2020-03-22 11:00 ` [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree Alexandre Ghiti
@ 2020-03-26  7:03   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  1 sibling, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  7:03 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: devicetree, linux-kernel@vger.kernel.org List, Palmer Dabbelt,
	Zong Li, Paul Walmsley, linux-riscv, Christoph Hellwig

+Device tree mailing list

On Sun, Mar 22, 2020 at 4:36 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> This property cannot be used before virtual memory is set up,
> and the distinction between sv39 and sv48 is then made at runtime
> using the SATP csr: this property is now useless, so remove it.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  Documentation/devicetree/bindings/riscv/cpus.yaml | 13 -------------
>  1 file changed, 13 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml b/Documentation/devicetree/bindings/riscv/cpus.yaml
> index 04819ad379c2..12baabbac213 100644
> --- a/Documentation/devicetree/bindings/riscv/cpus.yaml
> +++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
> @@ -39,19 +39,6 @@ properties:
>        Identifies that the hart uses the RISC-V instruction set
>        and identifies the type of the hart.
>
> -  mmu-type:
> -    allOf:
> -      - $ref: "/schemas/types.yaml#/definitions/string"
> -      - enum:
> -          - riscv,sv32
> -          - riscv,sv39
> -          - riscv,sv48
> -    description:
> -      Identifies the MMU address translation mode used on this
> -      hart.  These values originate from the RISC-V Privileged
> -      Specification document, available from
> -      https://riscv.org/specifications/
> -
>    riscv,isa:
>      allOf:
>        - $ref: "/schemas/types.yaml#/definitions/string"
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup



* Re: [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size
  2020-03-22 11:00 ` [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size Alexandre Ghiti
@ 2020-03-26  7:05   ` Anup Patel
  2020-04-03 15:53   ` Palmer Dabbelt
  1 sibling, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-03-26  7:05 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Sun, Mar 22, 2020 at 4:37 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Define precisely the size of the user-accessible virtual address space
> for the sv32/39/48 mmu types and explain why the whole virtual address
> space is split into 2 equal chunks between kernel and user space.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/include/asm/pgtable.h | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 06361db3f486..be117a0b4ea1 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -456,8 +456,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>
>  /*
> - * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
> - * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> + * Task size is:
> + * -     0x9fc00000 (~2.5GB) for RV32.
> + * -   0x4000000000 ( 256GB) for RV64 using SV39 mmu
> + * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
> + *
> + * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
> + * Instruction Set Manual Volume II: Privileged Architecture" states that
> + * "load and store effective addresses, which are 64bits, must have bits
> + * 63–48 all equal to bit 47, or else a page-fault exception will occur."
>   */
>  #ifdef CONFIG_64BIT
>  #define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> --
> 2.20.1
>

Looks good to me.

Reviewed-by: Anup Patel <anup@brainfault.org>

Regards,
Anup



* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-03-26  7:00   ` Anup Patel
@ 2020-03-31 16:31     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-03-31 16:31 UTC (permalink / raw)
  To: Anup Patel
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

Hi Anup,

On 3/26/20 3:00 AM, Anup Patel wrote:
> On Sun, Mar 22, 2020 at 4:34 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>
>> By adding a new 4th page table level, allow a 64bit kernel to address
>> 2^48 bytes of virtual address space: in practice, that roughly
>> offers ~160TB of virtual address space to userspace and allows up to 64TB
>> of physical memory.
>>
>> By default, the kernel will try to boot with a 4-level page table. If the
>> underlying hardware does not support it, we will automatically fallback to
>> a standard 3-level page table by folding the new PUD level into PGDIR
>> level.
>>
>> Early page table preparation happens too early in the boot process to use
>> any device-tree entry, so in order to detect HW capabilities at runtime, we
>> use the SATP feature that ignores writes with an unsupported mode. The
>> mode currently used by the kernel is then made available through cpuinfo.
> 
> Overall the patch looks fine, but I don't agree with the strategy of
> detecting Sv48 mode in relocate() of head.S.
> 
> Instead of this, I suggest that we have a separate PATCH before this
> PATCH which just adds the satp_mode global variable.
> 
> This PATCH can detect and update both the satp_mode and pgtable_l4_enabled
> variables in setup_vm() (i.e. C sources) before setting up the initial page
> tables. This way we can avoid adding the setup_vm_fold_pud() function, and
> in the future we will not require any changes in head.S when a new page
> table format shows up to support 64K pages. This means MMU mode detection
> will always be restricted to setup_vm() in mm/init.c.
> 
> To achieve Sv48 mode detection in C sources, we just need one 4KB
> page for the PGD (e.g. trampoline_pg_dir). To detect Sv48, we create a flat
> identity mapping directly in the PGD assuming Sv48 is available and try
> enabling Sv48. If Sv48 is enabled successfully, then reading back the SATP
> CSR will show the mode as Sv48; otherwise the mode will not be set.
> 
> As an example, refer __detect_pgtbl_mode() function of Xvisor at
> https://github.com/avpatel/xvisor-next/blob/master/arch/riscv/cpu/generic/cpu_mmu_initial_pgtbl.c
> 
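Anup's suggested probe boils down to: write an Sv48-mode satp value pointing at an identity-mapped PGD, read the CSR back, and rely on the spec's guarantee that a write with an unsupported mode is ignored. A rough kernel-context sketch of that probe follows; function and variable names are illustrative, it assumes the riscv CSR helpers and an identity-mapped trampoline PGD as in arch/riscv/mm/init.c, and it is not standalone-runnable:

```c
/*
 * Illustrative sketch only, not actual kernel code: probe for Sv48 by
 * attempting to enable it and reading the mode field back from satp.
 */
static __init void detect_satp_mode(uintptr_t identity_pgd_pa)
{
	u64 probe = (identity_pgd_pa >> PAGE_SHIFT) | SATP_MODE_48;

	csr_write(CSR_SATP, probe);
	/* A write with an unsupported mode is silently ignored by the HW. */
	if ((csr_read(CSR_SATP) >> 60) == (SATP_MODE_48 >> 60)) {
		satp_mode = SATP_MODE_48;
		pgtable_l4_enabled = true;
	} else {
		satp_mode = SATP_MODE_39;
		pgtable_l4_enabled = false;
	}
	csr_write(CSR_SATP, 0); /* back to bare mode before setup_vm() maps */
}
```

The mode field lives in bits [63:60] of satp on RV64, so comparing that field after the write is enough to tell whether the HW accepted Sv48.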

Nice! Indeed that sounds easier this way since I would not have to call 
relocate again and then setup_vm_fold_pud. Quite a good idea :)

Thanks for your review of the other patches Anup,

Alex

> Regards,
> Anup
> 
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>   arch/riscv/Kconfig                  |   6 +-
>>   arch/riscv/include/asm/csr.h        |   3 +-
>>   arch/riscv/include/asm/fixmap.h     |   1 +
>>   arch/riscv/include/asm/page.h       |  15 +++-
>>   arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
>>   arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
>>   arch/riscv/include/asm/pgtable.h    |   5 +-
>>   arch/riscv/kernel/head.S            |  37 ++++++--
>>   arch/riscv/mm/context.c             |   4 +-
>>   arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
>>   10 files changed, 302 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index a475c78e66bc..79560e94cc7c 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -66,6 +66,7 @@ config RISCV
>>          select ARCH_HAS_GCOV_PROFILE_ALL
>>          select HAVE_COPY_THREAD_TLS
>>          select HAVE_ARCH_KASAN if MMU && 64BIT
>> +       select RELOCATABLE if 64BIT
>>
>>   config ARCH_MMAP_RND_BITS_MIN
>>          default 18 if 64BIT
>> @@ -104,7 +105,7 @@ config PAGE_OFFSET
>>          default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>>          default 0x80000000 if 64BIT && !MMU
>>          default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
>> -       default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>> +       default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
>>
>>   config ARCH_FLATMEM_ENABLE
>>          def_bool y
>> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
>>   config FIX_EARLYCON_MEM
>>          def_bool MMU
>>
>> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
>> +# on a 3-level page table when sv48 is not supported.
>>   config PGTABLE_LEVELS
>>          int
>> +       default 4 if 64BIT && RELOCATABLE
>>          default 3 if 64BIT
>>          default 2
>>
>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
>> index 435b65532e29..3828d55af85e 100644
>> --- a/arch/riscv/include/asm/csr.h
>> +++ b/arch/riscv/include/asm/csr.h
>> @@ -40,11 +40,10 @@
>>   #ifndef CONFIG_64BIT
>>   #define SATP_PPN       _AC(0x003FFFFF, UL)
>>   #define SATP_MODE_32   _AC(0x80000000, UL)
>> -#define SATP_MODE      SATP_MODE_32
>>   #else
>>   #define SATP_PPN       _AC(0x00000FFFFFFFFFFF, UL)
>>   #define SATP_MODE_39   _AC(0x8000000000000000, UL)
>> -#define SATP_MODE      SATP_MODE_39
>> +#define SATP_MODE_48   _AC(0x9000000000000000, UL)
>>   #endif
>>
>>   /* Exception cause high bit - is an interrupt if set */
>> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
>> index 42d2c42f3cc9..26e7799c5675 100644
>> --- a/arch/riscv/include/asm/fixmap.h
>> +++ b/arch/riscv/include/asm/fixmap.h
>> @@ -27,6 +27,7 @@ enum fixed_addresses {
>>          FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
>>          FIX_PTE,
>>          FIX_PMD,
>> +       FIX_PUD,
>>          FIX_EARLYCON_MEM_BASE,
>>          __end_of_fixed_addresses
>>   };
>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index 691f2f9ded2f..f1a26a0690ef 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -32,11 +32,19 @@
>>    * physical memory (aligned on a page boundary).
>>    */
>>   #ifdef CONFIG_RELOCATABLE
>> -extern unsigned long kernel_virt_addr;
>>   #define PAGE_OFFSET            kernel_virt_addr
>> +
>> +#ifdef CONFIG_64BIT
>> +/*
>> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
>> + * define the PAGE_OFFSET value for SV39.
>> + */
>> +#define PAGE_OFFSET_L3         0xffffffe000000000
>> +#define PAGE_OFFSET_L4         _AC(CONFIG_PAGE_OFFSET, UL)
>> +#endif /* CONFIG_64BIT */
>>   #else
>>   #define PAGE_OFFSET            _AC(CONFIG_PAGE_OFFSET, UL)
>> -#endif
>> +#endif /* CONFIG_RELOCATABLE */
>>
>>   #define KERN_VIRT_SIZE         -PAGE_OFFSET
>>
>> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
>>
>>   extern unsigned long max_low_pfn;
>>   extern unsigned long min_low_pfn;
>> +#ifdef CONFIG_RELOCATABLE
>> +extern unsigned long kernel_virt_addr;
>> +#endif
>>
>>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
>>   #define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
>> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
>> index 3f601ee8233f..540eaa5a8658 100644
>> --- a/arch/riscv/include/asm/pgalloc.h
>> +++ b/arch/riscv/include/asm/pgalloc.h
>> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>>
>>          set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>   }
>> +
>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
>> +{
>> +       if (pgtable_l4_enabled) {
>> +               unsigned long pfn = virt_to_pfn(pud);
>> +
>> +               set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +       }
>> +}
>> +
>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
>> +                                    pud_t *pud)
>> +{
>> +       if (pgtable_l4_enabled) {
>> +               unsigned long pfn = virt_to_pfn(pud);
>> +
>> +               set_p4d_safe(p4d,
>> +                            __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +       }
>> +}
>> +
>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return (pud_t *)__get_free_page(
>> +                               GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
>> +       return NULL;
>> +}
>> +
>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               free_page((unsigned long)pud);
>> +}
>> +
>> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
>>   #endif /* __PAGETABLE_PMD_FOLDED */
>>
>>   #define pmd_pgtable(pmd)       pmd_page(pmd)
>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
>> index b15f70a1fdfa..cc4ffbe778f3 100644
>> --- a/arch/riscv/include/asm/pgtable-64.h
>> +++ b/arch/riscv/include/asm/pgtable-64.h
>> @@ -8,16 +8,32 @@
>>
>>   #include <linux/const.h>
>>
>> -#define PGDIR_SHIFT     30
>> +extern bool pgtable_l4_enabled;
>> +
>> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
>>   /* Size of region mapped by a page global directory */
>>   #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>>   #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>>
>> +/* pud is folded into pgd in case of 3-level page table */
>> +#define PUD_SHIFT      30
>> +#define PUD_SIZE       (_AC(1, UL) << PUD_SHIFT)
>> +#define PUD_MASK       (~(PUD_SIZE - 1))
>> +
>>   #define PMD_SHIFT       21
>>   /* Size of region mapped by a page middle directory */
>>   #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>>   #define PMD_MASK        (~(PMD_SIZE - 1))
>>
>> +/* Page Upper Directory entry */
>> +typedef struct {
>> +       unsigned long pud;
>> +} pud_t;
>> +
>> +#define pud_val(x)      ((x).pud)
>> +#define __pud(x)        ((pud_t) { (x) })
>> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
>> +
>>   /* Page Middle Directory entry */
>>   typedef struct {
>>          unsigned long pmd;
>> @@ -25,7 +41,6 @@ typedef struct {
>>
>>   #define pmd_val(x)      ((x).pmd)
>>   #define __pmd(x)        ((pmd_t) { (x) })
>> -
>>   #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
>>
>>   static inline int pud_present(pud_t pud)
>> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
>>          set_pud(pudp, __pud(0));
>>   }
>>
>> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
>> +{
>> +       return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
>> +}
>> +
>> +static inline unsigned long _pud_pfn(pud_t pud)
>> +{
>> +       return pud_val(pud) >> _PAGE_PFN_SHIFT;
>> +}
>> +
>>   static inline unsigned long pud_page_vaddr(pud_t pud)
>>   {
>>          return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
>> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
>>          return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>>   }
>>
>> +#define mm_pud_folded  mm_pud_folded
>> +static inline bool mm_pud_folded(struct mm_struct *mm)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return false;
>> +
>> +       return true;
>> +}
>> +
>>   #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>>
>>   static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
>> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>>   #define pmd_ERROR(e) \
>>          pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>>
>> +#define pud_ERROR(e)   \
>> +       pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
>> +
>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               *p4dp = p4d;
>> +       else
>> +               set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
>> +}
>> +
>> +static inline int p4d_none(p4d_t p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return (p4d_val(p4d) == 0);
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int p4d_present(p4d_t p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return (p4d_val(p4d) & _PAGE_PRESENT);
>> +
>> +       return 1;
>> +}
>> +
>> +static inline int p4d_bad(p4d_t p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return !p4d_present(p4d);
>> +
>> +       return 0;
>> +}
>> +
>> +static inline void p4d_clear(p4d_t *p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               set_p4d(p4d, __p4d(0));
>> +}
>> +
>> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return (unsigned long)pfn_to_virt(
>> +                               p4d_val(p4d) >> _PAGE_PFN_SHIFT);
>> +
>> +       return pud_page_vaddr((pud_t) { p4d_val(p4d) });
>> +}
>> +
>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>> +
>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
>> +{
>> +       if (pgtable_l4_enabled)
>> +               return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
>> +
>> +       return (pud_t *)p4d;
>> +}
>> +
>>   #endif /* _ASM_RISCV_PGTABLE_64_H */
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index dce401eed1d3..06361db3f486 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -13,8 +13,7 @@
>>
>>   #ifndef __ASSEMBLY__
>>
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> +#include <asm-generic/pgtable-nop4d.h>
>>   #include <asm/page.h>
>>   #include <asm/tlbflush.h>
>>   #include <linux/mm_types.h>
>> @@ -27,7 +26,7 @@
>>
>>   #ifdef CONFIG_MMU
>>   #ifdef CONFIG_64BIT
>> -#define VA_BITS                39
>> +#define VA_BITS                (pgtable_l4_enabled ? 48 : 39)
>>   #define PA_BITS                56
>>   #else
>>   #define VA_BITS                32
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 1c2fbefb8786..22617bd7477f 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -113,6 +113,8 @@ clear_bss_done:
>>          call setup_vm
>>   #ifdef CONFIG_MMU
>>          la a0, early_pg_dir
>> +       la a1, satp_mode
>> +       REG_L a1, (a1)
>>          call relocate
>>   #endif /* CONFIG_MMU */
>>
>> @@ -131,24 +133,28 @@ clear_bss_done:
>>   #ifdef CONFIG_MMU
>>   relocate:
>>   #ifdef CONFIG_RELOCATABLE
>> -       /* Relocate return address */
>> -       la a1, kernel_virt_addr
>> -       REG_L a1, 0(a1)
>> +       /*
>> +        * Relocate return address but save it in case 4-level page table is
>> +        * not supported.
>> +        */
>> +       mv s1, ra
>> +       la a3, kernel_virt_addr
>> +       REG_L a3, 0(a3)
>>   #else
>> -       li a1, PAGE_OFFSET
>> +       li a3, PAGE_OFFSET
>>   #endif
>>          la a2, _start
>> -       sub a1, a1, a2
>> -       add ra, ra, a1
>> +       sub a3, a3, a2
>> +       add ra, ra, a3
>>
>>          /* Point stvec to virtual address of intruction after satp write */
>>          la a2, 1f
>> -       add a2, a2, a1
>> +       add a2, a2, a3
>>          csrw CSR_TVEC, a2
>>
>> +       /* First try with a 4-level page table */
>>          /* Compute satp for kernel page tables, but don't load it yet */
>>          srl a2, a0, PAGE_SHIFT
>> -       li a1, SATP_MODE
>>          or a2, a2, a1
>>
>>          /*
>> @@ -162,6 +168,19 @@ relocate:
>>          or a0, a0, a1
>>          sfence.vma
>>          csrw CSR_SATP, a0
>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>> +       /*
>> +        * If we fall through here, that means the HW does not support SV48.
>> +        * We need a 3-level page table then simply fold pud into pgd level
>> +        * and finally jump back to relocate with 3-level parameters.
>> +        */
>> +       call setup_vm_fold_pud
>> +
>> +       la a0, early_pg_dir
>> +       li a1, SATP_MODE_39
>> +       mv ra, s1
>> +       tail relocate
>> +#endif
>>   .align 2
>>   1:
>>          /* Set trap vector to spin forever to help debug */
>> @@ -213,6 +232,8 @@ relocate:
>>   #ifdef CONFIG_MMU
>>          /* Enable virtual memory and relocate to virtual address */
>>          la a0, swapper_pg_dir
>> +       la a1, satp_mode
>> +       REG_L a1, (a1)
>>          call relocate
>>   #endif
>>
>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>> index 613ec81a8979..152b423c02ea 100644
>> --- a/arch/riscv/mm/context.c
>> +++ b/arch/riscv/mm/context.c
>> @@ -9,6 +9,8 @@
>>   #include <asm/cacheflush.h>
>>   #include <asm/mmu_context.h>
>>
>> +extern uint64_t satp_mode;
>> +
>>   /*
>>    * When necessary, performs a deferred icache flush for the given MM context,
>>    * on the local CPU.  RISC-V has no direct mechanism for instruction cache
>> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>          cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>>   #ifdef CONFIG_MMU
>> -       csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
>> +       csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
>>          local_flush_tlb_all();
>>   #endif
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 18bbb426848e..ad96667d2ab6 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -24,6 +24,17 @@
>>
>>   #include "../kernel/head.h"
>>
>> +#ifdef CONFIG_64BIT
>> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
>> +                               SATP_MODE_39 : SATP_MODE_48;
>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
>> +#else
>> +uint64_t satp_mode = SATP_MODE_32;
>> +bool pgtable_l4_enabled = false;
>> +#endif
>> +EXPORT_SYMBOL(pgtable_l4_enabled);
>> +EXPORT_SYMBOL(satp_mode);
>> +
>>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>                                                          __page_aligned_bss;
>>   EXPORT_SYMBOL(empty_zero_page);
>> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
>>
>>   #ifndef __PAGETABLE_PMD_FOLDED
>>
>> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>>   pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>>   pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>>   pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>>
>>   static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>>   {
>> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>          if (mmu_enabled)
>>                  return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> -       BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>> +       /* Only one PMD is available for early mapping */
>> +       BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
>>
>>          return (uintptr_t)early_pmd;
>>   }
>> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>>          create_pte_mapping(ptep, va, pa, sz, prot);
>>   }
>>
>> -#define pgd_next_t             pmd_t
>> -#define alloc_pgd_next(__va)   alloc_pmd(__va)
>> -#define get_pgd_next_virt(__pa)        get_pmd_virt(__pa)
>> +static pud_t *__init get_pud_virt(phys_addr_t pa)
>> +{
>> +       if (mmu_enabled) {
>> +               clear_fixmap(FIX_PUD);
>> +               return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>> +       } else {
>> +               return (pud_t *)((uintptr_t)pa);
>> +       }
>> +}
>> +
>> +static phys_addr_t __init alloc_pud(uintptr_t va)
>> +{
>> +       if (mmu_enabled)
>> +               return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>> +
>> +       /* Only one PUD is available for early mapping */
>> +       BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>> +
>> +       return (uintptr_t)early_pud;
>> +}
>> +
>> +static void __init create_pud_mapping(pud_t *pudp,
>> +                                     uintptr_t va, phys_addr_t pa,
>> +                                     phys_addr_t sz, pgprot_t prot)
>> +{
>> +       pmd_t *nextp;
>> +       phys_addr_t next_phys;
>> +       uintptr_t pud_index = pud_index(va);
>> +
>> +       if (sz == PUD_SIZE) {
>> +               if (pud_val(pudp[pud_index]) == 0)
>> +                       pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
>> +               return;
>> +       }
>> +
>> +       if (pud_val(pudp[pud_index]) == 0) {
>> +               next_phys = alloc_pmd(va);
>> +               pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
>> +               nextp = get_pmd_virt(next_phys);
>> +               memset(nextp, 0, PAGE_SIZE);
>> +       } else {
>> +               next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
>> +               nextp = get_pmd_virt(next_phys);
>> +       }
>> +
>> +       create_pmd_mapping(nextp, va, pa, sz, prot);
>> +}
>> +
>> +#define pgd_next_t             pud_t
>> +#define alloc_pgd_next(__va)   alloc_pud(__va)
>> +#define get_pgd_next_virt(__pa)        get_pud_virt(__pa)
>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)     \
>> -       create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next                fixmap_pmd
>> +       create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
>> +#define fixmap_pgd_next                (pgtable_l4_enabled ?                   \
>> +                       (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
>> +#define trampoline_pgd_next    (pgtable_l4_enabled ?                   \
>> +                       (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>>   #else
>>   #define pgd_next_t             pte_t
>>   #define alloc_pgd_next(__va)   alloc_pte(__va)
>>   #define get_pgd_next_virt(__pa)        get_pte_virt(__pa)
>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)     \
>>          create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next                fixmap_pte
>> +#define fixmap_pgd_next                ((uintptr_t)fixmap_pte)
>>   #endif
>>
>>   static void __init create_pgd_mapping(pgd_t *pgdp,
>> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>>          phys_addr_t next_phys;
>>          uintptr_t pgd_index = pgd_index(va);
>>
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +       if (!pgtable_l4_enabled) {
>> +               create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
>> +               return;
>> +       }
>> +#endif
>> +
>>          if (sz == PGDIR_SIZE) {
>>                  if (pgd_val(pgdp[pgd_index]) == 0)
>>                          pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
>> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>
>>          /* Setup early PGD for fixmap */
>>          create_pgd_mapping(early_pg_dir, FIXADDR_START,
>> -                          (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +                          fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>
>>   #ifndef __PAGETABLE_PMD_FOLDED
>> -       /* Setup fixmap PMD */
>> +       /* Setup fixmap PUD and PMD */
>> +       if (pgtable_l4_enabled)
>> +               create_pud_mapping(fixmap_pud, FIXADDR_START,
>> +                          (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>>          create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>                             (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>> +
>>          /* Setup trampoline PGD and PMD */
>>          create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> -                          (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> +                          trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +       if (pgtable_l4_enabled)
>> +               create_pud_mapping(trampoline_pud, PAGE_OFFSET,
>> +                          (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>>          create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>                             load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>   #else
>> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>          dtb_early_pa = dtb_pa;
>>   }
>>
>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>> +/*
>> + * This function is called only if the current kernel is 64bit and the HW
>> + * does not support sv48.
>> + */
>> +asmlinkage __init void setup_vm_fold_pud(void)
>> +{
>> +       pgtable_l4_enabled = false;
>> +       kernel_virt_addr = PAGE_OFFSET_L3;
>> +       satp_mode = SATP_MODE_39;
>> +
>> +       /*
>> +        * PTE/PMD levels do not need to be cleared as they are common between
>> +        * 3- and 4-level page tables: the 30 least significant bits
>> +        * (2 * 9 + 12) are common.
>> +        */
>> +       memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>> +       memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>> +
>> +       setup_vm(dtb_early_pa);
>> +}
>> +#endif
>> +
>>   static void __init setup_vm_final(void)
>>   {
>>          uintptr_t va, map_size;
>> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
>>                  }
>>          }
>>
>> -       /* Clear fixmap PTE and PMD mappings */
>> +       /* Clear fixmap page table mappings */
>>          clear_fixmap(FIX_PTE);
>>          clear_fixmap(FIX_PMD);
>> +       clear_fixmap(FIX_PUD);
>>
>>          /* Move to swapper page table */
>> -       csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
>> +       csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
>>          local_flush_tlb_all();
>>   }
>>   #else
>> --
>> 2.20.1
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/7] Introduce sv48 support
  2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
                   ` (6 preceding siblings ...)
  2020-03-22 11:00 ` [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size Alexandre Ghiti
@ 2020-03-31 19:53 ` Palmer Dabbelt
  7 siblings, 0 replies; 35+ messages in thread
From: Palmer Dabbelt @ 2020-03-31 19:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:21 PDT (-0700), alex@ghiti.fr wrote:
> This patchset implements sv48 support at runtime. The kernel will try to
> boot with a 4-level page table and will fall back to 3-level if the HW does
> not support it.
>
> The biggest advantage is that we only have one kernel for 64bit, which
> is way easier to maintain.

Thanks, that's great!  This is something we've been missing for a long time
now.

> Folding the 4th level into a 3-level page table has almost no cost at
> runtime.
>
> At the moment, there is no way to enforce 3-level if the HW supports
> 4-level page table: early parameters are parsed after the choice must be
> made.

This is different than how I'd been considering doing it -- specifically, my
worry was that 4-level paging would have a meaningful performance hit and
therefore we'd want to allow users to select 3-level paging somehow.  I'd been
thinking of having a Kconfig option, so the options would be "3-level only" or
"3 or 4 level".  That came with a bunch of drawbacks, so I'd be much happier to
have a single kernel.

Where did you get your performance numbers from?  Apologies in advance if
there's more info in the patches, I'll look at those now...

> It is based on my relocatable patchset v3 that I have not posted yet,
> you can try the sv48 support by using the branch
> int/alex/riscv_sv48_runtime_v1 here:
>
> https://github.com/AlexGhiti/riscv-linux
>
> Any feedback appreciated,
>
> Thanks,
>
> Alexandre Ghiti (7):
>   riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
>   riscv: Allow to dynamically define VA_BITS
>   riscv: Simplify MAXPHYSMEM config
>   riscv: Implement sv48 support
>   riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
>   dt-bindings: riscv: Remove "riscv,svXX" property from device-tree
>   riscv: Explicit comment about user virtual address space size
>
>  .../devicetree/bindings/riscv/cpus.yaml       |  13 --
>  arch/riscv/Kconfig                            |  34 ++---
>  arch/riscv/boot/dts/sifive/fu540-c000.dtsi    |   4 -
>  arch/riscv/include/asm/csr.h                  |   3 +-
>  arch/riscv/include/asm/fixmap.h               |   1 +
>  arch/riscv/include/asm/page.h                 |  15 +-
>  arch/riscv/include/asm/pgalloc.h              |  36 +++++
>  arch/riscv/include/asm/pgtable-64.h           |  98 +++++++++++-
>  arch/riscv/include/asm/pgtable.h              |  24 ++-
>  arch/riscv/include/asm/sparsemem.h            |   2 +-
>  arch/riscv/kernel/cpu.c                       |  24 +--
>  arch/riscv/kernel/head.S                      |  37 ++++-
>  arch/riscv/mm/context.c                       |   4 +-
>  arch/riscv/mm/init.c                          | 142 +++++++++++++++---
>  14 files changed, 341 insertions(+), 96 deletions(-)



* Re: [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
  2020-03-26  6:10   ` Anup Patel
@ 2020-04-03 15:17   ` Palmer Dabbelt
  2020-04-07  5:12     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:17 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:22 PDT (-0700), alex@ghiti.fr wrote:
> There is no need to compare MAX_EARLY_MAPPING_SIZE with PGDIR_SIZE at
> compile time since MAX_EARLY_MAPPING_SIZE is set to 128MB, which is less
> than PGDIR_SIZE (1GB): that allows us to simplify the early_pmd
> definition.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/mm/init.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 238bd0033c3f..18bbb426848e 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -247,13 +247,7 @@ static void __init create_pte_mapping(pte_t *ptep,
>
>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> -
> -#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
> -#define NUM_EARLY_PMDS		1UL
> -#else
> -#define NUM_EARLY_PMDS		(1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
> -#endif
> -pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
> +pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>
>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>  {
> @@ -267,14 +261,12 @@ static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>
>  static phys_addr_t __init alloc_pmd(uintptr_t va)
>  {
> -	uintptr_t pmd_num;
> -
>  	if (mmu_enabled)
>  		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -	pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> -	BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> -	return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> +	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +
> +	return (uintptr_t)early_pmd;
>  }
>
>  static void __init create_pmd_mapping(pmd_t *pmdp,

My specific worry here was that allyesconfig kernels are quite large, and that
dropping the code to handle large kernels would make it even harder to boot
them.  That said, I can't actually get one to boot so I'm happy to just push
that off until later and drop the code we can't practically use.

Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>

Thanks!



* Re: [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS
  2020-03-22 11:00 ` [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS Alexandre Ghiti
  2020-03-26  6:12   ` Anup Patel
@ 2020-04-03 15:17   ` Palmer Dabbelt
  2020-04-07  5:12     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:17 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:23 PDT (-0700), alex@ghiti.fr wrote:
> With 4-level page table folding at runtime, we don't know the size of the
> virtual address space at compile time, so we must set VA_BITS dynamically
> in order for sparsemem to reserve the right amount of memory for struct pages.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig                 | 10 ----------
>  arch/riscv/include/asm/pgtable.h   | 10 +++++++++-
>  arch/riscv/include/asm/sparsemem.h |  2 +-
>  3 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index f5f3d474504d..8e4b1cbcf2c2 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -99,16 +99,6 @@ config ZONE_DMA32
>  	bool
>  	default y if 64BIT
>
> -config VA_BITS
> -	int
> -	default 32 if 32BIT
> -	default 39 if 64BIT
> -
> -config PA_BITS
> -	int
> -	default 34 if 32BIT
> -	default 56 if 64BIT
> -
>  config PAGE_OFFSET
>  	hex
>  	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 185ffe3723ec..dce401eed1d3 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -26,6 +26,14 @@
>  #endif /* CONFIG_64BIT */
>
>  #ifdef CONFIG_MMU
> +#ifdef CONFIG_64BIT
> +#define VA_BITS		39
> +#define PA_BITS		56
> +#else
> +#define VA_BITS		32
> +#define PA_BITS		34

We've moved to 32-bit physical addresses on rv32 in Linux.  The mismatch was
causing too many issues in generic code.

> +#endif
> +
>  /* Number of entries in the page global directory */
>  #define PTRS_PER_PGD    (PAGE_SIZE / sizeof(pgd_t))
>  /* Number of entries in the page table */
> @@ -108,7 +116,7 @@ extern pgd_t swapper_pg_dir[];
>   * position vmemmap directly below the VMALLOC region.
>   */
>  #define VMEMMAP_SHIFT \
> -	(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
> +	(VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>  #define VMEMMAP_SIZE	BIT(VMEMMAP_SHIFT)
>  #define VMEMMAP_END	(VMALLOC_START - 1)
>  #define VMEMMAP_START	(VMALLOC_START - VMEMMAP_SIZE)
> diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
> index 45a7018a8118..f08d72155bc8 100644
> --- a/arch/riscv/include/asm/sparsemem.h
> +++ b/arch/riscv/include/asm/sparsemem.h
> @@ -4,7 +4,7 @@
>  #define _ASM_RISCV_SPARSEMEM_H
>
>  #ifdef CONFIG_SPARSEMEM
> -#define MAX_PHYSMEM_BITS	CONFIG_PA_BITS
> +#define MAX_PHYSMEM_BITS	PA_BITS
>  #define SECTION_SIZE_BITS	27
>  #endif /* CONFIG_SPARSEMEM */

Aside from the 32-bit PA issue:

Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>



* Re: [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config
  2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
  2020-03-26  6:22   ` Anup Patel
  2020-03-26  6:34   ` Anup Patel
@ 2020-04-03 15:53   ` Palmer Dabbelt
  2020-04-07  5:13     ` Alex Ghiti
  2 siblings, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:24 PDT (-0700), alex@ghiti.fr wrote:
> Either the user specifies a maximum physical memory size of 2GB, or the
> user lives with the system constraint, which is 128GB on 64BIT for now.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 8e4b1cbcf2c2..a475c78e66bc 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -104,7 +104,7 @@ config PAGE_OFFSET
>  	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>  	default 0x80000000 if 64BIT && !MMU
>  	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> -	default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
> +	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>
>  config ARCH_FLATMEM_ENABLE
>  	def_bool y
> @@ -216,19 +216,11 @@ config MODULE_SECTIONS
>  	bool
>  	select HAVE_MOD_ARCH_SPECIFIC
>
> -choice
> -	prompt "Maximum Physical Memory"
> -	default MAXPHYSMEM_2GB if 32BIT
> -	default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
> -	default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
> -
> -	config MAXPHYSMEM_2GB
> -		bool "2GiB"
> -	config MAXPHYSMEM_128GB
> -		depends on 64BIT && CMODEL_MEDANY
> -		bool "128GiB"
> -endchoice
> -
> +config MAXPHYSMEM_2GB
> +	bool "Maximum Physical Memory 2GiB"
> +	default y if 32BIT
> +	default y if 64BIT && CMODEL_MEDLOW
> +	default n
>
>  config SMP
>  	bool "Symmetric Multi-Processing"

I'm not sure this actually helps with anything, but if it's all going away then it's
fine.  Originally the 2G/128G stuff was there to allow for larger VA spaces in
the future.



* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-03-22 11:00 ` [RFC PATCH 4/7] riscv: Implement sv48 support Alexandre Ghiti
  2020-03-26  7:00   ` Anup Patel
@ 2020-04-03 15:53   ` Palmer Dabbelt
  2020-04-07  5:14     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@ghiti.fr wrote:
> By adding a new 4th level of page table, allow a 64bit kernel to address
> 2^48 bytes of virtual address space: in practice, that offers roughly
> 160TB of virtual address space to userspace and allows up to 64TB
> of physical memory.
>
> By default, the kernel will try to boot with a 4-level page table. If the
> underlying hardware does not support it, we will automatically fall back to
> a standard 3-level page table by folding the new PUD level into the PGDIR
> level.
>
> Early page table preparation happens too early in the boot process to use
> any device-tree entry, so in order to detect HW capabilities at runtime, we
> rely on the fact that SATP ignores writes with an unsupported mode. The
> current mode used by the kernel is then made available through cpuinfo.

Ya, I think that's the right way to go about this.  There's no reason to
rely on duplicate DT mechanisms for things the ISA defines for us.

>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/Kconfig                  |   6 +-
>  arch/riscv/include/asm/csr.h        |   3 +-
>  arch/riscv/include/asm/fixmap.h     |   1 +
>  arch/riscv/include/asm/page.h       |  15 +++-
>  arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
>  arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
>  arch/riscv/include/asm/pgtable.h    |   5 +-
>  arch/riscv/kernel/head.S            |  37 ++++++--
>  arch/riscv/mm/context.c             |   4 +-
>  arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
>  10 files changed, 302 insertions(+), 31 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index a475c78e66bc..79560e94cc7c 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -66,6 +66,7 @@ config RISCV
>  	select ARCH_HAS_GCOV_PROFILE_ALL
>  	select HAVE_COPY_THREAD_TLS
>  	select HAVE_ARCH_KASAN if MMU && 64BIT
> +	select RELOCATABLE if 64BIT
>
>  config ARCH_MMAP_RND_BITS_MIN
>  	default 18 if 64BIT
> @@ -104,7 +105,7 @@ config PAGE_OFFSET
>  	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>  	default 0x80000000 if 64BIT && !MMU
>  	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> -	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
> +	default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
>
>  config ARCH_FLATMEM_ENABLE
>  	def_bool y
> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
>  config FIX_EARLYCON_MEM
>  	def_bool MMU
>
> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
> +# on a 3-level page table when sv48 is not supported.
>  config PGTABLE_LEVELS
>  	int
> +	default 4 if 64BIT && RELOCATABLE
>  	default 3 if 64BIT
>  	default 2

I assume this means you're relying on relocation to move the kernel around
independently of PAGE_OFFSET in order to fold in the missing page table level?
That seems reasonable, but it does impose a performance penalty as relocatable
kernels necessitate slower generated code.  Additionally, there will likely be
a performance penalty due to the extra memory access on TLB misses that is
unnecessary for workloads that don't necessitate the longer VA width on
machines that support it.

I think the best bet here would be to have a Kconfig option for the number of
page table levels (which could be MAXPHYSMEM or a second partially free
parameter) and then another boolean argument along the lines of "also support
machines with smaller VA widths".  It seems best to turn on the largest VA
width and support for folding by default, as I assume that's what distros would
do.

I didn't really look closely at the rest of this, but it generally smells OK.
The diff will need to be somewhat different for the next version, anyway :)

Thanks for doing this!

> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 435b65532e29..3828d55af85e 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -40,11 +40,10 @@
>  #ifndef CONFIG_64BIT
>  #define SATP_PPN	_AC(0x003FFFFF, UL)
>  #define SATP_MODE_32	_AC(0x80000000, UL)
> -#define SATP_MODE	SATP_MODE_32
>  #else
>  #define SATP_PPN	_AC(0x00000FFFFFFFFFFF, UL)
>  #define SATP_MODE_39	_AC(0x8000000000000000, UL)
> -#define SATP_MODE	SATP_MODE_39
> +#define SATP_MODE_48	_AC(0x9000000000000000, UL)
>  #endif
>
>  /* Exception cause high bit - is an interrupt if set */
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 42d2c42f3cc9..26e7799c5675 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -27,6 +27,7 @@ enum fixed_addresses {
>  	FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
>  	FIX_PTE,
>  	FIX_PMD,
> +	FIX_PUD,
>  	FIX_EARLYCON_MEM_BASE,
>  	__end_of_fixed_addresses
>  };
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 691f2f9ded2f..f1a26a0690ef 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -32,11 +32,19 @@
>   * physical memory (aligned on a page boundary).
>   */
>  #ifdef CONFIG_RELOCATABLE
> -extern unsigned long kernel_virt_addr;
>  #define PAGE_OFFSET		kernel_virt_addr
> +
> +#ifdef CONFIG_64BIT
> +/*
> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> + * define the PAGE_OFFSET value for SV39.
> + */
> +#define PAGE_OFFSET_L3		0xffffffe000000000
> +#define PAGE_OFFSET_L4		_AC(CONFIG_PAGE_OFFSET, UL)
> +#endif /* CONFIG_64BIT */
>  #else
>  #define PAGE_OFFSET		_AC(CONFIG_PAGE_OFFSET, UL)
> -#endif
> +#endif /* CONFIG_RELOCATABLE */
>
>  #define KERN_VIRT_SIZE		-PAGE_OFFSET
>
> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
>
>  extern unsigned long max_low_pfn;
>  extern unsigned long min_low_pfn;
> +#ifdef CONFIG_RELOCATABLE
> +extern unsigned long kernel_virt_addr;
> +#endif
>
>  #define __pa_to_va_nodebug(x)	((void *)((unsigned long) (x) + va_pa_offset))
>  #define __va_to_pa_nodebug(x)	((unsigned long)(x) - va_pa_offset)
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 3f601ee8233f..540eaa5a8658 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>
>  	set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>  }
> +
> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> +{
> +	if (pgtable_l4_enabled) {
> +		unsigned long pfn = virt_to_pfn(pud);
> +
> +		set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> +	}
> +}
> +
> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> +				     pud_t *pud)
> +{
> +	if (pgtable_l4_enabled) {
> +		unsigned long pfn = virt_to_pfn(pud);
> +
> +		set_p4d_safe(p4d,
> +			     __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> +	}
> +}
> +
> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> +{
> +	if (pgtable_l4_enabled)
> +		return (pud_t *)__get_free_page(
> +				GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
> +	return NULL;
> +}
> +
> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> +{
> +	if (pgtable_l4_enabled)
> +		free_page((unsigned long)pud);
> +}
> +
> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
>  #endif /* __PAGETABLE_PMD_FOLDED */
>
>  #define pmd_pgtable(pmd)	pmd_page(pmd)
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index b15f70a1fdfa..cc4ffbe778f3 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -8,16 +8,32 @@
>
>  #include <linux/const.h>
>
> -#define PGDIR_SHIFT     30
> +extern bool pgtable_l4_enabled;
> +
> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
>  /* Size of region mapped by a page global directory */
>  #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>  #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>
> +/* pud is folded into pgd in case of 3-level page table */
> +#define PUD_SHIFT	30
> +#define PUD_SIZE	(_AC(1, UL) << PUD_SHIFT)
> +#define PUD_MASK	(~(PUD_SIZE - 1))
> +
>  #define PMD_SHIFT       21
>  /* Size of region mapped by a page middle directory */
>  #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>  #define PMD_MASK        (~(PMD_SIZE - 1))
>
> +/* Page Upper Directory entry */
> +typedef struct {
> +	unsigned long pud;
> +} pud_t;
> +
> +#define pud_val(x)      ((x).pud)
> +#define __pud(x)        ((pud_t) { (x) })
> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
> +
>  /* Page Middle Directory entry */
>  typedef struct {
>  	unsigned long pmd;
> @@ -25,7 +41,6 @@ typedef struct {
>
>  #define pmd_val(x)      ((x).pmd)
>  #define __pmd(x)        ((pmd_t) { (x) })
> -
>  #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
>
>  static inline int pud_present(pud_t pud)
> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
>  	set_pud(pudp, __pud(0));
>  }
>
> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> +{
> +	return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> +}
> +
> +static inline unsigned long _pud_pfn(pud_t pud)
> +{
> +	return pud_val(pud) >> _PAGE_PFN_SHIFT;
> +}
> +
>  static inline unsigned long pud_page_vaddr(pud_t pud)
>  {
>  	return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
>  	return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>  }
>
> +#define mm_pud_folded	mm_pud_folded
> +static inline bool mm_pud_folded(struct mm_struct *mm)
> +{
> +	if (pgtable_l4_enabled)
> +		return false;
> +
> +	return true;
> +}
> +
>  #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>
>  static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>  #define pmd_ERROR(e) \
>  	pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> +#define pud_ERROR(e)	\
> +	pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> +
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		*p4dp = p4d;
> +	else
> +		set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> +}
> +
> +static inline int p4d_none(p4d_t p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		return (p4d_val(p4d) == 0);
> +
> +	return 0;
> +}
> +
> +static inline int p4d_present(p4d_t p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		return (p4d_val(p4d) & _PAGE_PRESENT);
> +
> +	return 1;
> +}
> +
> +static inline int p4d_bad(p4d_t p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		return !p4d_present(p4d);
> +
> +	return 0;
> +}
> +
> +static inline void p4d_clear(p4d_t *p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		set_p4d(p4d, __p4d(0));
> +}
> +
> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
> +{
> +	if (pgtable_l4_enabled)
> +		return (unsigned long)pfn_to_virt(
> +				p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +
> +	return pud_page_vaddr((pud_t) { p4d_val(p4d) });
> +}
> +
> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +{
> +	if (pgtable_l4_enabled)
> +		return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> +
> +	return (pud_t *)p4d;
> +}
> +
>  #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index dce401eed1d3..06361db3f486 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -13,8 +13,7 @@
>
>  #ifndef __ASSEMBLY__
>
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> +#include <asm-generic/pgtable-nop4d.h>
>  #include <asm/page.h>
>  #include <asm/tlbflush.h>
>  #include <linux/mm_types.h>
> @@ -27,7 +26,7 @@
>
>  #ifdef CONFIG_MMU
>  #ifdef CONFIG_64BIT
> -#define VA_BITS		39
> +#define VA_BITS		(pgtable_l4_enabled ? 48 : 39)
>  #define PA_BITS		56
>  #else
>  #define VA_BITS		32
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 1c2fbefb8786..22617bd7477f 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -113,6 +113,8 @@ clear_bss_done:
>  	call setup_vm
>  #ifdef CONFIG_MMU
>  	la a0, early_pg_dir
> +	la a1, satp_mode
> +	REG_L a1, (a1)
>  	call relocate
>  #endif /* CONFIG_MMU */
>
> @@ -131,24 +133,28 @@ clear_bss_done:
>  #ifdef CONFIG_MMU
>  relocate:
>  #ifdef CONFIG_RELOCATABLE
> -	/* Relocate return address */
> -	la a1, kernel_virt_addr
> -	REG_L a1, 0(a1)
> +	/*
> +	 * Relocate return address but save it in case 4-level page table is
> +	 * not supported.
> +	 */
> +	mv s1, ra
> +	la a3, kernel_virt_addr
> +	REG_L a3, 0(a3)
>  #else
> -	li a1, PAGE_OFFSET
> +	li a3, PAGE_OFFSET
>  #endif
>  	la a2, _start
> -	sub a1, a1, a2
> -	add ra, ra, a1
> +	sub a3, a3, a2
> +	add ra, ra, a3
>
>  	/* Point stvec to virtual address of instruction after satp write */
>  	la a2, 1f
> -	add a2, a2, a1
> +	add a2, a2, a3
>  	csrw CSR_TVEC, a2
>
> +	/* First try with a 4-level page table */
>  	/* Compute satp for kernel page tables, but don't load it yet */
>  	srl a2, a0, PAGE_SHIFT
> -	li a1, SATP_MODE
>  	or a2, a2, a1
>
>  	/*
> @@ -162,6 +168,19 @@ relocate:
>  	or a0, a0, a1
>  	sfence.vma
>  	csrw CSR_SATP, a0
> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> +	/*
> +	 * If we fall through here, that means the HW does not support SV48.
> +	 * We need a 3-level page table, so simply fold the pud into the pgd
> +	 * level and finally jump back to relocate with 3-level parameters.
> +	 */
> +	call setup_vm_fold_pud
> +
> +	la a0, early_pg_dir
> +	li a1, SATP_MODE_39
> +	mv ra, s1
> +	tail relocate
> +#endif
>  .align 2
>  1:
>  	/* Set trap vector to spin forever to help debug */
> @@ -213,6 +232,8 @@ relocate:
>  #ifdef CONFIG_MMU
>  	/* Enable virtual memory and relocate to virtual address */
>  	la a0, swapper_pg_dir
> +	la a1, satp_mode
> +	REG_L a1, (a1)
>  	call relocate
>  #endif
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 613ec81a8979..152b423c02ea 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -9,6 +9,8 @@
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>
> +extern uint64_t satp_mode;
> +
>  /*
>   * When necessary, performs a deferred icache flush for the given MM context,
>   * on the local CPU.  RISC-V has no direct mechanism for instruction cache
> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	cpumask_set_cpu(cpu, mm_cpumask(next));
>
>  #ifdef CONFIG_MMU
> -	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
> +	csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
>  	local_flush_tlb_all();
>  #endif
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 18bbb426848e..ad96667d2ab6 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -24,6 +24,17 @@
>
>  #include "../kernel/head.h"
>
> +#ifdef CONFIG_64BIT
> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
> +				SATP_MODE_39 : SATP_MODE_48;
> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
> +#else
> +uint64_t satp_mode = SATP_MODE_32;
> +bool pgtable_l4_enabled = false;
> +#endif
> +EXPORT_SYMBOL(pgtable_l4_enabled);
> +EXPORT_SYMBOL(satp_mode);
> +
>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>  							__page_aligned_bss;
>  EXPORT_SYMBOL(empty_zero_page);
> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
>
>  #ifndef __PAGETABLE_PMD_FOLDED
>
> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>  pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>
>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>  {
> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>  	if (mmu_enabled)
>  		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>
> -	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +	/* Only one PMD is available for early mapping */
> +	BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
>
>  	return (uintptr_t)early_pmd;
>  }
> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>  	create_pte_mapping(ptep, va, pa, sz, prot);
>  }
>
> -#define pgd_next_t		pmd_t
> -#define alloc_pgd_next(__va)	alloc_pmd(__va)
> -#define get_pgd_next_virt(__pa)	get_pmd_virt(__pa)
> +static pud_t *__init get_pud_virt(phys_addr_t pa)
> +{
> +	if (mmu_enabled) {
> +		clear_fixmap(FIX_PUD);
> +		return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> +	} else {
> +		return (pud_t *)((uintptr_t)pa);
> +	}
> +}
> +
> +static phys_addr_t __init alloc_pud(uintptr_t va)
> +{
> +	if (mmu_enabled)
> +		return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +
> +	/* Only one PUD is available for early mapping */
> +	BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> +
> +	return (uintptr_t)early_pud;
> +}
> +
> +static void __init create_pud_mapping(pud_t *pudp,
> +				      uintptr_t va, phys_addr_t pa,
> +				      phys_addr_t sz, pgprot_t prot)
> +{
> +	pmd_t *nextp;
> +	phys_addr_t next_phys;
> +	uintptr_t pud_index = pud_index(va);
> +
> +	if (sz == PUD_SIZE) {
> +		if (pud_val(pudp[pud_index]) == 0)
> +			pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> +		return;
> +	}
> +
> +	if (pud_val(pudp[pud_index]) == 0) {
> +		next_phys = alloc_pmd(va);
> +		pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> +		nextp = get_pmd_virt(next_phys);
> +		memset(nextp, 0, PAGE_SIZE);
> +	} else {
> +		next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> +		nextp = get_pmd_virt(next_phys);
> +	}
> +
> +	create_pmd_mapping(nextp, va, pa, sz, prot);
> +}
> +
> +#define pgd_next_t		pud_t
> +#define alloc_pgd_next(__va)	alloc_pud(__va)
> +#define get_pgd_next_virt(__pa)	get_pud_virt(__pa)
>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
> -	create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next		fixmap_pmd
> +	create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
> +#define fixmap_pgd_next		(pgtable_l4_enabled ?			\
> +			(uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> +#define trampoline_pgd_next	(pgtable_l4_enabled ?			\
> +			(uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>  #else
>  #define pgd_next_t		pte_t
>  #define alloc_pgd_next(__va)	alloc_pte(__va)
>  #define get_pgd_next_virt(__pa)	get_pte_virt(__pa)
>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)	\
>  	create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next		fixmap_pte
> +#define fixmap_pgd_next		((uintptr_t)fixmap_pte)
>  #endif
>
>  static void __init create_pgd_mapping(pgd_t *pgdp,
> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>  	phys_addr_t next_phys;
>  	uintptr_t pgd_index = pgd_index(va);
>
> +#ifndef __PAGETABLE_PMD_FOLDED
> +	if (!pgtable_l4_enabled) {
> +		create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
> +		return;
> +	}
> +#endif
> +
>  	if (sz == PGDIR_SIZE) {
>  		if (pgd_val(pgdp[pgd_index]) == 0)
>  			pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
>  	/* Setup early PGD for fixmap */
>  	create_pgd_mapping(early_pg_dir, FIXADDR_START,
> -			   (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> +			   fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>
>  #ifndef __PAGETABLE_PMD_FOLDED
> -	/* Setup fixmap PMD */
> +	/* Setup fixmap PUD and PMD */
> +	if (pgtable_l4_enabled)
> +		create_pud_mapping(fixmap_pud, FIXADDR_START,
> +			   (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>  	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>  			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> +
>  	/* Setup trampoline PGD and PMD */
>  	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> -			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> +			   trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> +	if (pgtable_l4_enabled)
> +		create_pud_mapping(trampoline_pud, PAGE_OFFSET,
> +			   (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>  	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>  			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>  #else
> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  	dtb_early_pa = dtb_pa;
>  }
>
> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> +/*
> + * This function is called only if the current kernel is 64bit and the HW
> + * does not support sv48.
> + */
> +asmlinkage __init void setup_vm_fold_pud(void)
> +{
> +	pgtable_l4_enabled = false;
> +	kernel_virt_addr = PAGE_OFFSET_L3;
> +	satp_mode = SATP_MODE_39;
> +
> +	/*
> +	 * PTE/PMD levels do not need to be cleared as they are common between
> +	 * 3- and 4-level page tables: the 30 least significant bits
> +	 * (2 * 9 + 12) are common.
> +	 */
> +	memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> +	memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> +
> +	setup_vm(dtb_early_pa);
> +}
> +#endif
> +
>  static void __init setup_vm_final(void)
>  {
>  	uintptr_t va, map_size;
> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
>  		}
>  	}
>
> -	/* Clear fixmap PTE and PMD mappings */
> +	/* Clear fixmap page table mappings */
>  	clear_fixmap(FIX_PTE);
>  	clear_fixmap(FIX_PMD);
> +	clear_fixmap(FIX_PUD);
>
>  	/* Move to swapper page table */
> -	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> +	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
>  	local_flush_tlb_all();
>  }
>  #else


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  2020-03-22 11:00 ` [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo Alexandre Ghiti
  2020-03-26  7:01   ` Anup Patel
@ 2020-04-03 15:53   ` Palmer Dabbelt
  2020-04-07  5:14     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:26 PDT (-0700), alex@ghiti.fr wrote:
> Now that the mmu type is determined at runtime using the SATP
> characteristic, use the global variable pgtable_l4_enabled to output the
> mmu type of the processor through /proc/cpuinfo instead of relying on
> device-tree info.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/boot/dts/sifive/fu540-c000.dtsi |  4 ----
>  arch/riscv/kernel/cpu.c                    | 24 ++++++++++++----------
>  2 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> index 7db861053483..6138590a2229 100644
> --- a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> +++ b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
> @@ -50,7 +50,6 @@
>  			i-cache-size = <32768>;
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
> -			mmu-type = "riscv,sv39";
>  			reg = <1>;
>  			riscv,isa = "rv64imafdc";
>  			tlb-split;
> @@ -74,7 +73,6 @@
>  			i-cache-size = <32768>;
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
> -			mmu-type = "riscv,sv39";
>  			reg = <2>;
>  			riscv,isa = "rv64imafdc";
>  			tlb-split;
> @@ -98,7 +96,6 @@
>  			i-cache-size = <32768>;
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
> -			mmu-type = "riscv,sv39";
>  			reg = <3>;
>  			riscv,isa = "rv64imafdc";
>  			tlb-split;
> @@ -122,7 +119,6 @@
>  			i-cache-size = <32768>;
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
> -			mmu-type = "riscv,sv39";
>  			reg = <4>;
>  			riscv,isa = "rv64imafdc";
>  			tlb-split;
> diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
> index 40a3c442ac5f..38a699b997a8 100644
> --- a/arch/riscv/kernel/cpu.c
> +++ b/arch/riscv/kernel/cpu.c
> @@ -8,6 +8,8 @@
>  #include <linux/of.h>
>  #include <asm/smp.h>
>
> +extern bool pgtable_l4_enabled;
> +
>  /*
>   * Returns the hart ID of the given device tree node, or -ENODEV if the node
>   * isn't an enabled and valid RISC-V hart node.
> @@ -54,18 +56,19 @@ static void print_isa(struct seq_file *f, const char *isa)
>  	seq_puts(f, "\n");
>  }
>
> -static void print_mmu(struct seq_file *f, const char *mmu_type)
> +static void print_mmu(struct seq_file *f)
>  {
> +	char sv_type[16];
> +
>  #if defined(CONFIG_32BIT)
> -	if (strcmp(mmu_type, "riscv,sv32") != 0)
> -		return;
> +	strncpy(sv_type, "sv32", 5);
>  #elif defined(CONFIG_64BIT)
> -	if (strcmp(mmu_type, "riscv,sv39") != 0 &&
> -	    strcmp(mmu_type, "riscv,sv48") != 0)
> -		return;
> +	if (pgtable_l4_enabled)
> +		strncpy(sv_type, "sv48", 5);
> +	else
> +		strncpy(sv_type, "sv39", 5);
>  #endif
> -
> -	seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
> +	seq_printf(f, "mmu\t\t: %s\n", sv_type);
>  }
>
>  static void *c_start(struct seq_file *m, loff_t *pos)
> @@ -90,14 +93,13 @@ static int c_show(struct seq_file *m, void *v)
>  {
>  	unsigned long cpu_id = (unsigned long)v - 1;
>  	struct device_node *node = of_get_cpu_node(cpu_id, NULL);
> -	const char *compat, *isa, *mmu;
> +	const char *compat, *isa;
>
>  	seq_printf(m, "processor\t: %lu\n", cpu_id);
>  	seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
>  	if (!of_property_read_string(node, "riscv,isa", &isa))
>  		print_isa(m, isa);
> -	if (!of_property_read_string(node, "mmu-type", &mmu))
> -		print_mmu(m, mmu);
> +	print_mmu(m);
>  	if (!of_property_read_string(node, "compatible", &compat)
>  	    && strcmp(compat, "riscv"))
>  		seq_printf(m, "uarch\t\t: %s\n", compat);

Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>



* Re: [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree
  2020-03-22 11:00 ` [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree Alexandre Ghiti
  2020-03-26  7:03   ` Anup Patel
@ 2020-04-03 15:53   ` Palmer Dabbelt
  2020-04-07  5:14     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:27 PDT (-0700), alex@ghiti.fr wrote:
> This property cannot be used before virtual memory is set up, and the
> distinction between sv39 and sv48 is then made at runtime using the SATP
> csr: this property is now useless, so remove it.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  Documentation/devicetree/bindings/riscv/cpus.yaml | 13 -------------
>  1 file changed, 13 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml b/Documentation/devicetree/bindings/riscv/cpus.yaml
> index 04819ad379c2..12baabbac213 100644
> --- a/Documentation/devicetree/bindings/riscv/cpus.yaml
> +++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
> @@ -39,19 +39,6 @@ properties:
>        Identifies that the hart uses the RISC-V instruction set
>        and identifies the type of the hart.
>
> -  mmu-type:
> -    allOf:
> -      - $ref: "/schemas/types.yaml#/definitions/string"
> -      - enum:
> -          - riscv,sv32
> -          - riscv,sv39
> -          - riscv,sv48
> -    description:
> -      Identifies the MMU address translation mode used on this
> -      hart.  These values originate from the RISC-V Privileged
> -      Specification document, available from
> -      https://riscv.org/specifications/
> -
>    riscv,isa:
>      allOf:
>        - $ref: "/schemas/types.yaml#/definitions/string"

I'd prefer if we continue to define this in the schema: while Linux won't use
it, it's still useful for other programs that want to statically determine the
available VA widths.



* Re: [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size
  2020-03-22 11:00 ` [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size Alexandre Ghiti
  2020-03-26  7:05   ` Anup Patel
@ 2020-04-03 15:53   ` Palmer Dabbelt
  2020-04-07  5:15     ` Alex Ghiti
  1 sibling, 1 reply; 35+ messages in thread
From: Palmer Dabbelt @ 2020-04-03 15:53 UTC (permalink / raw)
  To: alex
  Cc: alex, anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig

On Sun, 22 Mar 2020 04:00:28 PDT (-0700), alex@ghiti.fr wrote:
> Define precisely the size of the user-accessible virtual address space
> for the sv32/39/48 mmu types and explain why the whole virtual address
> space is split into 2 equal chunks between kernel and user space.
>
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  arch/riscv/include/asm/pgtable.h | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 06361db3f486..be117a0b4ea1 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -456,8 +456,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>
>  /*
> - * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
> - * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> + * Task size is:
> + * -     0x9fc00000 (~2.5GB) for RV32.
> + * -   0x4000000000 ( 256GB) for RV64 using SV39 mmu
> + * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
> + *
> + * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
> + * Instruction Set Manual Volume II: Privileged Architecture" states that
> + * "load and store effective addresses, which are 64bits, must have bits
> + * 63–48 all equal to bit 47, or else a page-fault exception will occur."
>   */
>  #ifdef CONFIG_64BIT
>  #define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)

Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>



* Re: [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  2020-04-03 15:17   ` Palmer Dabbelt
@ 2020-04-07  5:12     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:12 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig



On 4/3/20 11:17 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:22 PDT (-0700), alex@ghiti.fr wrote:
>> There is no need to compare the MAX_EARLY_MAPPING_SIZE value with
>> PGDIR_SIZE at compile time since MAX_EARLY_MAPPING_SIZE is set to 128MB,
>> which is less than PGDIR_SIZE (1GB): that allows us to simplify the
>> early_pmd definition.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/mm/init.c | 16 ++++------------
>>  1 file changed, 4 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 238bd0033c3f..18bbb426848e 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -247,13 +247,7 @@ static void __init create_pte_mapping(pte_t *ptep,
>>
>>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>> -
>> -#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
>> -#define NUM_EARLY_PMDS        1UL
>> -#else
>> -#define NUM_EARLY_PMDS        (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
>> -#endif
>> -pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
>> +pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>>
>>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>>  {
>> @@ -267,14 +261,12 @@ static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>>
>>  static phys_addr_t __init alloc_pmd(uintptr_t va)
>>  {
>> -    uintptr_t pmd_num;
>> -
>>      if (mmu_enabled)
>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> -    pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>> -    BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>> -    return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>> +    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>> +
>> +    return (uintptr_t)early_pmd;
>>  }
>>
>>  static void __init create_pmd_mapping(pmd_t *pmdp,
> 
> My specific worry here was that allyesconfig kernels are quite large, and
> that dropping the code to handle large kernels would make it even harder
> to boot them.  That said, I can't actually get one to boot so I'm happy
> to just push that off until later and drop the code we can't practically
> use.
> 
> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
> 
> Thanks!
> 

Thanks,

Alex



* Re: [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS
  2020-04-03 15:17   ` Palmer Dabbelt
@ 2020-04-07  5:12     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:12 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig


On 4/3/20 11:17 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:23 PDT (-0700), alex@ghiti.fr wrote:
>> With 4-level page table folding at runtime, we don't know at compile time
>> the size of the virtual address space so we must set VA_BITS dynamically
>> so that sparsemem reserves the right amount of memory for struct pages.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/Kconfig                 | 10 ----------
>>  arch/riscv/include/asm/pgtable.h   | 10 +++++++++-
>>  arch/riscv/include/asm/sparsemem.h |  2 +-
>>  3 files changed, 10 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index f5f3d474504d..8e4b1cbcf2c2 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -99,16 +99,6 @@ config ZONE_DMA32
>>      bool
>>      default y if 64BIT
>>
>> -config VA_BITS
>> -    int
>> -    default 32 if 32BIT
>> -    default 39 if 64BIT
>> -
>> -config PA_BITS
>> -    int
>> -    default 34 if 32BIT
>> -    default 56 if 64BIT
>> -
>>  config PAGE_OFFSET
>>      hex
>>      default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 185ffe3723ec..dce401eed1d3 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -26,6 +26,14 @@
>>  #endif /* CONFIG_64BIT */
>>
>>  #ifdef CONFIG_MMU
>> +#ifdef CONFIG_64BIT
>> +#define VA_BITS        39
>> +#define PA_BITS        56
>> +#else
>> +#define VA_BITS        32
>> +#define PA_BITS        34
> 
> We've moved to 32-bit physical addresses on rv32 in Linux.  The mismatch
> was causing too many issues in generic code.

Ok I missed this one, thanks.

> 
>> +#endif
>> +
>>  /* Number of entries in the page global directory */
>>  #define PTRS_PER_PGD    (PAGE_SIZE / sizeof(pgd_t))
>>  /* Number of entries in the page table */
>> @@ -108,7 +116,7 @@ extern pgd_t swapper_pg_dir[];
>>   * position vmemmap directly below the VMALLOC region.
>>   */
>>  #define VMEMMAP_SHIFT \
>> -    (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>> +    (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>>  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
>>  #define VMEMMAP_END    (VMALLOC_START - 1)
>>  #define VMEMMAP_START    (VMALLOC_START - VMEMMAP_SIZE)
>> diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
>> index 45a7018a8118..f08d72155bc8 100644
>> --- a/arch/riscv/include/asm/sparsemem.h
>> +++ b/arch/riscv/include/asm/sparsemem.h
>> @@ -4,7 +4,7 @@
>>  #define _ASM_RISCV_SPARSEMEM_H
>>
>>  #ifdef CONFIG_SPARSEMEM
>> -#define MAX_PHYSMEM_BITS    CONFIG_PA_BITS
>> +#define MAX_PHYSMEM_BITS    PA_BITS
>>  #define SECTION_SIZE_BITS    27
>>  #endif /* CONFIG_SPARSEMEM */
> 
> Aside from the 32-bit PA issue:
> 
> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>

Thanks,

Alex



* Re: [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config
  2020-04-03 15:53   ` Palmer Dabbelt
@ 2020-04-07  5:13     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:13 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig


On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:24 PDT (-0700), alex@ghiti.fr wrote:
>> Either the user specifies a maximum physical memory size of 2GB, or the
>> user lives with the system constraint, which is 128GB in 64BIT for now.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/Kconfig | 20 ++++++--------------
>>  1 file changed, 6 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 8e4b1cbcf2c2..a475c78e66bc 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -104,7 +104,7 @@ config PAGE_OFFSET
>>      default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>>      default 0x80000000 if 64BIT && !MMU
>>      default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
>> -    default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
>> +    default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>>
>>  config ARCH_FLATMEM_ENABLE
>>      def_bool y
>> @@ -216,19 +216,11 @@ config MODULE_SECTIONS
>>      bool
>>      select HAVE_MOD_ARCH_SPECIFIC
>>
>> -choice
>> -    prompt "Maximum Physical Memory"
>> -    default MAXPHYSMEM_2GB if 32BIT
>> -    default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
>> -    default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
>> -
>> -    config MAXPHYSMEM_2GB
>> -        bool "2GiB"
>> -    config MAXPHYSMEM_128GB
>> -        depends on 64BIT && CMODEL_MEDANY
>> -        bool "128GiB"
>> -endchoice
>> -
>> +config MAXPHYSMEM_2GB
>> +    bool "Maximum Physical Memory 2GiB"
>> +    default y if 32BIT
>> +    default y if 64BIT && CMODEL_MEDLOW
>> +    default n
>>
>>  config SMP
>>      bool "Symmetric Multi-Processing"
> 
> I'm not sure this actually helps with anything, but if it's all going
> away then it's fine.  Originally the 2G/128G stuff was there to allow
> for larger VA spaces in the future.

With runtime sv48 introduction, whatever we would have used here could 
have been wrong at runtime, so removing it was easier.

Alex



* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-04-03 15:53   ` Palmer Dabbelt
@ 2020-04-07  5:14     ` Alex Ghiti
  2020-04-07  5:56       ` Anup Patel
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:14 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig


On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@ghiti.fr wrote:
>> By adding a new 4th page table level, allow the 64bit kernel to address
>> 2^48 bytes of virtual address space: in practice, that roughly offers
>> ~160TB of virtual address space to userspace and allows up to 64TB of
>> physical memory.
>>
>> By default, the kernel will try to boot with a 4-level page table. If the
>> underlying hardware does not support it, we will automatically 
>> fallback to
>> a standard 3-level page table by folding the new PUD level into PGDIR
>> level.
>>
>> Early page table preparation is too early in the boot process to use any
>> device-tree entry, then in order to detect HW capabilities at runtime, we
>> use SATP feature that ignores writes with an unsupported mode. The 
>> current
>> mode used by the kernel is then made available through cpuinfo.
> 
> Ya, I think that's the right way to go about this.  There's no reason to
> rely on duplicate DT mechanisms for things the ISA defines for us.
> 
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/Kconfig                  |   6 +-
>>  arch/riscv/include/asm/csr.h        |   3 +-
>>  arch/riscv/include/asm/fixmap.h     |   1 +
>>  arch/riscv/include/asm/page.h       |  15 +++-
>>  arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
>>  arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
>>  arch/riscv/include/asm/pgtable.h    |   5 +-
>>  arch/riscv/kernel/head.S            |  37 ++++++--
>>  arch/riscv/mm/context.c             |   4 +-
>>  arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
>>  10 files changed, 302 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index a475c78e66bc..79560e94cc7c 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -66,6 +66,7 @@ config RISCV
>>      select ARCH_HAS_GCOV_PROFILE_ALL
>>      select HAVE_COPY_THREAD_TLS
>>      select HAVE_ARCH_KASAN if MMU && 64BIT
>> +    select RELOCATABLE if 64BIT
>>
>>  config ARCH_MMAP_RND_BITS_MIN
>>      default 18 if 64BIT
>> @@ -104,7 +105,7 @@ config PAGE_OFFSET
>>      default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>>      default 0x80000000 if 64BIT && !MMU
>>      default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
>> -    default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>> +    default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
>>
>>  config ARCH_FLATMEM_ENABLE
>>      def_bool y
>> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
>>  config FIX_EARLYCON_MEM
>>      def_bool MMU
>>
>> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime 
>> folded
>> +# on a 3-level page table when sv48 is not supported.
>>  config PGTABLE_LEVELS
>>      int
>> +    default 4 if 64BIT && RELOCATABLE
>>      default 3 if 64BIT
>>      default 2
> 
> I assume this means you're relying on relocation to move the kernel around
> independently of PAGE_OFFSET in order to fold in the missing page table 
> level?

Yes, relocation is needed to fall back to a 3-level page table and move 
PAGE_OFFSET accordingly.

> That seems reasonable, but it does impose a performance penalty as 
> relocatable
> kernels necessitate slower generated code.  Additionally, there will 
> likely be
> a performance penalty due to the extra memory access on TLB misses that is
> unnecessary for workloads that don't necessitate the longer VA width on
> machines that support it.

Sorry, I had no time to answer your previous mail regarding performance: 
I have no numbers. But the only penalty this patchset introduces on a 
3-level page table is the check in the page table management functions 
to know whether 4-level is activated or not. And, as you said, there is 
the extra cost of a relocatable kernel, which I had ignored since it is 
necessary anyway.

> 
> I think the best bet here would be to have a Kconfig option for the 
> number of
> page table levels (which could be MAXPHYSMEM or a second partially free
> parameter) and then another boolean argument along the lines of "also 
> support
> machines with smaller VA widths".  It seems best to turn on the largest VA
> width and support for folding by default, as I assume that's what 
> distros would
> do.

I'm not a big fan of a new Kconfig option to allow people to have a 
3-level page table, because that implies maintaining a new kernel 
configuration: even for us, having to compile 2 kernels each time we 
change something in the mm code would be painful.

I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to 
find the reserved regions so as not to overwrite one of them when 
copying the kernel to its new destination. After that, he loops back to 
setup_vm to re-create the mapping to the new kernel.
If that's the way we take for KASLR, we can follow the same path here: 
boot with 4-level by default, check what is wanted in the device tree, 
and if it is 3-level, loop back to setup_vm.

> 
> I didn't really look closely at the rest of this, but it generally 
> smells OK.
> The diff will need to be somewhat different for the next version, anyway :)
> 
> Thanks for doing this!
> 
>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
>> index 435b65532e29..3828d55af85e 100644
>> --- a/arch/riscv/include/asm/csr.h
>> +++ b/arch/riscv/include/asm/csr.h
>> @@ -40,11 +40,10 @@
>>  #ifndef CONFIG_64BIT
>>  #define SATP_PPN    _AC(0x003FFFFF, UL)
>>  #define SATP_MODE_32    _AC(0x80000000, UL)
>> -#define SATP_MODE    SATP_MODE_32
>>  #else
>>  #define SATP_PPN    _AC(0x00000FFFFFFFFFFF, UL)
>>  #define SATP_MODE_39    _AC(0x8000000000000000, UL)
>> -#define SATP_MODE    SATP_MODE_39
>> +#define SATP_MODE_48    _AC(0x9000000000000000, UL)
>>  #endif
>>
>>  /* Exception cause high bit - is an interrupt if set */
>> diff --git a/arch/riscv/include/asm/fixmap.h 
>> b/arch/riscv/include/asm/fixmap.h
>> index 42d2c42f3cc9..26e7799c5675 100644
>> --- a/arch/riscv/include/asm/fixmap.h
>> +++ b/arch/riscv/include/asm/fixmap.h
>> @@ -27,6 +27,7 @@ enum fixed_addresses {
>>      FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
>>      FIX_PTE,
>>      FIX_PMD,
>> +    FIX_PUD,
>>      FIX_EARLYCON_MEM_BASE,
>>      __end_of_fixed_addresses
>>  };
>> diff --git a/arch/riscv/include/asm/page.h 
>> b/arch/riscv/include/asm/page.h
>> index 691f2f9ded2f..f1a26a0690ef 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -32,11 +32,19 @@
>>   * physical memory (aligned on a page boundary).
>>   */
>>  #ifdef CONFIG_RELOCATABLE
>> -extern unsigned long kernel_virt_addr;
>>  #define PAGE_OFFSET        kernel_virt_addr
>> +
>> +#ifdef CONFIG_64BIT
>> +/*
>> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address 
>> space so
>> + * define the PAGE_OFFSET value for SV39.
>> + */
>> +#define PAGE_OFFSET_L3        0xffffffe000000000
>> +#define PAGE_OFFSET_L4        _AC(CONFIG_PAGE_OFFSET, UL)
>> +#endif /* CONFIG_64BIT */
>>  #else
>>  #define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
>> -#endif
>> +#endif /* CONFIG_RELOCATABLE */
>>
>>  #define KERN_VIRT_SIZE        -PAGE_OFFSET
>>
>> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
>>
>>  extern unsigned long max_low_pfn;
>>  extern unsigned long min_low_pfn;
>> +#ifdef CONFIG_RELOCATABLE
>> +extern unsigned long kernel_virt_addr;
>> +#endif
>>
>>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + 
>> va_pa_offset))
>>  #define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>> diff --git a/arch/riscv/include/asm/pgalloc.h 
>> b/arch/riscv/include/asm/pgalloc.h
>> index 3f601ee8233f..540eaa5a8658 100644
>> --- a/arch/riscv/include/asm/pgalloc.h
>> +++ b/arch/riscv/include/asm/pgalloc.h
>> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct 
>> *mm, pud_t *pud, pmd_t *pmd)
>>
>>      set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>  }
>> +
>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, 
>> pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled) {
>> +        unsigned long pfn = virt_to_pfn(pud);
>> +
>> +        set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +    }
>> +}
>> +
>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
>> +                     pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled) {
>> +        unsigned long pfn = virt_to_pfn(pud);
>> +
>> +        set_p4d_safe(p4d,
>> +                 __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +    }
>> +}
>> +
>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned 
>> long addr)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (pud_t *)__get_free_page(
>> +                GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
>> +    return NULL;
>> +}
>> +
>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        free_page((unsigned long)pud);
>> +}
>> +
>> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
>>  #endif /* __PAGETABLE_PMD_FOLDED */
>>
>>  #define pmd_pgtable(pmd)    pmd_page(pmd)
>> diff --git a/arch/riscv/include/asm/pgtable-64.h 
>> b/arch/riscv/include/asm/pgtable-64.h
>> index b15f70a1fdfa..cc4ffbe778f3 100644
>> --- a/arch/riscv/include/asm/pgtable-64.h
>> +++ b/arch/riscv/include/asm/pgtable-64.h
>> @@ -8,16 +8,32 @@
>>
>>  #include <linux/const.h>
>>
>> -#define PGDIR_SHIFT     30
>> +extern bool pgtable_l4_enabled;
>> +
>> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
>>  /* Size of region mapped by a page global directory */
>>  #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>>  #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>>
>> +/* pud is folded into pgd in case of 3-level page table */
>> +#define PUD_SHIFT    30
>> +#define PUD_SIZE    (_AC(1, UL) << PUD_SHIFT)
>> +#define PUD_MASK    (~(PUD_SIZE - 1))
>> +
>>  #define PMD_SHIFT       21
>>  /* Size of region mapped by a page middle directory */
>>  #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>>  #define PMD_MASK        (~(PMD_SIZE - 1))
>>
>> +/* Page Upper Directory entry */
>> +typedef struct {
>> +    unsigned long pud;
>> +} pud_t;
>> +
>> +#define pud_val(x)      ((x).pud)
>> +#define __pud(x)        ((pud_t) { (x) })
>> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
>> +
>>  /* Page Middle Directory entry */
>>  typedef struct {
>>      unsigned long pmd;
>> @@ -25,7 +41,6 @@ typedef struct {
>>
>>  #define pmd_val(x)      ((x).pmd)
>>  #define __pmd(x)        ((pmd_t) { (x) })
>> -
>>  #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
>>
>>  static inline int pud_present(pud_t pud)
>> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
>>      set_pud(pudp, __pud(0));
>>  }
>>
>> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
>> +{
>> +    return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
>> +}
>> +
>> +static inline unsigned long _pud_pfn(pud_t pud)
>> +{
>> +    return pud_val(pud) >> _PAGE_PFN_SHIFT;
>> +}
>> +
>>  static inline unsigned long pud_page_vaddr(pud_t pud)
>>  {
>>      return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
>> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
>>      return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>>  }
>>
>> +#define mm_pud_folded    mm_pud_folded
>> +static inline bool mm_pud_folded(struct mm_struct *mm)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return false;
>> +
>> +    return true;
>> +}
>> +
>>  #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>>
>>  static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
>> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>>  #define pmd_ERROR(e) \
>>      pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>>
>> +#define pud_ERROR(e)    \
>> +    pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
>> +
>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        *p4dp = p4d;
>> +    else
>> +        set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
>> +}
>> +
>> +static inline int p4d_none(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (p4d_val(p4d) == 0);
>> +
>> +    return 0;
>> +}
>> +
>> +static inline int p4d_present(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (p4d_val(p4d) & _PAGE_PRESENT);
>> +
>> +    return 1;
>> +}
>> +
>> +static inline int p4d_bad(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return !p4d_present(p4d);
>> +
>> +    return 0;
>> +}
>> +
>> +static inline void p4d_clear(p4d_t *p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        set_p4d(p4d, __p4d(0));
>> +}
>> +
>> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (unsigned long)pfn_to_virt(
>> +                p4d_val(p4d) >> _PAGE_PFN_SHIFT);
>> +
>> +    return pud_page_vaddr((pud_t) { p4d_val(p4d) });
>> +}
>> +
>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>> +
>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
>> +
>> +    return (pud_t *)p4d;
>> +}
>> +
>>  #endif /* _ASM_RISCV_PGTABLE_64_H */
>> diff --git a/arch/riscv/include/asm/pgtable.h 
>> b/arch/riscv/include/asm/pgtable.h
>> index dce401eed1d3..06361db3f486 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -13,8 +13,7 @@
>>
>>  #ifndef __ASSEMBLY__
>>
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> +#include <asm-generic/pgtable-nop4d.h>
>>  #include <asm/page.h>
>>  #include <asm/tlbflush.h>
>>  #include <linux/mm_types.h>
>> @@ -27,7 +26,7 @@
>>
>>  #ifdef CONFIG_MMU
>>  #ifdef CONFIG_64BIT
>> -#define VA_BITS        39
>> +#define VA_BITS        (pgtable_l4_enabled ? 48 : 39)
>>  #define PA_BITS        56
>>  #else
>>  #define VA_BITS        32
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 1c2fbefb8786..22617bd7477f 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -113,6 +113,8 @@ clear_bss_done:
>>      call setup_vm
>>  #ifdef CONFIG_MMU
>>      la a0, early_pg_dir
>> +    la a1, satp_mode
>> +    REG_L a1, (a1)
>>      call relocate
>>  #endif /* CONFIG_MMU */
>>
>> @@ -131,24 +133,28 @@ clear_bss_done:
>>  #ifdef CONFIG_MMU
>>  relocate:
>>  #ifdef CONFIG_RELOCATABLE
>> -    /* Relocate return address */
>> -    la a1, kernel_virt_addr
>> -    REG_L a1, 0(a1)
>> +    /*
>> +     * Relocate return address but save it in case 4-level page table is
>> +     * not supported.
>> +     */
>> +    mv s1, ra
>> +    la a3, kernel_virt_addr
>> +    REG_L a3, 0(a3)
>>  #else
>> -    li a1, PAGE_OFFSET
>> +    li a3, PAGE_OFFSET
>>  #endif
>>      la a2, _start
>> -    sub a1, a1, a2
>> -    add ra, ra, a1
>> +    sub a3, a3, a2
>> +    add ra, ra, a3
>>
>>      /* Point stvec to virtual address of intruction after satp write */
>>      la a2, 1f
>> -    add a2, a2, a1
>> +    add a2, a2, a3
>>      csrw CSR_TVEC, a2
>>
>> +    /* First try with a 4-level page table */
>>      /* Compute satp for kernel page tables, but don't load it yet */
>>      srl a2, a0, PAGE_SHIFT
>> -    li a1, SATP_MODE
>>      or a2, a2, a1
>>
>>      /*
>> @@ -162,6 +168,19 @@ relocate:
>>      or a0, a0, a1
>>      sfence.vma
>>      csrw CSR_SATP, a0
>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>> +    /*
>> +     * If we fall through here, that means the HW does not support SV48.
>> +     * We need a 3-level page table then simply fold pud into pgd level
>> +     * and finally jump back to relocate with 3-level parameters.
>> +     */
>> +    call setup_vm_fold_pud
>> +
>> +    la a0, early_pg_dir
>> +    li a1, SATP_MODE_39
>> +    mv ra, s1
>> +    tail relocate
>> +#endif
>>  .align 2
>>  1:
>>      /* Set trap vector to spin forever to help debug */
>> @@ -213,6 +232,8 @@ relocate:
>>  #ifdef CONFIG_MMU
>>      /* Enable virtual memory and relocate to virtual address */
>>      la a0, swapper_pg_dir
>> +    la a1, satp_mode
>> +    REG_L a1, (a1)
>>      call relocate
>>  #endif
>>
>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>> index 613ec81a8979..152b423c02ea 100644
>> --- a/arch/riscv/mm/context.c
>> +++ b/arch/riscv/mm/context.c
>> @@ -9,6 +9,8 @@
>>  #include <asm/cacheflush.h>
>>  #include <asm/mmu_context.h>
>>
>> +extern uint64_t satp_mode;
>> +
>>  /*
>>   * When necessary, performs a deferred icache flush for the given MM 
>> context,
>>   * on the local CPU.  RISC-V has no direct mechanism for instruction 
>> cache
>> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct 
>> mm_struct *next,
>>      cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>>  #ifdef CONFIG_MMU
>> -    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
>> +    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
>>      local_flush_tlb_all();
>>  #endif
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 18bbb426848e..ad96667d2ab6 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -24,6 +24,17 @@
>>
>>  #include "../kernel/head.h"
>>
>> +#ifdef CONFIG_64BIT
>> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
>> +                SATP_MODE_39 : SATP_MODE_48;
>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : 
>> true;
>> +#else
>> +uint64_t satp_mode = SATP_MODE_32;
>> +bool pgtable_l4_enabled = false;
>> +#endif
>> +EXPORT_SYMBOL(pgtable_l4_enabled);
>> +EXPORT_SYMBOL(satp_mode);
>> +
>>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>                              __page_aligned_bss;
>>  EXPORT_SYMBOL(empty_zero_page);
>> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
>>
>>  #ifndef __PAGETABLE_PMD_FOLDED
>>
>> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>>  pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>>
>>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>>  {
>> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>      if (mmu_enabled)
>>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> -    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>> +    /* Only one PMD is available for early mapping */
>> +    BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
>>
>>      return (uintptr_t)early_pmd;
>>  }
>> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>>      create_pte_mapping(ptep, va, pa, sz, prot);
>>  }
>>
>> -#define pgd_next_t        pmd_t
>> -#define alloc_pgd_next(__va)    alloc_pmd(__va)
>> -#define get_pgd_next_virt(__pa)    get_pmd_virt(__pa)
>> +static pud_t *__init get_pud_virt(phys_addr_t pa)
>> +{
>> +    if (mmu_enabled) {
>> +        clear_fixmap(FIX_PUD);
>> +        return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>> +    } else {
>> +        return (pud_t *)((uintptr_t)pa);
>> +    }
>> +}
>> +
>> +static phys_addr_t __init alloc_pud(uintptr_t va)
>> +{
>> +    if (mmu_enabled)
>> +        return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>> +
>> +    /* Only one PUD is available for early mapping */
>> +    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>> +
>> +    return (uintptr_t)early_pud;
>> +}
>> +
>> +static void __init create_pud_mapping(pud_t *pudp,
>> +                      uintptr_t va, phys_addr_t pa,
>> +                      phys_addr_t sz, pgprot_t prot)
>> +{
>> +    pmd_t *nextp;
>> +    phys_addr_t next_phys;
>> +    uintptr_t pud_index = pud_index(va);
>> +
>> +    if (sz == PUD_SIZE) {
>> +        if (pud_val(pudp[pud_index]) == 0)
>> +            pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
>> +        return;
>> +    }
>> +
>> +    if (pud_val(pudp[pud_index]) == 0) {
>> +        next_phys = alloc_pmd(va);
>> +        pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
>> +        nextp = get_pmd_virt(next_phys);
>> +        memset(nextp, 0, PAGE_SIZE);
>> +    } else {
>> +        next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
>> +        nextp = get_pmd_virt(next_phys);
>> +    }
>> +
>> +    create_pmd_mapping(nextp, va, pa, sz, prot);
>> +}
>> +
>> +#define pgd_next_t        pud_t
>> +#define alloc_pgd_next(__va)    alloc_pud(__va)
>> +#define get_pgd_next_virt(__pa)    get_pud_virt(__pa)
>>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>> -    create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next        fixmap_pmd
>> +    create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
>> +#define fixmap_pgd_next        (pgtable_l4_enabled ?            \
>> +            (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
>> +#define trampoline_pgd_next    (pgtable_l4_enabled ?            \
>> +            (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>>  #else
>>  #define pgd_next_t        pte_t
>>  #define alloc_pgd_next(__va)    alloc_pte(__va)
>>  #define get_pgd_next_virt(__pa)    get_pte_virt(__pa)
>>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>>      create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next        fixmap_pte
>> +#define fixmap_pgd_next        ((uintptr_t)fixmap_pte)
>>  #endif
>>
>>  static void __init create_pgd_mapping(pgd_t *pgdp,
>> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>>      phys_addr_t next_phys;
>>      uintptr_t pgd_index = pgd_index(va);
>>
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +    if (!pgtable_l4_enabled) {
>> +        create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
>> +        return;
>> +    }
>> +#endif
>> +
>>      if (sz == PGDIR_SIZE) {
>>          if (pgd_val(pgdp[pgd_index]) == 0)
>>              pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
>> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>
>>      /* Setup early PGD for fixmap */
>>      create_pgd_mapping(early_pg_dir, FIXADDR_START,
>> -               (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +               fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>
>>  #ifndef __PAGETABLE_PMD_FOLDED
>> -    /* Setup fixmap PMD */
>> +    /* Setup fixmap PUD and PMD */
>> +    if (pgtable_l4_enabled)
>> +        create_pud_mapping(fixmap_pud, FIXADDR_START,
>> +               (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>> +
>>      /* Setup trampoline PGD and PMD */
>>      create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> -               (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> +               trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +    if (pgtable_l4_enabled)
>> +        create_pud_mapping(trampoline_pud, PAGE_OFFSET,
>> +               (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>>      create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>  #else
>> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>      dtb_early_pa = dtb_pa;
>>  }
>>
>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>> +/*
>> + * This function is called only if the current kernel is 64bit and 
>> the HW
>> + * does not support sv48.
>> + */
>> +asmlinkage __init void setup_vm_fold_pud(void)
>> +{
>> +    pgtable_l4_enabled = false;
>> +    kernel_virt_addr = PAGE_OFFSET_L3;
>> +    satp_mode = SATP_MODE_39;
>> +
>> +    /*
>> +     * PTE/PMD levels do not need to be cleared as they are common 
>> between
>> +     * 3- and 4-level page tables: the 30 least significant bits
>> +     * (2 * 9 + 12) are common.
>> +     */
>> +    memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>> +    memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>> +
>> +    setup_vm(dtb_early_pa);
>> +}
>> +#endif
>> +
>>  static void __init setup_vm_final(void)
>>  {
>>      uintptr_t va, map_size;
>> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
>>          }
>>      }
>>
>> -    /* Clear fixmap PTE and PMD mappings */
>> +    /* Clear fixmap page table mappings */
>>      clear_fixmap(FIX_PTE);
>>      clear_fixmap(FIX_PMD);
>> +    clear_fixmap(FIX_PUD);
>>
>>      /* Move to swapper page table */
>> -    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | 
>> SATP_MODE);
>> +    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | 
>> satp_mode);
>>      local_flush_tlb_all();
>>  }
>>  #else

Alex



* Re: [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  2020-04-03 15:53   ` Palmer Dabbelt
@ 2020-04-07  5:14     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:14 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig



On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:26 PDT (-0700), alex@ghiti.fr wrote:
>> Now that the mmu type is determined at runtime using SATP
>> characteristic, use the global variable pgtable_l4_enabled to output
>> mmu type of the processor through /proc/cpuinfo instead of relying on
>> device tree infos.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/boot/dts/sifive/fu540-c000.dtsi |  4 ----
>>  arch/riscv/kernel/cpu.c                    | 24 ++++++++++++----------
>>  2 files changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi 
>> b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
>> index 7db861053483..6138590a2229 100644
>> --- a/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
>> +++ b/arch/riscv/boot/dts/sifive/fu540-c000.dtsi
>> @@ -50,7 +50,6 @@
>>              i-cache-size = <32768>;
>>              i-tlb-sets = <1>;
>>              i-tlb-size = <32>;
>> -            mmu-type = "riscv,sv39";
>>              reg = <1>;
>>              riscv,isa = "rv64imafdc";
>>              tlb-split;
>> @@ -74,7 +73,6 @@
>>              i-cache-size = <32768>;
>>              i-tlb-sets = <1>;
>>              i-tlb-size = <32>;
>> -            mmu-type = "riscv,sv39";
>>              reg = <2>;
>>              riscv,isa = "rv64imafdc";
>>              tlb-split;
>> @@ -98,7 +96,6 @@
>>              i-cache-size = <32768>;
>>              i-tlb-sets = <1>;
>>              i-tlb-size = <32>;
>> -            mmu-type = "riscv,sv39";
>>              reg = <3>;
>>              riscv,isa = "rv64imafdc";
>>              tlb-split;
>> @@ -122,7 +119,6 @@
>>              i-cache-size = <32768>;
>>              i-tlb-sets = <1>;
>>              i-tlb-size = <32>;
>> -            mmu-type = "riscv,sv39";
>>              reg = <4>;
>>              riscv,isa = "rv64imafdc";
>>              tlb-split;
>> diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
>> index 40a3c442ac5f..38a699b997a8 100644
>> --- a/arch/riscv/kernel/cpu.c
>> +++ b/arch/riscv/kernel/cpu.c
>> @@ -8,6 +8,8 @@
>>  #include <linux/of.h>
>>  #include <asm/smp.h>
>>
>> +extern bool pgtable_l4_enabled;
>> +
>>  /*
>>   * Returns the hart ID of the given device tree node, or -ENODEV if 
>> the node
>>   * isn't an enabled and valid RISC-V hart node.
>> @@ -54,18 +56,19 @@ static void print_isa(struct seq_file *f, const 
>> char *isa)
>>      seq_puts(f, "\n");
>>  }
>>
>> -static void print_mmu(struct seq_file *f, const char *mmu_type)
>> +static void print_mmu(struct seq_file *f)
>>  {
>> +    char sv_type[16];
>> +
>>  #if defined(CONFIG_32BIT)
>> -    if (strcmp(mmu_type, "riscv,sv32") != 0)
>> -        return;
>> +    strncpy(sv_type, "sv32", 5);
>>  #elif defined(CONFIG_64BIT)
>> -    if (strcmp(mmu_type, "riscv,sv39") != 0 &&
>> -        strcmp(mmu_type, "riscv,sv48") != 0)
>> -        return;
>> +    if (pgtable_l4_enabled)
>> +        strncpy(sv_type, "sv48", 5);
>> +    else
>> +        strncpy(sv_type, "sv39", 5);
>>  #endif
>> -
>> -    seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
>> +    seq_printf(f, "mmu\t\t: %s\n", sv_type);
>>  }
>>
>>  static void *c_start(struct seq_file *m, loff_t *pos)
>> @@ -90,14 +93,13 @@ static int c_show(struct seq_file *m, void *v)
>>  {
>>      unsigned long cpu_id = (unsigned long)v - 1;
>>      struct device_node *node = of_get_cpu_node(cpu_id, NULL);
>> -    const char *compat, *isa, *mmu;
>> +    const char *compat, *isa;
>>
>>      seq_printf(m, "processor\t: %lu\n", cpu_id);
>>      seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
>>      if (!of_property_read_string(node, "riscv,isa", &isa))
>>          print_isa(m, isa);
>> -    if (!of_property_read_string(node, "mmu-type", &mmu))
>> -        print_mmu(m, mmu);
>> +    print_mmu(m);
>>      if (!of_property_read_string(node, "compatible", &compat)
>>          && strcmp(compat, "riscv"))
>>          seq_printf(m, "uarch\t\t: %s\n", compat);
> 
> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>

Thanks,

Alex



* Re: [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree
  2020-04-03 15:53   ` Palmer Dabbelt
@ 2020-04-07  5:14     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:14 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig


On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:27 PDT (-0700), alex@ghiti.fr wrote:
>> This property can not be used before virtual memory is set up
>> and then the  distinction between sv39 and sv48 is done at runtime
>> using SATP csr property: this property is now useless, so remove it.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  Documentation/devicetree/bindings/riscv/cpus.yaml | 13 -------------
>>  1 file changed, 13 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml 
>> b/Documentation/devicetree/bindings/riscv/cpus.yaml
>> index 04819ad379c2..12baabbac213 100644
>> --- a/Documentation/devicetree/bindings/riscv/cpus.yaml
>> +++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
>> @@ -39,19 +39,6 @@ properties:
>>        Identifies that the hart uses the RISC-V instruction set
>>        and identifies the type of the hart.
>>
>> -  mmu-type:
>> -    allOf:
>> -      - $ref: "/schemas/types.yaml#/definitions/string"
>> -      - enum:
>> -          - riscv,sv32
>> -          - riscv,sv39
>> -          - riscv,sv48
>> -    description:
>> -      Identifies the MMU address translation mode used on this
>> -      hart.  These values originate from the RISC-V Privileged
>> -      Specification document, available from
>> -      https://riscv.org/specifications/
>> -
>>    riscv,isa:
>>      allOf:
>>        - $ref: "/schemas/types.yaml#/definitions/string"
> 
> I'd prefer if we continue to define this in the schema: while Linux 
> won't use
> it, it's still useful for other programs that want to statically 
> determine the
> available VA widths.

Sure, I'll remove that in next version.

Thanks,

Alex



* Re: [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size
  2020-04-03 15:53   ` Palmer Dabbelt
@ 2020-04-07  5:15     ` Alex Ghiti
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Ghiti @ 2020-04-07  5:15 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: anup, linux-kernel, zong.li, Paul Walmsley, linux-riscv,
	Christoph Hellwig



On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> On Sun, 22 Mar 2020 04:00:28 PDT (-0700), alex@ghiti.fr wrote:
>> Define precisely the size of the user-accessible virtual address space
>> for sv32/39/48 mmu types and explain why the whole virtual address
>> space is split into 2 equal chunks between kernel and user space.
>>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> ---
>>  arch/riscv/include/asm/pgtable.h | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 06361db3f486..be117a0b4ea1 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -456,8 +456,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>  #define FIXADDR_START    (FIXADDR_TOP - FIXADDR_SIZE)
>>
>>  /*
>> - * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
>> - * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
>> + * Task size is:
>> + * -     0x9fc00000 (~2.5GB) for RV32.
>> + * -   0x4000000000 ( 256GB) for RV64 using SV39 mmu
>> + * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
>> + *
>> + * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
>> + * Instruction Set Manual Volume II: Privileged Architecture" states that
>> + * "load and store effective addresses, which are 64bits, must have bits
>> + * 63–48 all equal to bit 47, or else a page-fault exception will occur."
>>   */
>>  #ifdef CONFIG_64BIT
>>  #define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> 
> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>

Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-04-07  5:14     ` Alex Ghiti
@ 2020-04-07  5:56       ` Anup Patel
  2020-04-08  4:39         ` Alex Ghiti
  0 siblings, 1 reply; 35+ messages in thread
From: Anup Patel @ 2020-04-07  5:56 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Tue, Apr 7, 2020 at 10:44 AM Alex Ghiti <alex@ghiti.fr> wrote:
>
>
> On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> > On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@ghiti.fr wrote:
> >> By adding a new 4th level of page table, give the possibility to 64bit
> >> kernel to address 2^48 bytes of virtual address: in practice, that roughly
> >> offers ~160TB of virtual address space to userspace and allows up to 64TB
> >> of physical memory.
> >>
> >> By default, the kernel will try to boot with a 4-level page table. If the
> >> underlying hardware does not support it, we will automatically
> >> fallback to
> >> a standard 3-level page table by folding the new PUD level into PGDIR
> >> level.
> >>
> >> Early page table preparation is too early in the boot process to use any
> >> device-tree entry, then in order to detect HW capabilities at runtime, we
> >> use SATP feature that ignores writes with an unsupported mode. The current
> >> mode used by the kernel is then made available through cpuinfo.
> >
> > Ya, I think that's the right way to go about this.  There's no reason to
> > rely on duplicate DT mechanisms for things the ISA defines for us.
> >
> >>
> >> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> >> ---
> >>  arch/riscv/Kconfig                  |   6 +-
> >>  arch/riscv/include/asm/csr.h        |   3 +-
> >>  arch/riscv/include/asm/fixmap.h     |   1 +
> >>  arch/riscv/include/asm/page.h       |  15 +++-
> >>  arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
> >>  arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
> >>  arch/riscv/include/asm/pgtable.h    |   5 +-
> >>  arch/riscv/kernel/head.S            |  37 ++++++--
> >>  arch/riscv/mm/context.c             |   4 +-
> >>  arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
> >>  10 files changed, 302 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >> index a475c78e66bc..79560e94cc7c 100644
> >> --- a/arch/riscv/Kconfig
> >> +++ b/arch/riscv/Kconfig
> >> @@ -66,6 +66,7 @@ config RISCV
> >>      select ARCH_HAS_GCOV_PROFILE_ALL
> >>      select HAVE_COPY_THREAD_TLS
> >>      select HAVE_ARCH_KASAN if MMU && 64BIT
> >> +    select RELOCATABLE if 64BIT
> >>
> >>  config ARCH_MMAP_RND_BITS_MIN
> >>      default 18 if 64BIT
> >> @@ -104,7 +105,7 @@ config PAGE_OFFSET
> >>      default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> >>      default 0x80000000 if 64BIT && !MMU
> >>      default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> >> -    default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
> >> +    default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
> >>
> >>  config ARCH_FLATMEM_ENABLE
> >>      def_bool y
> >> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
> >>  config FIX_EARLYCON_MEM
> >>      def_bool MMU
> >>
> >> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
> >> +# on a 3-level page table when sv48 is not supported.
> >>  config PGTABLE_LEVELS
> >>      int
> >> +    default 4 if 64BIT && RELOCATABLE
> >>      default 3 if 64BIT
> >>      default 2
> >
> > I assume this means you're relying on relocation to move the kernel around
> > independently of PAGE_OFFSET in order to fold in the missing page table
> > level?
>
> Yes, relocation is needed to fallback to 3-level and move PAGE_OFFSET
> accordingly.
>
> > That seems reasonable, but it does impose a performance penalty as
> > relocatable
> > kernels necessitate slower generated code.  Additionally, there will
> > likely be
> > a performance penalty due to the extra memory access on TLB misses that is
> > unnecessary for workloads that don't necessitate the longer VA width on
> > machines that support it.
>
> Sorry, I had no time to answer your previous mail regarding performance:
> I have no numbers. The only penalty this patchset imposes on a 3-level
> page table is the check in the page table management functions to know
> whether 4-level is enabled. And, as you said, there is the extra cost of
> a relocatable kernel, which I had ignored since it is necessary anyway.

I guess we don't need relocation if we can avoid page table folding by
detecting Sv48 mode very early in setup_vm(). Is there any other place
where relocation would be required?

If we can totally avoid relocation then it will certainly help performance.

Regards,
Anup

>
> >
> > I think the best bet here would be to have a Kconfig option for the
> > number of
> > page table levels (which could be MAXPHYSMEM or a second partially free
> > parameter) and then another boolean argument along the lines of "also
> > support
> > machines with smaller VA widths".  It seems best to turn on the largest VA
> > width and support for folding by default, as I assume that's what
> > distros would
> > do.
>
> I'm not a big fan of a new Kconfig option to allow people to have a
> 3-level page table because that implies maintaining a new kernel, even
> for us, having to compile 2 kernels each time we change something to mm
> code will be painful.
>
> I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to
> find out the reserved regions in order to not override one of them when
> copying the kernel to its new destination. And after that, he loops back
> to setup_vm to re-create the mapping to the new kernel.
> If that's the way we take for KASLR, we can follow the same path here:
> boot with 4-level by default, go to check what is wanted in the device
> tree and if it is 3-level, loop back to setup_vm.
>
> >
> > I didn't really look closely at the rest of this, but it generally
> > smells OK.
> > The diff will need to be somewhat different for the next version, anyway :)
> >
> > Thanks for doing this!
> >
> >> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> >> index 435b65532e29..3828d55af85e 100644
> >> --- a/arch/riscv/include/asm/csr.h
> >> +++ b/arch/riscv/include/asm/csr.h
> >> @@ -40,11 +40,10 @@
> >>  #ifndef CONFIG_64BIT
> >>  #define SATP_PPN    _AC(0x003FFFFF, UL)
> >>  #define SATP_MODE_32    _AC(0x80000000, UL)
> >> -#define SATP_MODE    SATP_MODE_32
> >>  #else
> >>  #define SATP_PPN    _AC(0x00000FFFFFFFFFFF, UL)
> >>  #define SATP_MODE_39    _AC(0x8000000000000000, UL)
> >> -#define SATP_MODE    SATP_MODE_39
> >> +#define SATP_MODE_48    _AC(0x9000000000000000, UL)
> >>  #endif
> >>
> >>  /* Exception cause high bit - is an interrupt if set */
> >> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> >> index 42d2c42f3cc9..26e7799c5675 100644
> >> --- a/arch/riscv/include/asm/fixmap.h
> >> +++ b/arch/riscv/include/asm/fixmap.h
> >> @@ -27,6 +27,7 @@ enum fixed_addresses {
> >>      FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
> >>      FIX_PTE,
> >>      FIX_PMD,
> >> +    FIX_PUD,
> >>      FIX_EARLYCON_MEM_BASE,
> >>      __end_of_fixed_addresses
> >>  };
> >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >> index 691f2f9ded2f..f1a26a0690ef 100644
> >> --- a/arch/riscv/include/asm/page.h
> >> +++ b/arch/riscv/include/asm/page.h
> >> @@ -32,11 +32,19 @@
> >>   * physical memory (aligned on a page boundary).
> >>   */
> >>  #ifdef CONFIG_RELOCATABLE
> >> -extern unsigned long kernel_virt_addr;
> >>  #define PAGE_OFFSET        kernel_virt_addr
> >> +
> >> +#ifdef CONFIG_64BIT
> >> +/*
> >> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> >> + * define the PAGE_OFFSET value for SV39.
> >> + */
> >> +#define PAGE_OFFSET_L3        0xffffffe000000000
> >> +#define PAGE_OFFSET_L4        _AC(CONFIG_PAGE_OFFSET, UL)
> >> +#endif /* CONFIG_64BIT */
> >>  #else
> >>  #define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
> >> -#endif
> >> +#endif /* CONFIG_RELOCATABLE */
> >>
> >>  #define KERN_VIRT_SIZE        -PAGE_OFFSET
> >>
> >> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
> >>
> >>  extern unsigned long max_low_pfn;
> >>  extern unsigned long min_low_pfn;
> >> +#ifdef CONFIG_RELOCATABLE
> >> +extern unsigned long kernel_virt_addr;
> >> +#endif
> >>
> >>  #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + va_pa_offset))
> >>  #define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
> >> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> >> index 3f601ee8233f..540eaa5a8658 100644
> >> --- a/arch/riscv/include/asm/pgalloc.h
> >> +++ b/arch/riscv/include/asm/pgalloc.h
> >> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
> >>
> >>      set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>  }
> >> +
> >> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> >> +{
> >> +    if (pgtable_l4_enabled) {
> >> +        unsigned long pfn = virt_to_pfn(pud);
> >> +
> >> +        set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >> +    }
> >> +}
> >> +
> >> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> >> +                     pud_t *pud)
> >> +{
> >> +    if (pgtable_l4_enabled) {
> >> +        unsigned long pfn = virt_to_pfn(pud);
> >> +
> >> +        set_p4d_safe(p4d,
> >> +                 __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >> +    }
> >> +}
> >> +
> >> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return (pud_t *)__get_free_page(
> >> +                GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
> >> +    return NULL;
> >> +}
> >> +
> >> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        free_page((unsigned long)pud);
> >> +}
> >> +
> >> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
> >>  #endif /* __PAGETABLE_PMD_FOLDED */
> >>
> >>  #define pmd_pgtable(pmd)    pmd_page(pmd)
> >> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> >> index b15f70a1fdfa..cc4ffbe778f3 100644
> >> --- a/arch/riscv/include/asm/pgtable-64.h
> >> +++ b/arch/riscv/include/asm/pgtable-64.h
> >> @@ -8,16 +8,32 @@
> >>
> >>  #include <linux/const.h>
> >>
> >> -#define PGDIR_SHIFT     30
> >> +extern bool pgtable_l4_enabled;
> >> +
> >> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
> >>  /* Size of region mapped by a page global directory */
> >>  #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
> >>  #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
> >>
> >> +/* pud is folded into pgd in case of 3-level page table */
> >> +#define PUD_SHIFT    30
> >> +#define PUD_SIZE    (_AC(1, UL) << PUD_SHIFT)
> >> +#define PUD_MASK    (~(PUD_SIZE - 1))
> >> +
> >>  #define PMD_SHIFT       21
> >>  /* Size of region mapped by a page middle directory */
> >>  #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
> >>  #define PMD_MASK        (~(PMD_SIZE - 1))
> >>
> >> +/* Page Upper Directory entry */
> >> +typedef struct {
> >> +    unsigned long pud;
> >> +} pud_t;
> >> +
> >> +#define pud_val(x)      ((x).pud)
> >> +#define __pud(x)        ((pud_t) { (x) })
> >> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
> >> +
> >>  /* Page Middle Directory entry */
> >>  typedef struct {
> >>      unsigned long pmd;
> >> @@ -25,7 +41,6 @@ typedef struct {
> >>
> >>  #define pmd_val(x)      ((x).pmd)
> >>  #define __pmd(x)        ((pmd_t) { (x) })
> >> -
> >>  #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
> >>
> >>  static inline int pud_present(pud_t pud)
> >> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
> >>      set_pud(pudp, __pud(0));
> >>  }
> >>
> >> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> >> +{
> >> +    return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> >> +}
> >> +
> >> +static inline unsigned long _pud_pfn(pud_t pud)
> >> +{
> >> +    return pud_val(pud) >> _PAGE_PFN_SHIFT;
> >> +}
> >> +
> >>  static inline unsigned long pud_page_vaddr(pud_t pud)
> >>  {
> >>      return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
> >>      return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >>  }
> >>
> >> +#define mm_pud_folded    mm_pud_folded
> >> +static inline bool mm_pud_folded(struct mm_struct *mm)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return false;
> >> +
> >> +    return true;
> >> +}
> >> +
> >>  #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> >>
> >>  static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
> >> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> >>  #define pmd_ERROR(e) \
> >>      pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
> >>
> >> +#define pud_ERROR(e)    \
> >> +    pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> >> +
> >> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        *p4dp = p4d;
> >> +    else
> >> +        set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> >> +}
> >> +
> >> +static inline int p4d_none(p4d_t p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return (p4d_val(p4d) == 0);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static inline int p4d_present(p4d_t p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return (p4d_val(p4d) & _PAGE_PRESENT);
> >> +
> >> +    return 1;
> >> +}
> >> +
> >> +static inline int p4d_bad(p4d_t p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return !p4d_present(p4d);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static inline void p4d_clear(p4d_t *p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        set_p4d(p4d, __p4d(0));
> >> +}
> >> +
> >> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return (unsigned long)pfn_to_virt(
> >> +                p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> >> +
> >> +    return pud_page_vaddr((pud_t) { p4d_val(p4d) });
> >> +}
> >> +
> >> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> >> +
> >> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> >> +{
> >> +    if (pgtable_l4_enabled)
> >> +        return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> >> +
> >> +    return (pud_t *)p4d;
> >> +}
> >> +
> >>  #endif /* _ASM_RISCV_PGTABLE_64_H */
> >> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >> index dce401eed1d3..06361db3f486 100644
> >> --- a/arch/riscv/include/asm/pgtable.h
> >> +++ b/arch/riscv/include/asm/pgtable.h
> >> @@ -13,8 +13,7 @@
> >>
> >>  #ifndef __ASSEMBLY__
> >>
> >> -/* Page Upper Directory not used in RISC-V */
> >> -#include <asm-generic/pgtable-nopud.h>
> >> +#include <asm-generic/pgtable-nop4d.h>
> >>  #include <asm/page.h>
> >>  #include <asm/tlbflush.h>
> >>  #include <linux/mm_types.h>
> >> @@ -27,7 +26,7 @@
> >>
> >>  #ifdef CONFIG_MMU
> >>  #ifdef CONFIG_64BIT
> >> -#define VA_BITS        39
> >> +#define VA_BITS        (pgtable_l4_enabled ? 48 : 39)
> >>  #define PA_BITS        56
> >>  #else
> >>  #define VA_BITS        32
> >> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >> index 1c2fbefb8786..22617bd7477f 100644
> >> --- a/arch/riscv/kernel/head.S
> >> +++ b/arch/riscv/kernel/head.S
> >> @@ -113,6 +113,8 @@ clear_bss_done:
> >>      call setup_vm
> >>  #ifdef CONFIG_MMU
> >>      la a0, early_pg_dir
> >> +    la a1, satp_mode
> >> +    REG_L a1, (a1)
> >>      call relocate
> >>  #endif /* CONFIG_MMU */
> >>
> >> @@ -131,24 +133,28 @@ clear_bss_done:
> >>  #ifdef CONFIG_MMU
> >>  relocate:
> >>  #ifdef CONFIG_RELOCATABLE
> >> -    /* Relocate return address */
> >> -    la a1, kernel_virt_addr
> >> -    REG_L a1, 0(a1)
> >> +    /*
> >> +     * Relocate return address but save it in case 4-level page table is
> >> +     * not supported.
> >> +     */
> >> +    mv s1, ra
> >> +    la a3, kernel_virt_addr
> >> +    REG_L a3, 0(a3)
> >>  #else
> >> -    li a1, PAGE_OFFSET
> >> +    li a3, PAGE_OFFSET
> >>  #endif
> >>      la a2, _start
> >> -    sub a1, a1, a2
> >> -    add ra, ra, a1
> >> +    sub a3, a3, a2
> >> +    add ra, ra, a3
> >>
> >>      /* Point stvec to virtual address of instruction after satp write */
> >>      la a2, 1f
> >> -    add a2, a2, a1
> >> +    add a2, a2, a3
> >>      csrw CSR_TVEC, a2
> >>
> >> +    /* First try with a 4-level page table */
> >>      /* Compute satp for kernel page tables, but don't load it yet */
> >>      srl a2, a0, PAGE_SHIFT
> >> -    li a1, SATP_MODE
> >>      or a2, a2, a1
> >>
> >>      /*
> >> @@ -162,6 +168,19 @@ relocate:
> >>      or a0, a0, a1
> >>      sfence.vma
> >>      csrw CSR_SATP, a0
> >> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >> +    /*
> >> +     * If we fall through here, that means the HW does not support SV48.
> >> +     * We need a 3-level page table then simply fold pud into pgd level
> >> +     * and finally jump back to relocate with 3-level parameters.
> >> +     */
> >> +    call setup_vm_fold_pud
> >> +
> >> +    la a0, early_pg_dir
> >> +    li a1, SATP_MODE_39
> >> +    mv ra, s1
> >> +    tail relocate
> >> +#endif
> >>  .align 2
> >>  1:
> >>      /* Set trap vector to spin forever to help debug */
> >> @@ -213,6 +232,8 @@ relocate:
> >>  #ifdef CONFIG_MMU
> >>      /* Enable virtual memory and relocate to virtual address */
> >>      la a0, swapper_pg_dir
> >> +    la a1, satp_mode
> >> +    REG_L a1, (a1)
> >>      call relocate
> >>  #endif
> >>
> >> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> >> index 613ec81a8979..152b423c02ea 100644
> >> --- a/arch/riscv/mm/context.c
> >> +++ b/arch/riscv/mm/context.c
> >> @@ -9,6 +9,8 @@
> >>  #include <asm/cacheflush.h>
> >>  #include <asm/mmu_context.h>
> >>
> >> +extern uint64_t satp_mode;
> >> +
> >>  /*
> >>   * When necessary, performs a deferred icache flush for the given MM context,
> >>   * on the local CPU.  RISC-V has no direct mechanism for instruction cache
> >> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>      cpumask_set_cpu(cpu, mm_cpumask(next));
> >>
> >>  #ifdef CONFIG_MMU
> >> -    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
> >> +    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
> >>      local_flush_tlb_all();
> >>  #endif
> >>
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index 18bbb426848e..ad96667d2ab6 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -24,6 +24,17 @@
> >>
> >>  #include "../kernel/head.h"
> >>
> >> +#ifdef CONFIG_64BIT
> >> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
> >> +                SATP_MODE_39 : SATP_MODE_48;
> >> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
> >> +#else
> >> +uint64_t satp_mode = SATP_MODE_32;
> >> +bool pgtable_l4_enabled = false;
> >> +#endif
> >> +EXPORT_SYMBOL(pgtable_l4_enabled);
> >> +EXPORT_SYMBOL(satp_mode);
> >> +
> >>  unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >>                              __page_aligned_bss;
> >>  EXPORT_SYMBOL(empty_zero_page);
> >> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
> >>
> >>  #ifndef __PAGETABLE_PMD_FOLDED
> >>
> >> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>  pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>  pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >>  pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> >> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> >>
> >>  static pmd_t *__init get_pmd_virt(phys_addr_t pa)
> >>  {
> >> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >>      if (mmu_enabled)
> >>          return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>
> >> -    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >> +    /* Only one PMD is available for early mapping */
> >> +    BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
> >>
> >>      return (uintptr_t)early_pmd;
> >>  }
> >> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> >>      create_pte_mapping(ptep, va, pa, sz, prot);
> >>  }
> >>
> >> -#define pgd_next_t        pmd_t
> >> -#define alloc_pgd_next(__va)    alloc_pmd(__va)
> >> -#define get_pgd_next_virt(__pa)    get_pmd_virt(__pa)
> >> +static pud_t *__init get_pud_virt(phys_addr_t pa)
> >> +{
> >> +    if (mmu_enabled) {
> >> +        clear_fixmap(FIX_PUD);
> >> +        return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> >> +    } else {
> >> +        return (pud_t *)((uintptr_t)pa);
> >> +    }
> >> +}
> >> +
> >> +static phys_addr_t __init alloc_pud(uintptr_t va)
> >> +{
> >> +    if (mmu_enabled)
> >> +        return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >> +
> >> +    /* Only one PUD is available for early mapping */
> >> +    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >> +
> >> +    return (uintptr_t)early_pud;
> >> +}
> >> +
> >> +static void __init create_pud_mapping(pud_t *pudp,
> >> +                      uintptr_t va, phys_addr_t pa,
> >> +                      phys_addr_t sz, pgprot_t prot)
> >> +{
> >> +    pmd_t *nextp;
> >> +    phys_addr_t next_phys;
> >> +    uintptr_t pud_index = pud_index(va);
> >> +
> >> +    if (sz == PUD_SIZE) {
> >> +        if (pud_val(pudp[pud_index]) == 0)
> >> +            pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> >> +        return;
> >> +    }
> >> +
> >> +    if (pud_val(pudp[pud_index]) == 0) {
> >> +        next_phys = alloc_pmd(va);
> >> +        pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> >> +        nextp = get_pmd_virt(next_phys);
> >> +        memset(nextp, 0, PAGE_SIZE);
> >> +    } else {
> >> +        next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> >> +        nextp = get_pmd_virt(next_phys);
> >> +    }
> >> +
> >> +    create_pmd_mapping(nextp, va, pa, sz, prot);
> >> +}
> >> +
> >> +#define pgd_next_t        pud_t
> >> +#define alloc_pgd_next(__va)    alloc_pud(__va)
> >> +#define get_pgd_next_virt(__pa)    get_pud_virt(__pa)
> >>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
> >> -    create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> >> -#define fixmap_pgd_next        fixmap_pmd
> >> +    create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
> >> +#define fixmap_pgd_next        (pgtable_l4_enabled ?            \
> >> +            (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> >> +#define trampoline_pgd_next    (pgtable_l4_enabled ?            \
> >> +            (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> >>  #else
> >>  #define pgd_next_t        pte_t
> >>  #define alloc_pgd_next(__va)    alloc_pte(__va)
> >>  #define get_pgd_next_virt(__pa)    get_pte_virt(__pa)
> >>  #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
> >>      create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> >> -#define fixmap_pgd_next        fixmap_pte
> >> +#define fixmap_pgd_next        ((uintptr_t)fixmap_pte)
> >>  #endif
> >>
> >>  static void __init create_pgd_mapping(pgd_t *pgdp,
> >> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
> >>      phys_addr_t next_phys;
> >>      uintptr_t pgd_index = pgd_index(va);
> >>
> >> +#ifndef __PAGETABLE_PMD_FOLDED
> >> +    if (!pgtable_l4_enabled) {
> >> +        create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
> >> +        return;
> >> +    }
> >> +#endif
> >> +
> >>      if (sz == PGDIR_SIZE) {
> >>          if (pgd_val(pgdp[pgd_index]) == 0)
> >>              pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> >> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>
> >>      /* Setup early PGD for fixmap */
> >>      create_pgd_mapping(early_pg_dir, FIXADDR_START,
> >> -               (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >> +               fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>
> >>  #ifndef __PAGETABLE_PMD_FOLDED
> >> -    /* Setup fixmap PMD */
> >> +    /* Setup fixmap PUD and PMD */
> >> +    if (pgtable_l4_enabled)
> >> +        create_pud_mapping(fixmap_pud, FIXADDR_START,
> >> +               (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> >>      create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >>                 (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >> +
> >>      /* Setup trampoline PGD and PMD */
> >>      create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> -               (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >> +               trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >> +    if (pgtable_l4_enabled)
> >> +        create_pud_mapping(trampoline_pud, PAGE_OFFSET,
> >> +               (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> >>      create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >>                 load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >>  #else
> >> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>      dtb_early_pa = dtb_pa;
> >>  }
> >>
> >> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >> +/*
> >> + * This function is called only if the current kernel is 64bit and the HW
> >> + * does not support sv48.
> >> + */
> >> +asmlinkage __init void setup_vm_fold_pud(void)
> >> +{
> >> +    pgtable_l4_enabled = false;
> >> +    kernel_virt_addr = PAGE_OFFSET_L3;
> >> +    satp_mode = SATP_MODE_39;
> >> +
> >> +    /*
> >> +     * PTE/PMD levels do not need to be cleared as they are common between
> >> +     * 3- and 4-level page tables: the 30 least significant bits
> >> +     * (2 * 9 + 12) are common.
> >> +     */
> >> +    memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >> +    memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >> +
> >> +    setup_vm(dtb_early_pa);
> >> +}
> >> +#endif
> >> +
> >>  static void __init setup_vm_final(void)
> >>  {
> >>      uintptr_t va, map_size;
> >> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
> >>          }
> >>      }
> >>
> >> -    /* Clear fixmap PTE and PMD mappings */
> >> +    /* Clear fixmap page table mappings */
> >>      clear_fixmap(FIX_PTE);
> >>      clear_fixmap(FIX_PMD);
> >> +    clear_fixmap(FIX_PUD);
> >>
> >>      /* Move to swapper page table */
> >> -    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> >> +    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> >>      local_flush_tlb_all();
> >>  }
> >>  #else
>
> Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-04-07  5:56       ` Anup Patel
@ 2020-04-08  4:39         ` Alex Ghiti
  2020-04-08  5:06           ` Anup Patel
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Ghiti @ 2020-04-08  4:39 UTC (permalink / raw)
  To: Anup Patel
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

Hi Anup,

On 4/7/20 1:56 AM, Anup Patel wrote:
> On Tue, Apr 7, 2020 at 10:44 AM Alex Ghiti <alex@ghiti.fr> wrote:
>>
>>
>> On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
>>> On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@ghiti.fr wrote:
>>>> By adding a new 4th level of page table, give the possibility to 64bit
>>>> kernel to address 2^48 bytes of virtual address: in practice, that roughly
>>>> offers ~160TB of virtual address space to userspace and allows up to 64TB
>>>> of physical memory.
>>>>
>>>> By default, the kernel will try to boot with a 4-level page table. If the
>>>> underlying hardware does not support it, we will automatically
>>>> fallback to
>>>> a standard 3-level page table by folding the new PUD level into PGDIR
>>>> level.
>>>>
>>>> Early page table preparation is too early in the boot process to use any
>>>> device-tree entry, then in order to detect HW capabilities at runtime, we
>>>> use SATP feature that ignores writes with an unsupported mode. The current
>>>> mode used by the kernel is then made available through cpuinfo.
>>>
>>> Ya, I think that's the right way to go about this.  There's no reason to
>>> rely on duplicate DT mechanisms for things the ISA defines for us.
>>>
>>>>
>>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>>> ---
>>>>   arch/riscv/Kconfig                  |   6 +-
>>>>   arch/riscv/include/asm/csr.h        |   3 +-
>>>>   arch/riscv/include/asm/fixmap.h     |   1 +
>>>>   arch/riscv/include/asm/page.h       |  15 +++-
>>>>   arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
>>>>   arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
>>>>   arch/riscv/include/asm/pgtable.h    |   5 +-
>>>>   arch/riscv/kernel/head.S            |  37 ++++++--
>>>>   arch/riscv/mm/context.c             |   4 +-
>>>>   arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
>>>>   10 files changed, 302 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>>> index a475c78e66bc..79560e94cc7c 100644
>>>> --- a/arch/riscv/Kconfig
>>>> +++ b/arch/riscv/Kconfig
>>>> @@ -66,6 +66,7 @@ config RISCV
>>>>       select ARCH_HAS_GCOV_PROFILE_ALL
>>>>       select HAVE_COPY_THREAD_TLS
>>>>       select HAVE_ARCH_KASAN if MMU && 64BIT
>>>> +    select RELOCATABLE if 64BIT
>>>>
>>>>   config ARCH_MMAP_RND_BITS_MIN
>>>>       default 18 if 64BIT
>>>> @@ -104,7 +105,7 @@ config PAGE_OFFSET
>>>>       default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>>>>       default 0x80000000 if 64BIT && !MMU
>>>>       default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
>>>> -    default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
>>>> +    default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
>>>>
>>>>   config ARCH_FLATMEM_ENABLE
>>>>       def_bool y
>>>> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
>>>>   config FIX_EARLYCON_MEM
>>>>       def_bool MMU
>>>>
>>>> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
>>>> +# on a 3-level page table when sv48 is not supported.
>>>>   config PGTABLE_LEVELS
>>>>       int
>>>> +    default 4 if 64BIT && RELOCATABLE
>>>>       default 3 if 64BIT
>>>>       default 2
>>>
>>> I assume this means you're relying on relocation to move the kernel around
>>> independently of PAGE_OFFSET in order to fold in the missing page table
>>> level?
>>
>> Yes, relocation is needed to fall back to 3-level and move PAGE_OFFSET
>> accordingly.
>>
>>> That seems reasonable, but it does impose a performance penalty, as
>>> relocatable kernels necessitate slower generated code.  Additionally,
>>> there will likely be a performance penalty due to the extra memory access
>>> on TLB misses, which is unnecessary for workloads that don't need the
>>> longer VA width on machines that support it.
>>
>> Sorry, I had no time to answer your previous mail regarding performance:
>> I have no numbers. But the only penalty this patchset imposes on the
>> 3-level page table is the check in the page table management functions to
>> know whether 4-level is activated or not. And, as you said, there is the
>> extra cost of the relocatable kernel, which I had ignored since it is
>> necessary anyway.
> 
> I guess we don't need relocation if we can avoid page table folding by
> detecting Sv48 mode very early in setup_vm(). Is there any other place
> where relocation would be required?

Folding the 4th level is only part of the problem: we also have to 
dynamically change the virtual address of the kernel. How can we achieve 
that without relocations?

KASLR also uses relocations; see Zong's recent patchset.

Thanks,

Alex

> 
> If we can totally avoid relocation then it will certainly help in performance.
> 
> Regards,
> Anup
> 
>>
>>>
>>> I think the best bet here would be to have a Kconfig option for the
>>> number of page table levels (which could be MAXPHYSMEM or a second
>>> partially free parameter) and then another boolean argument along the
>>> lines of "also support machines with smaller VA widths".  It seems best
>>> to turn on the largest VA width and support for folding by default, as I
>>> assume that's what distros would do.
>>
>> I'm not a big fan of a new Kconfig option to allow people to have a
>> 3-level page table, because that implies maintaining a new kernel: even
>> for us, having to compile 2 kernels each time we change something in the
>> mm code would be painful.
>>
>> I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to
>> find out the reserved regions in order not to overwrite one of them when
>> copying the kernel to its new destination. And after that, he loops back
>> to setup_vm to re-create the mapping to the new kernel.
>> If that's the path we take for KASLR, we can follow the same approach here:
>> boot with 4-level by default, go to check what is wanted in the device
>> tree and if it is 3-level, loop back to setup_vm.
>>
>>>
>>> I didn't really look closely at the rest of this, but it generally
>>> smells OK.
>>> The diff will need to be somewhat different for the next version, anyway :)
>>>
>>> Thanks for doing this!
>>>
>>>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
>>>> index 435b65532e29..3828d55af85e 100644
>>>> --- a/arch/riscv/include/asm/csr.h
>>>> +++ b/arch/riscv/include/asm/csr.h
>>>> @@ -40,11 +40,10 @@
>>>>   #ifndef CONFIG_64BIT
>>>>   #define SATP_PPN    _AC(0x003FFFFF, UL)
>>>>   #define SATP_MODE_32    _AC(0x80000000, UL)
>>>> -#define SATP_MODE    SATP_MODE_32
>>>>   #else
>>>>   #define SATP_PPN    _AC(0x00000FFFFFFFFFFF, UL)
>>>>   #define SATP_MODE_39    _AC(0x8000000000000000, UL)
>>>> -#define SATP_MODE    SATP_MODE_39
>>>> +#define SATP_MODE_48    _AC(0x9000000000000000, UL)
>>>>   #endif
>>>>
>>>>   /* Exception cause high bit - is an interrupt if set */
>>>> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
>>>> index 42d2c42f3cc9..26e7799c5675 100644
>>>> --- a/arch/riscv/include/asm/fixmap.h
>>>> +++ b/arch/riscv/include/asm/fixmap.h
>>>> @@ -27,6 +27,7 @@ enum fixed_addresses {
>>>>       FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
>>>>       FIX_PTE,
>>>>       FIX_PMD,
>>>> +    FIX_PUD,
>>>>       FIX_EARLYCON_MEM_BASE,
>>>>       __end_of_fixed_addresses
>>>>   };
>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>> index 691f2f9ded2f..f1a26a0690ef 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -32,11 +32,19 @@
>>>>    * physical memory (aligned on a page boundary).
>>>>    */
>>>>   #ifdef CONFIG_RELOCATABLE
>>>> -extern unsigned long kernel_virt_addr;
>>>>   #define PAGE_OFFSET        kernel_virt_addr
>>>> +
>>>> +#ifdef CONFIG_64BIT
>>>> +/*
>>>> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
>>>> + * define the PAGE_OFFSET value for SV39.
>>>> + */
>>>> +#define PAGE_OFFSET_L3        0xffffffe000000000
>>>> +#define PAGE_OFFSET_L4        _AC(CONFIG_PAGE_OFFSET, UL)
>>>> +#endif /* CONFIG_64BIT */
>>>>   #else
>>>>   #define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
>>>> -#endif
>>>> +#endif /* CONFIG_RELOCATABLE */
>>>>
>>>>   #define KERN_VIRT_SIZE        -PAGE_OFFSET
>>>>
>>>> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
>>>>
>>>>   extern unsigned long max_low_pfn;
>>>>   extern unsigned long min_low_pfn;
>>>> +#ifdef CONFIG_RELOCATABLE
>>>> +extern unsigned long kernel_virt_addr;
>>>> +#endif
>>>>
>>>>   #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + va_pa_offset))
>>>>   #define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
>>>> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
>>>> index 3f601ee8233f..540eaa5a8658 100644
>>>> --- a/arch/riscv/include/asm/pgalloc.h
>>>> +++ b/arch/riscv/include/asm/pgalloc.h
>>>> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>>>>
>>>>       set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>>>   }
>>>> +
>>>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
>>>> +{
>>>> +    if (pgtable_l4_enabled) {
>>>> +        unsigned long pfn = virt_to_pfn(pud);
>>>> +
>>>> +        set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>>> +    }
>>>> +}
>>>> +
>>>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
>>>> +                     pud_t *pud)
>>>> +{
>>>> +    if (pgtable_l4_enabled) {
>>>> +        unsigned long pfn = virt_to_pfn(pud);
>>>> +
>>>> +        set_p4d_safe(p4d,
>>>> +                 __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>>> +    }
>>>> +}
>>>> +
>>>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return (pud_t *)__get_free_page(
>>>> +                GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
>>>> +    return NULL;
>>>> +}
>>>> +
>>>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        free_page((unsigned long)pud);
>>>> +}
>>>> +
>>>> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
>>>>   #endif /* __PAGETABLE_PMD_FOLDED */
>>>>
>>>>   #define pmd_pgtable(pmd)    pmd_page(pmd)
>>>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
>>>> index b15f70a1fdfa..cc4ffbe778f3 100644
>>>> --- a/arch/riscv/include/asm/pgtable-64.h
>>>> +++ b/arch/riscv/include/asm/pgtable-64.h
>>>> @@ -8,16 +8,32 @@
>>>>
>>>>   #include <linux/const.h>
>>>>
>>>> -#define PGDIR_SHIFT     30
>>>> +extern bool pgtable_l4_enabled;
>>>> +
>>>> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
>>>>   /* Size of region mapped by a page global directory */
>>>>   #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>>>>   #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>>>>
>>>> +/* pud is folded into pgd in case of 3-level page table */
>>>> +#define PUD_SHIFT    30
>>>> +#define PUD_SIZE    (_AC(1, UL) << PUD_SHIFT)
>>>> +#define PUD_MASK    (~(PUD_SIZE - 1))
>>>> +
>>>>   #define PMD_SHIFT       21
>>>>   /* Size of region mapped by a page middle directory */
>>>>   #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>>>>   #define PMD_MASK        (~(PMD_SIZE - 1))
>>>>
>>>> +/* Page Upper Directory entry */
>>>> +typedef struct {
>>>> +    unsigned long pud;
>>>> +} pud_t;
>>>> +
>>>> +#define pud_val(x)      ((x).pud)
>>>> +#define __pud(x)        ((pud_t) { (x) })
>>>> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
>>>> +
>>>>   /* Page Middle Directory entry */
>>>>   typedef struct {
>>>>       unsigned long pmd;
>>>> @@ -25,7 +41,6 @@ typedef struct {
>>>>
>>>>   #define pmd_val(x)      ((x).pmd)
>>>>   #define __pmd(x)        ((pmd_t) { (x) })
>>>> -
>>>>   #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
>>>>
>>>>   static inline int pud_present(pud_t pud)
>>>> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
>>>>       set_pud(pudp, __pud(0));
>>>>   }
>>>>
>>>> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
>>>> +{
>>>> +    return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
>>>> +}
>>>> +
>>>> +static inline unsigned long _pud_pfn(pud_t pud)
>>>> +{
>>>> +    return pud_val(pud) >> _PAGE_PFN_SHIFT;
>>>> +}
>>>> +
>>>>   static inline unsigned long pud_page_vaddr(pud_t pud)
>>>>   {
>>>>       return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
>>>> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
>>>>       return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>>>>   }
>>>>
>>>> +#define mm_pud_folded    mm_pud_folded
>>>> +static inline bool mm_pud_folded(struct mm_struct *mm)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return false;
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>>   #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>>>>
>>>>   static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
>>>> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>>>>   #define pmd_ERROR(e) \
>>>>       pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>>>>
>>>> +#define pud_ERROR(e)    \
>>>> +    pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
>>>> +
>>>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        *p4dp = p4d;
>>>> +    else
>>>> +        set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
>>>> +}
>>>> +
>>>> +static inline int p4d_none(p4d_t p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return (p4d_val(p4d) == 0);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static inline int p4d_present(p4d_t p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return (p4d_val(p4d) & _PAGE_PRESENT);
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +static inline int p4d_bad(p4d_t p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return !p4d_present(p4d);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static inline void p4d_clear(p4d_t *p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        set_p4d(p4d, __p4d(0));
>>>> +}
>>>> +
>>>> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return (unsigned long)pfn_to_virt(
>>>> +                p4d_val(p4d) >> _PAGE_PFN_SHIFT);
>>>> +
>>>> +    return pud_page_vaddr((pud_t) { p4d_val(p4d) });
>>>> +}
>>>> +
>>>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>>>> +
>>>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
>>>> +{
>>>> +    if (pgtable_l4_enabled)
>>>> +        return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
>>>> +
>>>> +    return (pud_t *)p4d;
>>>> +}
>>>> +
>>>>   #endif /* _ASM_RISCV_PGTABLE_64_H */
>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>> index dce401eed1d3..06361db3f486 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -13,8 +13,7 @@
>>>>
>>>>   #ifndef __ASSEMBLY__
>>>>
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> +#include <asm-generic/pgtable-nop4d.h>
>>>>   #include <asm/page.h>
>>>>   #include <asm/tlbflush.h>
>>>>   #include <linux/mm_types.h>
>>>> @@ -27,7 +26,7 @@
>>>>
>>>>   #ifdef CONFIG_MMU
>>>>   #ifdef CONFIG_64BIT
>>>> -#define VA_BITS        39
>>>> +#define VA_BITS        (pgtable_l4_enabled ? 48 : 39)
>>>>   #define PA_BITS        56
>>>>   #else
>>>>   #define VA_BITS        32
>>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>>>> index 1c2fbefb8786..22617bd7477f 100644
>>>> --- a/arch/riscv/kernel/head.S
>>>> +++ b/arch/riscv/kernel/head.S
>>>> @@ -113,6 +113,8 @@ clear_bss_done:
>>>>       call setup_vm
>>>>   #ifdef CONFIG_MMU
>>>>       la a0, early_pg_dir
>>>> +    la a1, satp_mode
>>>> +    REG_L a1, (a1)
>>>>       call relocate
>>>>   #endif /* CONFIG_MMU */
>>>>
>>>> @@ -131,24 +133,28 @@ clear_bss_done:
>>>>   #ifdef CONFIG_MMU
>>>>   relocate:
>>>>   #ifdef CONFIG_RELOCATABLE
>>>> -    /* Relocate return address */
>>>> -    la a1, kernel_virt_addr
>>>> -    REG_L a1, 0(a1)
>>>> +    /*
>>>> +     * Relocate return address but save it in case 4-level page table is
>>>> +     * not supported.
>>>> +     */
>>>> +    mv s1, ra
>>>> +    la a3, kernel_virt_addr
>>>> +    REG_L a3, 0(a3)
>>>>   #else
>>>> -    li a1, PAGE_OFFSET
>>>> +    li a3, PAGE_OFFSET
>>>>   #endif
>>>>       la a2, _start
>>>> -    sub a1, a1, a2
>>>> -    add ra, ra, a1
>>>> +    sub a3, a3, a2
>>>> +    add ra, ra, a3
>>>>
>>>>       /* Point stvec to virtual address of instruction after satp write */
>>>>       la a2, 1f
>>>> -    add a2, a2, a1
>>>> +    add a2, a2, a3
>>>>       csrw CSR_TVEC, a2
>>>>
>>>> +    /* First try with a 4-level page table */
>>>>       /* Compute satp for kernel page tables, but don't load it yet */
>>>>       srl a2, a0, PAGE_SHIFT
>>>> -    li a1, SATP_MODE
>>>>       or a2, a2, a1
>>>>
>>>>       /*
>>>> @@ -162,6 +168,19 @@ relocate:
>>>>       or a0, a0, a1
>>>>       sfence.vma
>>>>       csrw CSR_SATP, a0
>>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>>>> +    /*
>>>> +     * If we fall through here, that means the HW does not support SV48.
>>>> +     * We need a 3-level page table then simply fold pud into pgd level
>>>> +     * and finally jump back to relocate with 3-level parameters.
>>>> +     */
>>>> +    call setup_vm_fold_pud
>>>> +
>>>> +    la a0, early_pg_dir
>>>> +    li a1, SATP_MODE_39
>>>> +    mv ra, s1
>>>> +    tail relocate
>>>> +#endif
>>>>   .align 2
>>>>   1:
>>>>       /* Set trap vector to spin forever to help debug */
>>>> @@ -213,6 +232,8 @@ relocate:
>>>>   #ifdef CONFIG_MMU
>>>>       /* Enable virtual memory and relocate to virtual address */
>>>>       la a0, swapper_pg_dir
>>>> +    la a1, satp_mode
>>>> +    REG_L a1, (a1)
>>>>       call relocate
>>>>   #endif
>>>>
>>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>>>> index 613ec81a8979..152b423c02ea 100644
>>>> --- a/arch/riscv/mm/context.c
>>>> +++ b/arch/riscv/mm/context.c
>>>> @@ -9,6 +9,8 @@
>>>>   #include <asm/cacheflush.h>
>>>>   #include <asm/mmu_context.h>
>>>>
>>>> +extern uint64_t satp_mode;
>>>> +
>>>>   /*
>>>>    * When necessary, performs a deferred icache flush for the given MM context,
>>>>    * on the local CPU.  RISC-V has no direct mechanism for instruction cache
>>>> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>>>       cpumask_set_cpu(cpu, mm_cpumask(next));
>>>>
>>>>   #ifdef CONFIG_MMU
>>>> -    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
>>>> +    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
>>>>       local_flush_tlb_all();
>>>>   #endif
>>>>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index 18bbb426848e..ad96667d2ab6 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -24,6 +24,17 @@
>>>>
>>>>   #include "../kernel/head.h"
>>>>
>>>> +#ifdef CONFIG_64BIT
>>>> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
>>>> +                SATP_MODE_39 : SATP_MODE_48;
>>>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
>>>> +#else
>>>> +uint64_t satp_mode = SATP_MODE_32;
>>>> +bool pgtable_l4_enabled = false;
>>>> +#endif
>>>> +EXPORT_SYMBOL(pgtable_l4_enabled);
>>>> +EXPORT_SYMBOL(satp_mode);
>>>> +
>>>>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>>>>                               __page_aligned_bss;
>>>>   EXPORT_SYMBOL(empty_zero_page);
>>>> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
>>>>
>>>>   #ifndef __PAGETABLE_PMD_FOLDED
>>>>
>>>> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>>>>   pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
>>>> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>>>>   pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
>>>>   pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>>>> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>>>>
>>>>   static pmd_t *__init get_pmd_virt(phys_addr_t pa)
>>>>   {
>>>> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>>>>       if (mmu_enabled)
>>>>           return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>>
>>>> -    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>>>> +    /* Only one PMD is available for early mapping */
>>>> +    BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
>>>>
>>>>       return (uintptr_t)early_pmd;
>>>>   }
>>>> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>>>>       create_pte_mapping(ptep, va, pa, sz, prot);
>>>>   }
>>>>
>>>> -#define pgd_next_t        pmd_t
>>>> -#define alloc_pgd_next(__va)    alloc_pmd(__va)
>>>> -#define get_pgd_next_virt(__pa)    get_pmd_virt(__pa)
>>>> +static pud_t *__init get_pud_virt(phys_addr_t pa)
>>>> +{
>>>> +    if (mmu_enabled) {
>>>> +        clear_fixmap(FIX_PUD);
>>>> +        return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>>>> +    } else {
>>>> +        return (pud_t *)((uintptr_t)pa);
>>>> +    }
>>>> +}
>>>> +
>>>> +static phys_addr_t __init alloc_pud(uintptr_t va)
>>>> +{
>>>> +    if (mmu_enabled)
>>>> +        return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>>> +
>>>> +    /* Only one PUD is available for early mapping */
>>>> +    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
>>>> +
>>>> +    return (uintptr_t)early_pud;
>>>> +}
>>>> +
>>>> +static void __init create_pud_mapping(pud_t *pudp,
>>>> +                      uintptr_t va, phys_addr_t pa,
>>>> +                      phys_addr_t sz, pgprot_t prot)
>>>> +{
>>>> +    pmd_t *nextp;
>>>> +    phys_addr_t next_phys;
>>>> +    uintptr_t pud_index = pud_index(va);
>>>> +
>>>> +    if (sz == PUD_SIZE) {
>>>> +        if (pud_val(pudp[pud_index]) == 0)
>>>> +            pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (pud_val(pudp[pud_index]) == 0) {
>>>> +        next_phys = alloc_pmd(va);
>>>> +        pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
>>>> +        nextp = get_pmd_virt(next_phys);
>>>> +        memset(nextp, 0, PAGE_SIZE);
>>>> +    } else {
>>>> +        next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
>>>> +        nextp = get_pmd_virt(next_phys);
>>>> +    }
>>>> +
>>>> +    create_pmd_mapping(nextp, va, pa, sz, prot);
>>>> +}
>>>> +
>>>> +#define pgd_next_t        pud_t
>>>> +#define alloc_pgd_next(__va)    alloc_pud(__va)
>>>> +#define get_pgd_next_virt(__pa)    get_pud_virt(__pa)
>>>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>>>> -    create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
>>>> -#define fixmap_pgd_next        fixmap_pmd
>>>> +    create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
>>>> +#define fixmap_pgd_next        (pgtable_l4_enabled ?            \
>>>> +            (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
>>>> +#define trampoline_pgd_next    (pgtable_l4_enabled ?            \
>>>> +            (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>>>>   #else
>>>>   #define pgd_next_t        pte_t
>>>>   #define alloc_pgd_next(__va)    alloc_pte(__va)
>>>>   #define get_pgd_next_virt(__pa)    get_pte_virt(__pa)
>>>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>>>>       create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
>>>> -#define fixmap_pgd_next        fixmap_pte
>>>> +#define fixmap_pgd_next        ((uintptr_t)fixmap_pte)
>>>>   #endif
>>>>
>>>>   static void __init create_pgd_mapping(pgd_t *pgdp,
>>>> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>>>>       phys_addr_t next_phys;
>>>>       uintptr_t pgd_index = pgd_index(va);
>>>>
>>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>>> +    if (!pgtable_l4_enabled) {
>>>> +        create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
>>>> +        return;
>>>> +    }
>>>> +#endif
>>>> +
>>>>       if (sz == PGDIR_SIZE) {
>>>>           if (pgd_val(pgdp[pgd_index]) == 0)
>>>>               pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
>>>> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>
>>>>       /* Setup early PGD for fixmap */
>>>>       create_pgd_mapping(early_pg_dir, FIXADDR_START,
>>>> -               (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>>> +               fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>>>
>>>>   #ifndef __PAGETABLE_PMD_FOLDED
>>>> -    /* Setup fixmap PMD */
>>>> +    /* Setup fixmap PUD and PMD */
>>>> +    if (pgtable_l4_enabled)
>>>> +        create_pud_mapping(fixmap_pud, FIXADDR_START,
>>>> +               (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>>>>       create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>>>                  (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>>> +
>>>>       /* Setup trampoline PGD and PMD */
>>>>       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>>>> -               (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>>>> +               trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>>> +    if (pgtable_l4_enabled)
>>>> +        create_pud_mapping(trampoline_pud, PAGE_OFFSET,
>>>> +               (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>>>>       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>>>>                  load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>>>>   #else
>>>> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>       dtb_early_pa = dtb_pa;
>>>>   }
>>>>
>>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
>>>> +/*
>>>> + * This function is called only if the current kernel is 64bit and the HW
>>>> + * does not support sv48.
>>>> + */
>>>> +asmlinkage __init void setup_vm_fold_pud(void)
>>>> +{
>>>> +    pgtable_l4_enabled = false;
>>>> +    kernel_virt_addr = PAGE_OFFSET_L3;
>>>> +    satp_mode = SATP_MODE_39;
>>>> +
>>>> +    /*
>>>> +     * PTE/PMD levels do not need to be cleared as they are common between
>>>> +     * 3- and 4-level page tables: the 30 least significant bits
>>>> +     * (2 * 9 + 12) are common.
>>>> +     */
>>>> +    memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>>>> +    memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
>>>> +
>>>> +    setup_vm(dtb_early_pa);
>>>> +}
>>>> +#endif
>>>> +
>>>>   static void __init setup_vm_final(void)
>>>>   {
>>>>       uintptr_t va, map_size;
>>>> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
>>>>           }
>>>>       }
>>>>
>>>> -    /* Clear fixmap PTE and PMD mappings */
>>>> +    /* Clear fixmap page table mappings */
>>>>       clear_fixmap(FIX_PTE);
>>>>       clear_fixmap(FIX_PMD);
>>>> +    clear_fixmap(FIX_PUD);
>>>>
>>>>       /* Move to swapper page table */
>>>> -    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
>>>> +    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
>>>>       local_flush_tlb_all();
>>>>   }
>>>>   #else
>>
>> Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/7] riscv: Implement sv48 support
  2020-04-08  4:39         ` Alex Ghiti
@ 2020-04-08  5:06           ` Anup Patel
  0 siblings, 0 replies; 35+ messages in thread
From: Anup Patel @ 2020-04-08  5:06 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: linux-kernel@vger.kernel.org List, Palmer Dabbelt, Zong Li,
	Paul Walmsley, linux-riscv, Christoph Hellwig

On Wed, Apr 8, 2020 at 10:09 AM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Anup,
>
> On 4/7/20 1:56 AM, Anup Patel wrote:
> > On Tue, Apr 7, 2020 at 10:44 AM Alex Ghiti <alex@ghiti.fr> wrote:
> >>
> >>
> >> On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> >>> On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@ghiti.fr wrote:
> >>>> By adding a new 4th level of page table, give the 64bit kernel the
> >>>> possibility to address 2^48 bytes of virtual address space: in practice,
> >>>> that roughly offers ~160TB of virtual address space to userspace and
> >>>> allows up to 64TB of physical memory.
> >>>>
> >>>> By default, the kernel will try to boot with a 4-level page table. If the
> >>>> underlying hardware does not support it, we will automatically fall back
> >>>> to a standard 3-level page table by folding the new PUD level into the
> >>>> PGDIR level.
> >>>>
> >>>> Early page table preparation happens too early in the boot process to use
> >>>> any device-tree entry, so in order to detect HW capabilities at runtime,
> >>>> we use the SATP feature that ignores writes with an unsupported mode. The
> >>>> current mode used by the kernel is then made available through cpuinfo.
> >>>
> >>> Ya, I think that's the right way to go about this.  There's no reason to
> >>> rely on duplicate DT mechanisms for things the ISA defines for us.
> >>>
> >>>>
> >>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> >>>> ---
> >>>>   arch/riscv/Kconfig                  |   6 +-
> >>>>   arch/riscv/include/asm/csr.h        |   3 +-
> >>>>   arch/riscv/include/asm/fixmap.h     |   1 +
> >>>>   arch/riscv/include/asm/page.h       |  15 +++-
> >>>>   arch/riscv/include/asm/pgalloc.h    |  36 ++++++++
> >>>>   arch/riscv/include/asm/pgtable-64.h |  98 ++++++++++++++++++++-
> >>>>   arch/riscv/include/asm/pgtable.h    |   5 +-
> >>>>   arch/riscv/kernel/head.S            |  37 ++++++--
> >>>>   arch/riscv/mm/context.c             |   4 +-
> >>>>   arch/riscv/mm/init.c                | 128 +++++++++++++++++++++++++---
> >>>>   10 files changed, 302 insertions(+), 31 deletions(-)
> >>>>
> >>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >>>> index a475c78e66bc..79560e94cc7c 100644
> >>>> --- a/arch/riscv/Kconfig
> >>>> +++ b/arch/riscv/Kconfig
> >>>> @@ -66,6 +66,7 @@ config RISCV
> >>>>       select ARCH_HAS_GCOV_PROFILE_ALL
> >>>>       select HAVE_COPY_THREAD_TLS
> >>>>       select HAVE_ARCH_KASAN if MMU && 64BIT
> >>>> +    select RELOCATABLE if 64BIT
> >>>>
> >>>>   config ARCH_MMAP_RND_BITS_MIN
> >>>>       default 18 if 64BIT
> >>>> @@ -104,7 +105,7 @@ config PAGE_OFFSET
> >>>>       default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> >>>>       default 0x80000000 if 64BIT && !MMU
> >>>>       default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> >>>> -    default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
> >>>> +    default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
> >>>>
> >>>>   config ARCH_FLATMEM_ENABLE
> >>>>       def_bool y
> >>>> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
> >>>>   config FIX_EARLYCON_MEM
> >>>>       def_bool MMU
> >>>>
> >>>> +# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
> >>>> +# on a 3-level page table when sv48 is not supported.
> >>>>   config PGTABLE_LEVELS
> >>>>       int
> >>>> +    default 4 if 64BIT && RELOCATABLE
> >>>>       default 3 if 64BIT
> >>>>       default 2
> >>>
> >>> I assume this means you're relying on relocation to move the kernel around
> >>> independently of PAGE_OFFSET in order to fold in the missing page table
> >>> level?
> >>
> >> Yes, relocation is needed to fall back to 3-level and move PAGE_OFFSET
> >> accordingly.
> >>
> >>> That seems reasonable, but it does impose a performance penalty, as
> >>> relocatable kernels necessitate slower generated code.  Additionally,
> >>> there will likely be a performance penalty due to the extra memory access
> >>> on TLB misses, which is unnecessary for workloads that don't need the
> >>> longer VA width on machines that support it.
> >>
> >> Sorry, I had no time to answer your previous mail regarding performance:
> >> I have no numbers. But the only penalty this patchset imposes on the
> >> 3-level page table is the check in the page table management functions to
> >> know whether 4-level is activated or not. And, as you said, there is the
> >> extra cost of the relocatable kernel, which I had ignored since it is
> >> necessary anyway.
> >
> > I guess we don't need relocation if we can avoid page table folding by
> > detecting Sv48 mode very early in setup_vm(). Is there any other place
> > where relocation would be required?
>
> Folding the 4th level is only part of the problem: we also have to
> dynamically change the virtual address of the kernel. How can we achieve
> that without relocations?
>
> KASLR also uses relocations; see Zong's recent patchset.

Good to know that relocation is not just for page table folding.

Thanks,
Anup

>
> Thanks,
>
> Alex
>
> >
> > If we can totally avoid relocation then it will certainly help in performance.
> >
> > Regards,
> > Anup
> >
> >>
> >>>
> >>> I think the best bet here would be to have a Kconfig option for the
> >>> number of page table levels (which could be MAXPHYSMEM or a second
> >>> partially free parameter) and then another boolean argument along the
> >>> lines of "also support machines with smaller VA widths".  It seems best
> >>> to turn on the largest VA width and support for folding by default, as I
> >>> assume that's what distros would do.
> >>
> >> I'm not a big fan of a new Kconfig option to allow people to have a
> >> 3-level page table, because that implies maintaining a new kernel: even
> >> for us, having to compile 2 kernels each time we change something in the
> >> mm code would be painful.
> >>
> >> I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to
> >> find out the reserved regions in order to not override one of them when
> >> copying the kernel to its new destination. And after that, he loops back
> >> to setup_vm to re-create the mapping to the new kernel.
> >> If that's the path we take for KASLR, we can follow the same approach here:
> >> boot with 4-level by default, go to check what is wanted in the device
> >> tree and if it is 3-level, loop back to setup_vm.
> >>
> >>>
> >>> I didn't really look closely at the rest of this, but it generally
> >>> smells OK.
> >>> The diff will need to be somewhat different for the next version, anyway :)
> >>>
> >>> Thanks for doing this!
> >>>
> >>>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> >>>> index 435b65532e29..3828d55af85e 100644
> >>>> --- a/arch/riscv/include/asm/csr.h
> >>>> +++ b/arch/riscv/include/asm/csr.h
> >>>> @@ -40,11 +40,10 @@
> >>>>   #ifndef CONFIG_64BIT
> >>>>   #define SATP_PPN    _AC(0x003FFFFF, UL)
> >>>>   #define SATP_MODE_32    _AC(0x80000000, UL)
> >>>> -#define SATP_MODE    SATP_MODE_32
> >>>>   #else
> >>>>   #define SATP_PPN    _AC(0x00000FFFFFFFFFFF, UL)
> >>>>   #define SATP_MODE_39    _AC(0x8000000000000000, UL)
> >>>> -#define SATP_MODE    SATP_MODE_39
> >>>> +#define SATP_MODE_48    _AC(0x9000000000000000, UL)
> >>>>   #endif
> >>>>
> >>>>   /* Exception cause high bit - is an interrupt if set */
> >>>> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> >>>> index 42d2c42f3cc9..26e7799c5675 100644
> >>>> --- a/arch/riscv/include/asm/fixmap.h
> >>>> +++ b/arch/riscv/include/asm/fixmap.h
> >>>> @@ -27,6 +27,7 @@ enum fixed_addresses {
> >>>>       FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
> >>>>       FIX_PTE,
> >>>>       FIX_PMD,
> >>>> +    FIX_PUD,
> >>>>       FIX_EARLYCON_MEM_BASE,
> >>>>       __end_of_fixed_addresses
> >>>>   };
> >>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >>>> index 691f2f9ded2f..f1a26a0690ef 100644
> >>>> --- a/arch/riscv/include/asm/page.h
> >>>> +++ b/arch/riscv/include/asm/page.h
> >>>> @@ -32,11 +32,19 @@
> >>>>    * physical memory (aligned on a page boundary).
> >>>>    */
> >>>>   #ifdef CONFIG_RELOCATABLE
> >>>> -extern unsigned long kernel_virt_addr;
> >>>>   #define PAGE_OFFSET        kernel_virt_addr
> >>>> +
> >>>> +#ifdef CONFIG_64BIT
> >>>> +/*
> >>>> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> >>>> + * define the PAGE_OFFSET value for SV39.
> >>>> + */
> >>>> +#define PAGE_OFFSET_L3        0xffffffe000000000
> >>>> +#define PAGE_OFFSET_L4        _AC(CONFIG_PAGE_OFFSET, UL)
> >>>> +#endif /* CONFIG_64BIT */
> >>>>   #else
> >>>>   #define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
> >>>> -#endif
> >>>> +#endif /* CONFIG_RELOCATABLE */
> >>>>
> >>>>   #define KERN_VIRT_SIZE        -PAGE_OFFSET
> >>>>
> >>>> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
> >>>>
> >>>>   extern unsigned long max_low_pfn;
> >>>>   extern unsigned long min_low_pfn;
> >>>> +#ifdef CONFIG_RELOCATABLE
> >>>> +extern unsigned long kernel_virt_addr;
> >>>> +#endif
> >>>>
> >>>>   #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + va_pa_offset))
> >>>>   #define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
> >>>> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> >>>> index 3f601ee8233f..540eaa5a8658 100644
> >>>> --- a/arch/riscv/include/asm/pgalloc.h
> >>>> +++ b/arch/riscv/include/asm/pgalloc.h
> >>>> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
> >>>>
> >>>>       set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>>   }
> >>>> +
> >>>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled) {
> >>>> +        unsigned long pfn = virt_to_pfn(pud);
> >>>> +
> >>>> +        set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> >>>> +                     pud_t *pud)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled) {
> >>>> +        unsigned long pfn = virt_to_pfn(pud);
> >>>> +
> >>>> +        set_p4d_safe(p4d,
> >>>> +                 __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return (pud_t *)__get_free_page(
> >>>> +                GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
> >>>> +    return NULL;
> >>>> +}
> >>>> +
> >>>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        free_page((unsigned long)pud);
> >>>> +}
> >>>> +
> >>>> +#define __pud_free_tlb(tlb, pud, addr)  pud_free((tlb)->mm, pud)
> >>>>   #endif /* __PAGETABLE_PMD_FOLDED */
> >>>>
> >>>>   #define pmd_pgtable(pmd)    pmd_page(pmd)
> >>>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> >>>> index b15f70a1fdfa..cc4ffbe778f3 100644
> >>>> --- a/arch/riscv/include/asm/pgtable-64.h
> >>>> +++ b/arch/riscv/include/asm/pgtable-64.h
> >>>> @@ -8,16 +8,32 @@
> >>>>
> >>>>   #include <linux/const.h>
> >>>>
> >>>> -#define PGDIR_SHIFT     30
> >>>> +extern bool pgtable_l4_enabled;
> >>>> +
> >>>> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? 39 : 30)
> >>>>   /* Size of region mapped by a page global directory */
> >>>>   #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
> >>>>   #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
> >>>>
> >>>> +/* pud is folded into pgd in case of 3-level page table */
> >>>> +#define PUD_SHIFT    30
> >>>> +#define PUD_SIZE    (_AC(1, UL) << PUD_SHIFT)
> >>>> +#define PUD_MASK    (~(PUD_SIZE - 1))
> >>>> +
> >>>>   #define PMD_SHIFT       21
> >>>>   /* Size of region mapped by a page middle directory */
> >>>>   #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
> >>>>   #define PMD_MASK        (~(PMD_SIZE - 1))
> >>>>
> >>>> +/* Page Upper Directory entry */
> >>>> +typedef struct {
> >>>> +    unsigned long pud;
> >>>> +} pud_t;
> >>>> +
> >>>> +#define pud_val(x)      ((x).pud)
> >>>> +#define __pud(x)        ((pud_t) { (x) })
> >>>> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
> >>>> +
> >>>>   /* Page Middle Directory entry */
> >>>>   typedef struct {
> >>>>       unsigned long pmd;
> >>>> @@ -25,7 +41,6 @@ typedef struct {
> >>>>
> >>>>   #define pmd_val(x)      ((x).pmd)
> >>>>   #define __pmd(x)        ((pmd_t) { (x) })
> >>>> -
> >>>>   #define PTRS_PER_PMD    (PAGE_SIZE / sizeof(pmd_t))
> >>>>
> >>>>   static inline int pud_present(pud_t pud)
> >>>> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
> >>>>       set_pud(pudp, __pud(0));
> >>>>   }
> >>>>
> >>>> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> >>>> +{
> >>>> +    return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> >>>> +}
> >>>> +
> >>>> +static inline unsigned long _pud_pfn(pud_t pud)
> >>>> +{
> >>>> +    return pud_val(pud) >> _PAGE_PFN_SHIFT;
> >>>> +}
> >>>> +
> >>>>   static inline unsigned long pud_page_vaddr(pud_t pud)
> >>>>   {
> >>>>       return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >>>> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
> >>>>       return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >>>>   }
> >>>>
> >>>> +#define mm_pud_folded    mm_pud_folded
> >>>> +static inline bool mm_pud_folded(struct mm_struct *mm)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return false;
> >>>> +
> >>>> +    return true;
> >>>> +}
> >>>> +
> >>>>   #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> >>>>
> >>>>   static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
> >>>> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> >>>>   #define pmd_ERROR(e) \
> >>>>       pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
> >>>>
> >>>> +#define pud_ERROR(e)    \
> >>>> +    pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> >>>> +
> >>>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        *p4dp = p4d;
> >>>> +    else
> >>>> +        set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_none(p4d_t p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return (p4d_val(p4d) == 0);
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_present(p4d_t p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return (p4d_val(p4d) & _PAGE_PRESENT);
> >>>> +
> >>>> +    return 1;
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_bad(p4d_t p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return !p4d_present(p4d);
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static inline void p4d_clear(p4d_t *p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        set_p4d(p4d, __p4d(0));
> >>>> +}
> >>>> +
> >>>> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return (unsigned long)pfn_to_virt(
> >>>> +                p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> >>>> +
> >>>> +    return pud_page_vaddr((pud_t) { p4d_val(p4d) });
> >>>> +}
> >>>> +
> >>>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> >>>> +
> >>>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> >>>> +{
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> >>>> +
> >>>> +    return (pud_t *)p4d;
> >>>> +}
> >>>> +
> >>>>   #endif /* _ASM_RISCV_PGTABLE_64_H */
> >>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >>>> index dce401eed1d3..06361db3f486 100644
> >>>> --- a/arch/riscv/include/asm/pgtable.h
> >>>> +++ b/arch/riscv/include/asm/pgtable.h
> >>>> @@ -13,8 +13,7 @@
> >>>>
> >>>>   #ifndef __ASSEMBLY__
> >>>>
> >>>> -/* Page Upper Directory not used in RISC-V */
> >>>> -#include <asm-generic/pgtable-nopud.h>
> >>>> +#include <asm-generic/pgtable-nop4d.h>
> >>>>   #include <asm/page.h>
> >>>>   #include <asm/tlbflush.h>
> >>>>   #include <linux/mm_types.h>
> >>>> @@ -27,7 +26,7 @@
> >>>>
> >>>>   #ifdef CONFIG_MMU
> >>>>   #ifdef CONFIG_64BIT
> >>>> -#define VA_BITS        39
> >>>> +#define VA_BITS        (pgtable_l4_enabled ? 48 : 39)
> >>>>   #define PA_BITS        56
> >>>>   #else
> >>>>   #define VA_BITS        32
> >>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >>>> index 1c2fbefb8786..22617bd7477f 100644
> >>>> --- a/arch/riscv/kernel/head.S
> >>>> +++ b/arch/riscv/kernel/head.S
> >>>> @@ -113,6 +113,8 @@ clear_bss_done:
> >>>>       call setup_vm
> >>>>   #ifdef CONFIG_MMU
> >>>>       la a0, early_pg_dir
> >>>> +    la a1, satp_mode
> >>>> +    REG_L a1, (a1)
> >>>>       call relocate
> >>>>   #endif /* CONFIG_MMU */
> >>>>
> >>>> @@ -131,24 +133,28 @@ clear_bss_done:
> >>>>   #ifdef CONFIG_MMU
> >>>>   relocate:
> >>>>   #ifdef CONFIG_RELOCATABLE
> >>>> -    /* Relocate return address */
> >>>> -    la a1, kernel_virt_addr
> >>>> -    REG_L a1, 0(a1)
> >>>> +    /*
> >>>> +     * Relocate return address but save it in case 4-level page table is
> >>>> +     * not supported.
> >>>> +     */
> >>>> +    mv s1, ra
> >>>> +    la a3, kernel_virt_addr
> >>>> +    REG_L a3, 0(a3)
> >>>>   #else
> >>>> -    li a1, PAGE_OFFSET
> >>>> +    li a3, PAGE_OFFSET
> >>>>   #endif
> >>>>       la a2, _start
> >>>> -    sub a1, a1, a2
> >>>> -    add ra, ra, a1
> >>>> +    sub a3, a3, a2
> >>>> +    add ra, ra, a3
> >>>>
> >>>>       /* Point stvec to virtual address of instruction after satp write */
> >>>>       la a2, 1f
> >>>> -    add a2, a2, a1
> >>>> +    add a2, a2, a3
> >>>>       csrw CSR_TVEC, a2
> >>>>
> >>>> +    /* First try with a 4-level page table */
> >>>>       /* Compute satp for kernel page tables, but don't load it yet */
> >>>>       srl a2, a0, PAGE_SHIFT
> >>>> -    li a1, SATP_MODE
> >>>>       or a2, a2, a1
> >>>>
> >>>>       /*
> >>>> @@ -162,6 +168,19 @@ relocate:
> >>>>       or a0, a0, a1
> >>>>       sfence.vma
> >>>>       csrw CSR_SATP, a0
> >>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >>>> +    /*
> >>>> +     * If we fall through here, that means the HW does not support SV48.
> >>>> +     * We need a 3-level page table then simply fold pud into pgd level
> >>>> +     * and finally jump back to relocate with 3-level parameters.
> >>>> +     */
> >>>> +    call setup_vm_fold_pud
> >>>> +
> >>>> +    la a0, early_pg_dir
> >>>> +    li a1, SATP_MODE_39
> >>>> +    mv ra, s1
> >>>> +    tail relocate
> >>>> +#endif
> >>>>   .align 2
> >>>>   1:
> >>>>       /* Set trap vector to spin forever to help debug */
> >>>> @@ -213,6 +232,8 @@ relocate:
> >>>>   #ifdef CONFIG_MMU
> >>>>       /* Enable virtual memory and relocate to virtual address */
> >>>>       la a0, swapper_pg_dir
> >>>> +    la a1, satp_mode
> >>>> +    REG_L a1, (a1)
> >>>>       call relocate
> >>>>   #endif
> >>>>
> >>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> >>>> index 613ec81a8979..152b423c02ea 100644
> >>>> --- a/arch/riscv/mm/context.c
> >>>> +++ b/arch/riscv/mm/context.c
> >>>> @@ -9,6 +9,8 @@
> >>>>   #include <asm/cacheflush.h>
> >>>>   #include <asm/mmu_context.h>
> >>>>
> >>>> +extern uint64_t satp_mode;
> >>>> +
> >>>>   /*
> >>>>    * When necessary, performs a deferred icache flush for the given MM context,
> >>>>    * on the local CPU.  RISC-V has no direct mechanism for instruction cache
> >>>> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>>>       cpumask_set_cpu(cpu, mm_cpumask(next));
> >>>>
> >>>>   #ifdef CONFIG_MMU
> >>>> -    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
> >>>> +    csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
> >>>>       local_flush_tlb_all();
> >>>>   #endif
> >>>>
> >>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >>>> index 18bbb426848e..ad96667d2ab6 100644
> >>>> --- a/arch/riscv/mm/init.c
> >>>> +++ b/arch/riscv/mm/init.c
> >>>> @@ -24,6 +24,17 @@
> >>>>
> >>>>   #include "../kernel/head.h"
> >>>>
> >>>> +#ifdef CONFIG_64BIT
> >>>> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
> >>>> +                SATP_MODE_39 : SATP_MODE_48;
> >>>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
> >>>> +#else
> >>>> +uint64_t satp_mode = SATP_MODE_32;
> >>>> +bool pgtable_l4_enabled = false;
> >>>> +#endif
> >>>> +EXPORT_SYMBOL(pgtable_l4_enabled);
> >>>> +EXPORT_SYMBOL(satp_mode);
> >>>> +
> >>>>   unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >>>>                               __page_aligned_bss;
> >>>>   EXPORT_SYMBOL(empty_zero_page);
> >>>> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
> >>>>
> >>>>   #ifndef __PAGETABLE_PMD_FOLDED
> >>>>
> >>>> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>>>   pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >>>> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>>>   pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >>>>   pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> >>>> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> >>>>
> >>>>   static pmd_t *__init get_pmd_virt(phys_addr_t pa)
> >>>>   {
> >>>> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >>>>       if (mmu_enabled)
> >>>>           return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>>>
> >>>> -    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >>>> +    /* Only one PMD is available for early mapping */
> >>>> +    BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
> >>>>
> >>>>       return (uintptr_t)early_pmd;
> >>>>   }
> >>>> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> >>>>       create_pte_mapping(ptep, va, pa, sz, prot);
> >>>>   }
> >>>>
> >>>> -#define pgd_next_t        pmd_t
> >>>> -#define alloc_pgd_next(__va)    alloc_pmd(__va)
> >>>> -#define get_pgd_next_virt(__pa)    get_pmd_virt(__pa)
> >>>> +static pud_t *__init get_pud_virt(phys_addr_t pa)
> >>>> +{
> >>>> +    if (mmu_enabled) {
> >>>> +        clear_fixmap(FIX_PUD);
> >>>> +        return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> >>>> +    } else {
> >>>> +        return (pud_t *)((uintptr_t)pa);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +static phys_addr_t __init alloc_pud(uintptr_t va)
> >>>> +{
> >>>> +    if (mmu_enabled)
> >>>> +        return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>>> +
> >>>> +    /* Only one PUD is available for early mapping */
> >>>> +    BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >>>> +
> >>>> +    return (uintptr_t)early_pud;
> >>>> +}
> >>>> +
> >>>> +static void __init create_pud_mapping(pud_t *pudp,
> >>>> +                      uintptr_t va, phys_addr_t pa,
> >>>> +                      phys_addr_t sz, pgprot_t prot)
> >>>> +{
> >>>> +    pmd_t *nextp;
> >>>> +    phys_addr_t next_phys;
> >>>> +    uintptr_t pud_index = pud_index(va);
> >>>> +
> >>>> +    if (sz == PUD_SIZE) {
> >>>> +        if (pud_val(pudp[pud_index]) == 0)
> >>>> +            pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    if (pud_val(pudp[pud_index]) == 0) {
> >>>> +        next_phys = alloc_pmd(va);
> >>>> +        pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> >>>> +        nextp = get_pmd_virt(next_phys);
> >>>> +        memset(nextp, 0, PAGE_SIZE);
> >>>> +    } else {
> >>>> +        next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> >>>> +        nextp = get_pmd_virt(next_phys);
> >>>> +    }
> >>>> +
> >>>> +    create_pmd_mapping(nextp, va, pa, sz, prot);
> >>>> +}
> >>>> +
> >>>> +#define pgd_next_t        pud_t
> >>>> +#define alloc_pgd_next(__va)    alloc_pud(__va)
> >>>> +#define get_pgd_next_virt(__pa)    get_pud_virt(__pa)
> >>>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
> >>>> -    create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> -#define fixmap_pgd_next        fixmap_pmd
> >>>> +    create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> +#define fixmap_pgd_next        (pgtable_l4_enabled ?            \
> >>>> +            (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> >>>> +#define trampoline_pgd_next    (pgtable_l4_enabled ?            \
> >>>> +            (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> >>>>   #else
> >>>>   #define pgd_next_t        pte_t
> >>>>   #define alloc_pgd_next(__va)    alloc_pte(__va)
> >>>>   #define get_pgd_next_virt(__pa)    get_pte_virt(__pa)
> >>>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
> >>>>       create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> -#define fixmap_pgd_next        fixmap_pte
> >>>> +#define fixmap_pgd_next        ((uintptr_t)fixmap_pte)
> >>>>   #endif
> >>>>
> >>>>   static void __init create_pgd_mapping(pgd_t *pgdp,
> >>>> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
> >>>>       phys_addr_t next_phys;
> >>>>       uintptr_t pgd_index = pgd_index(va);
> >>>>
> >>>> +#ifndef __PAGETABLE_PMD_FOLDED
> >>>> +    if (!pgtable_l4_enabled) {
> >>>> +        create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
> >>>> +        return;
> >>>> +    }
> >>>> +#endif
> >>>> +
> >>>>       if (sz == PGDIR_SIZE) {
> >>>>           if (pgd_val(pgdp[pgd_index]) == 0)
> >>>>               pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> >>>> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>>
> >>>>       /* Setup early PGD for fixmap */
> >>>>       create_pgd_mapping(early_pg_dir, FIXADDR_START,
> >>>> -               (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>> +               fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>>
> >>>>   #ifndef __PAGETABLE_PMD_FOLDED
> >>>> -    /* Setup fixmap PMD */
> >>>> +    /* Setup fixmap PUD and PMD */
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        create_pud_mapping(fixmap_pud, FIXADDR_START,
> >>>> +               (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> >>>>       create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >>>>                  (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >>>> +
> >>>>       /* Setup trampoline PGD and PMD */
> >>>>       create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >>>> -               (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >>>> +               trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>> +    if (pgtable_l4_enabled)
> >>>> +        create_pud_mapping(trampoline_pud, PAGE_OFFSET,
> >>>> +               (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> >>>>       create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >>>>                  load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >>>>   #else
> >>>> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>>       dtb_early_pa = dtb_pa;
> >>>>   }
> >>>>
> >>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >>>> +/*
> >>>> + * This function is called only if the current kernel is 64bit and the HW
> >>>> + * does not support sv48.
> >>>> + */
> >>>> +asmlinkage __init void setup_vm_fold_pud(void)
> >>>> +{
> >>>> +    pgtable_l4_enabled = false;
> >>>> +    kernel_virt_addr = PAGE_OFFSET_L3;
> >>>> +    satp_mode = SATP_MODE_39;
> >>>> +
> >>>> +    /*
> >>>> +     * PTE/PMD levels do not need to be cleared as they are common between
> >>>> +     * 3- and 4-level page tables: the 30 least significant bits
> >>>> +     * (2 * 9 + 12) are common.
> >>>> +     */
> >>>> +    memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >>>> +    memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >>>> +
> >>>> +    setup_vm(dtb_early_pa);
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>   static void __init setup_vm_final(void)
> >>>>   {
> >>>>       uintptr_t va, map_size;
> >>>> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
> >>>>           }
> >>>>       }
> >>>>
> >>>> -    /* Clear fixmap PTE and PMD mappings */
> >>>> +    /* Clear fixmap page table mappings */
> >>>>       clear_fixmap(FIX_PTE);
> >>>>       clear_fixmap(FIX_PMD);
> >>>> +    clear_fixmap(FIX_PUD);
> >>>>
> >>>>       /* Move to swapper page table */
> >>>> -    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> >>>> +    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> >>>>       local_flush_tlb_all();
> >>>>   }
> >>>>   #else
> >>
> >> Alex




Thread overview: 35+ messages
2020-03-22 11:00 [RFC PATCH 0/7] Introduce sv48 support Alexandre Ghiti
2020-03-22 11:00 ` [RFC PATCH 1/7] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE Alexandre Ghiti
2020-03-26  6:10   ` Anup Patel
2020-04-03 15:17   ` Palmer Dabbelt
2020-04-07  5:12     ` Alex Ghiti
2020-03-22 11:00 ` [RFC PATCH 2/7] riscv: Allow to dynamically define VA_BITS Alexandre Ghiti
2020-03-26  6:12   ` Anup Patel
2020-04-03 15:17   ` Palmer Dabbelt
2020-04-07  5:12     ` Alex Ghiti
2020-03-22 11:00 ` [RFC PATCH 3/7] riscv: Simplify MAXPHYSMEM config Alexandre Ghiti
2020-03-26  6:22   ` Anup Patel
2020-03-26  6:34   ` Anup Patel
2020-04-03 15:53   ` Palmer Dabbelt
2020-04-07  5:13     ` Alex Ghiti
2020-03-22 11:00 ` [RFC PATCH 4/7] riscv: Implement sv48 support Alexandre Ghiti
2020-03-26  7:00   ` Anup Patel
2020-03-31 16:31     ` Alex Ghiti
2020-04-03 15:53   ` Palmer Dabbelt
2020-04-07  5:14     ` Alex Ghiti
2020-04-07  5:56       ` Anup Patel
2020-04-08  4:39         ` Alex Ghiti
2020-04-08  5:06           ` Anup Patel
2020-03-22 11:00 ` [RFC PATCH 5/7] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo Alexandre Ghiti
2020-03-26  7:01   ` Anup Patel
2020-04-03 15:53   ` Palmer Dabbelt
2020-04-07  5:14     ` Alex Ghiti
2020-03-22 11:00 ` [RFC PATCH 6/7] dt-bindings: riscv: Remove "riscv, svXX" property from device-tree Alexandre Ghiti
2020-03-26  7:03   ` Anup Patel
2020-04-03 15:53   ` Palmer Dabbelt
2020-04-07  5:14     ` Alex Ghiti
2020-03-22 11:00 ` [RFC PATCH 7/7] riscv: Explicit comment about user virtual address space size Alexandre Ghiti
2020-03-26  7:05   ` Anup Patel
2020-04-03 15:53   ` Palmer Dabbelt
2020-04-07  5:15     ` Alex Ghiti
2020-03-31 19:53 ` [RFC PATCH 0/7] Introduce sv48 support Palmer Dabbelt
