linux-mm.kvack.org archive mirror
* [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4
@ 2017-04-06 14:00 Kirill A. Shutemov
  2017-04-06 14:00 ` [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
                   ` (7 more replies)
  0 siblings, 8 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's the fourth and last batch of patches that brings initial 5-level
paging enabling.

Please review and consider applying.

As Ingo requested, I've tried to rewrite the assembly parts of the boot
process in C before bringing in 5-level paging support. The only part
where I succeeded is startup_64 in arch/x86/kernel/head_64.S. Most of its
logic is now in C.

I failed to rewrite startup_32 in arch/x86/boot/compressed/head_64.S in C.
The code I need to modify is still in 32-bit mode, but if I moved it to C
it would be compiled as 64-bit. I've tried to move it into a separate
translation unit and compile it with -m32, but then the linking phase
fails due to a type mismatch between object files.

I also had trouble rewriting secondary_startup_64: the stack breaks as
soon as we switch to the new page tables when onlining secondary CPUs. I
don't know how to get around this.

I hope it's not a show-stopper.

If you know how to get around these issues, let me know.

Kirill A. Shutemov (8):
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/include/asm/elf.h                  |   2 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |   9 +-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 132 ++++++---------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  28 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   2 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 24 files changed, 463 insertions(+), 180 deletions(-)

-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
@ 2017-04-06 14:00 ` Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch rewrites most of the startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 93 +----------------------------------------------
 2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..9656c5951b98 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	call	__startup_64
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0


* [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  2017-04-06 14:00 ` [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer the top level of the page
tables.

Let's give these variables more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..d44c350797bf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
 	leaq	_text(%rip), %rdi
 	call	__startup_64
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be71a72..6680cefc062e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0


* [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  2017-04-06 14:00 ` [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-11  7:02   ` Ingo Molnar
  2017-04-06 14:01 ` [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems, with a
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d44c350797bf..b24fc575a6da 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -98,11 +102,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -328,7 +335,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -341,9 +352,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -357,6 +368,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2017-04-06 14:01 ` [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores a slightly modified version of the original
sync_global_pgds(), which we had before the folded p4d was introduced.

The only modification is protection against 'address' overflow.
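
To illustrate the overflow protection, here is a minimal userspace sketch of
the loop condition used in the patch. The helper name count_steps() and the
'step' parameter (standing in for PGDIR_SIZE) are hypothetical and exist only
for this illustration; they are not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the iterations of a loop shaped like the one in sync_global_pgds().
 * 'step' stands in for PGDIR_SIZE (hypothetical parameter).
 *
 * When 'end' is close to the top of the address space, "address += step"
 * can wrap around to a small value that still satisfies "address <= end".
 * The extra "address >= start" test catches the wrap and ends the loop.
 */
static uint64_t count_steps(uint64_t start, uint64_t end, uint64_t step)
{
	uint64_t address, n = 0;

	assert(step != 0);
	for (address = start; address <= end && address >= start; address += step)
		n++;
	return n;
}
```

With a 2^48 step (the PGDIR_SIZE of a 5-level configuration) and a range that
ends at the very last address, the loop visits each top-level slot once and
then stops when the address wraps, instead of looping forever.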

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2017-04-06 14:01 ` [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate additional page table level if CONFIG_X86_5LEVEL is enabled.
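
As a rough sketch of what the additional level means for address translation,
the following userspace code splits a virtual address into its upper
page-table indices for 4-level and 5-level layouts. The struct pt_path and
walk_indices() names are hypothetical; the shift values are the usual x86-64
ones (PUD at bit 30, P4D at bit 39, PGD at bit 39 or 48).

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_TABLE	512	/* entries per table at every level on x86-64 */
#define PUD_SHIFT	30
#define P4D_SHIFT	39	/* a real level only with 5-level paging */

/* Hypothetical container for the upper-level indices of one address. */
struct pt_path {
	unsigned pgd, p4d, pud;
};

static struct pt_path walk_indices(uint64_t vaddr, int levels)
{
	struct pt_path p;
	int pgdir_shift = (levels == 5) ? 48 : 39;

	assert(levels == 4 || levels == 5);
	p.pgd = (unsigned)((vaddr >> pgdir_shift) & (PTRS_PER_TABLE - 1));
	/* With 4 levels the p4d is folded into the pgd: one slot, index 0. */
	p.p4d = (levels == 5) ?
		(unsigned)((vaddr >> P4D_SHIFT) & (PTRS_PER_TABLE - 1)) : 0;
	p.pud = (unsigned)((vaddr >> PUD_SHIFT) & (PTRS_PER_TABLE - 1));
	return p;
}
```

For __START_KERNEL_map (0xffffffff80000000) this yields PGD index 511 and PUD
index 510 in both layouts, plus P4D index 511 with 5 levels, which matches the
".fill 511" entries in the early page tables of patch 3/8.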

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2017-04-06 14:01 ` [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 7/8] x86: Enable 5-level paging support Kirill A. Shutemov
  2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
  7 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging, randomization happens on the P4D level instead of PUD.

The maximum amount of physical memory is also bumped to 52 bits for 5-level
paging.
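
The arithmetic behind both changes can be sketched in userspace as follows.
The function names direct_map_size_tb() and align_entropy() are hypothetical;
the sketch assumes TB_SHIFT is 40 and that the physical mask shift is 46 with
4-level paging and 52 with 5-level paging.

```c
#include <assert.h>
#include <stdint.h>

#define TB_SHIFT	40	/* 1 TB == 2^40 bytes */

/* Size of the direct-mapping region in TB: 1 << (phys_bits - TB_SHIFT). */
static uint64_t direct_map_size_tb(unsigned phys_mask_shift)
{
	assert(phys_mask_shift >= TB_SHIFT && phys_mask_shift < 64);
	return 1ULL << (phys_mask_shift - TB_SHIFT);
}

/*
 * Pick a random offset no larger than max_entropy and round it down to
 * the randomization granule: 1 GB (PUD) with 4 levels, 512 GB (P4D)
 * with 5 levels.
 */
static uint64_t align_entropy(uint64_t rand, uint64_t max_entropy,
			      int five_level)
{
	uint64_t granule = five_level ? (1ULL << 39) : (1ULL << 30);

	assert(max_entropy != UINT64_MAX);	/* keep "+ 1" from wrapping */
	return (rand % (max_entropy + 1)) & ~(granule - 1);
}
```

Under these assumptions the old hard-coded 64 TB maximum corresponds to 46
physical address bits, and 52 bits gives 4096 TB.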

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2017-04-06 14:01 ` [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 14:52   ` Juergen Gross
  2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
  7 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2017-04-06 14:01 ` [PATCH 7/8] x86: Enable 5-level paging support Kirill A. Shutemov
@ 2017-04-06 14:01 ` Kirill A. Shutemov
  2017-04-06 18:43   ` Dmitry Safonov
  2017-04-07 13:35   ` Anshuman Khandual
  7 siblings, 2 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47-bits.

If the hint address is set above 47-bit, but MAP_FIXED is not specified, we
try to look for an unmapped area at the specified address. If it's already
occupied, we look for an unmapped area in the *full* address space, rather
than in the 47-bit window.
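
The limit selection described above can be modelled in userspace like this.
It is a simplified sketch of the bottom-up case only; search_high_limit() is a
hypothetical name, and TASK_SIZE_5LEVEL assumes the 56-bit userspace address
space mentioned at the top of this message.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE		4096ULL
#define DEFAULT_MAP_WINDOW	((1ULL << 47) - PAGE_SIZE)
#define TASK_SIZE_5LEVEL	((1ULL << 56) - PAGE_SIZE)	/* assumed */

/*
 * Upper bound of the search for a free area. A hint above the default
 * window opts in to the full address space; any other hint (including
 * no hint at all) keeps the search inside the 47-bit window. 'end' is
 * the limit the caller would have used before this patch.
 */
static uint64_t search_high_limit(uint64_t hint, uint64_t end)
{
	uint64_t cap = (hint > DEFAULT_MAP_WINDOW) ?
			TASK_SIZE_5LEVEL : DEFAULT_MAP_WINDOW;

	assert(end <= TASK_SIZE_5LEVEL);
	return end < cap ? end : cap;
}
```

From userspace, opting in would amount to passing a hint such as
(void *)(1UL << 48) as the first argument of mmap(); with a NULL hint the
kernel keeps returning addresses below the 47-bit boundary.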

This approach makes it easy to make an application's memory allocator aware
of the large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.
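
The second half of that rule (no new VMAs above the boundary once MPX is
enabled) can be sketched as a pure function. This is a userspace model, not
the kernel code: checked_hint() is a hypothetical name and the mpx_managed
flag stands in for the kernel's check of whether it manages MPX tables for
the current mm.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE		4096ULL
#define DEFAULT_MAP_WINDOW	((1ULL << 47) - PAGE_SIZE)

/*
 * Returns the hint to use for the search, or a negative errno.
 * A return value of 0 means "drop the hint and search inside the
 * default window".
 */
static int64_t checked_hint(bool mpx_managed, uint64_t addr, uint64_t len,
			    bool map_fixed)
{
	/* Keep addr + len from wrapping in this model. */
	assert(addr < (1ULL << 56) && len < (1ULL << 56));

	if (!mpx_managed)
		return (int64_t)addr;	/* no MPX: hint is left untouched */
	if (addr + len <= DEFAULT_MAP_WINDOW)
		return (int64_t)addr;	/* request fits below the boundary */
	if (map_fixed)
		return -ENOMEM;		/* cannot honour a fixed high address */
	if (len > DEFAULT_MAP_WINDOW)
		return -ENOMEM;		/* would never fit in the window */
	return 0;
}
```

In this model a task with MPX enabled that passes a high hint without
MAP_FIXED silently falls back to the 47-bit window, while the same request
with MAP_FIXED fails with -ENOMEM.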

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h |  9 ++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..9f437aea7f57 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:01 ` [PATCH 7/8] x86: Enable 5-level paging support Kirill A. Shutemov
@ 2017-04-06 14:52   ` Juergen Gross
  2017-04-06 15:24     ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Juergen Gross @ 2017-04-06 14:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 06/04/17 16:01, Kirill A. Shutemov wrote:
> Most of things are in place and we can enable support of 5-level paging.
> 
> Enabling XEN with 5-level paging requires more work. The patch makes XEN
> dependent on !X86_5LEVEL.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     | 5 +++++
>  arch/x86/xen/Kconfig | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4e153e93273f..7a76dcac357e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>  
>  config PGTABLE_LEVELS
>  	int
> +	default 5 if X86_5LEVEL
>  	default 4 if X86_64
>  	default 3 if X86_PAE
>  	default 2
> @@ -1390,6 +1391,10 @@ config X86_PAE
>  	  has the cost of more pagetable lookup overhead, and also
>  	  consumes more pagetable space per process.
>  
> +config X86_5LEVEL
> +	bool "Enable 5-level page tables support"
> +	depends on X86_64
> +
>  config ARCH_PHYS_ADDR_T_64BIT
>  	def_bool y
>  	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> index 76b6dbd627df..b90d481ce5a1 100644
> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -5,6 +5,7 @@
>  config XEN
>  	bool "Xen guest support"
>  	depends on PARAVIRT
> +	depends on !X86_5LEVEL
>  	select PARAVIRT_CLOCK
>  	select XEN_HAVE_PVMMU
>  	select XEN_HAVE_VPMU
> 

Just a heads up: this last change will conflict with the Xen tree.

Can't we just ignore the additional level in Xen pv mode and run with
4 levels instead?


Juergen


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:52   ` Juergen Gross
@ 2017-04-06 15:24     ` Kirill A. Shutemov
  2017-04-06 15:56       ` Juergen Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 15:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
> On 06/04/17 16:01, Kirill A. Shutemov wrote:
> > Most of things are in place and we can enable support of 5-level paging.
> > 
> > Enabling XEN with 5-level paging requires more work. The patch makes XEN
> > dependent on !X86_5LEVEL.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig     | 5 +++++
> >  arch/x86/xen/Kconfig | 1 +
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 4e153e93273f..7a76dcac357e 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
> >  
> >  config PGTABLE_LEVELS
> >  	int
> > +	default 5 if X86_5LEVEL
> >  	default 4 if X86_64
> >  	default 3 if X86_PAE
> >  	default 2
> > @@ -1390,6 +1391,10 @@ config X86_PAE
> >  	  has the cost of more pagetable lookup overhead, and also
> >  	  consumes more pagetable space per process.
> >  
> > +config X86_5LEVEL
> > +	bool "Enable 5-level page tables support"
> > +	depends on X86_64
> > +
> >  config ARCH_PHYS_ADDR_T_64BIT
> >  	def_bool y
> >  	depends on X86_64 || X86_PAE
> > diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> > index 76b6dbd627df..b90d481ce5a1 100644
> > --- a/arch/x86/xen/Kconfig
> > +++ b/arch/x86/xen/Kconfig
> > @@ -5,6 +5,7 @@
> >  config XEN
> >  	bool "Xen guest support"
> >  	depends on PARAVIRT
> > +	depends on !X86_5LEVEL
> >  	select PARAVIRT_CLOCK
> >  	select XEN_HAVE_PVMMU
> >  	select XEN_HAVE_VPMU
> > 
> 
> Just a heads up: this last change will conflict with the Xen tree.

It should be trivial to fix, right? It's a one-liner after all.

> Can't we just ignore the additional level in Xen pv mode and run with
> 4 levels instead?

We don't have boot-time switching between paging modes yet. It will
come later. So the answer is no.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 15:24     ` Kirill A. Shutemov
@ 2017-04-06 15:56       ` Juergen Gross
  0 siblings, 0 replies; 46+ messages in thread
From: Juergen Gross @ 2017-04-06 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On 06/04/17 17:24, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
>> On 06/04/17 16:01, Kirill A. Shutemov wrote:
>>> Most of things are in place and we can enable support of 5-level paging.
>>>
>>> Enabling XEN with 5-level paging requires more work. The patch makes XEN
>>> dependent on !X86_5LEVEL.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  arch/x86/Kconfig     | 5 +++++
>>>  arch/x86/xen/Kconfig | 1 +
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 4e153e93273f..7a76dcac357e 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>>>  
>>>  config PGTABLE_LEVELS
>>>  	int
>>> +	default 5 if X86_5LEVEL
>>>  	default 4 if X86_64
>>>  	default 3 if X86_PAE
>>>  	default 2
>>> @@ -1390,6 +1391,10 @@ config X86_PAE
>>>  	  has the cost of more pagetable lookup overhead, and also
>>>  	  consumes more pagetable space per process.
>>>  
>>> +config X86_5LEVEL
>>> +	bool "Enable 5-level page tables support"
>>> +	depends on X86_64
>>> +
>>>  config ARCH_PHYS_ADDR_T_64BIT
>>>  	def_bool y
>>>  	depends on X86_64 || X86_PAE
>>> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>>> index 76b6dbd627df..b90d481ce5a1 100644
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -5,6 +5,7 @@
>>>  config XEN
>>>  	bool "Xen guest support"
>>>  	depends on PARAVIRT
>>> +	depends on !X86_5LEVEL
>>>  	select PARAVIRT_CLOCK
>>>  	select XEN_HAVE_PVMMU
>>>  	select XEN_HAVE_VPMU
>>>
>>
>> Just a heads up: this last change will conflict with the Xen tree.
> 
> It should be trivial to fix, right? It's one-liner after all.

Right. Just wanted to mention it.


Juergen


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
@ 2017-04-06 18:43   ` Dmitry Safonov
  2017-04-06 19:15     ` Dmitry Safonov
  2017-04-07 13:35   ` Anshuman Khandual
  1 sibling, 1 reply; 46+ messages in thread
From: Dmitry Safonov @ 2017-04-06 18:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

Hi Kirill,

On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.

Do you want the mmap() calls that follow the first over-47-bit mapping
to also return addresses above 47 bits when there is free space?
If so, you could simplify all this code by changing only mm->mmap_base
on the first over-47-bit mmap() call.
That would do the trick.

>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled if we already have a VMA above
> the boundary, and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h |  9 ++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)

This will kill 32-bit userspace:
DEFAULT_MAP_WINDOW is defined as what was previously TASK_SIZE_MAX,
not TASK_SIZE, so for ia32/x32 ELF_ET_DYN_BASE will be above 4GB.

Here is the test:
[root@localhost test]# cat hello-world.c
#include <stdio.h>

int main(int argc, char **argv)
{
	printf("Maybe this world is another planet's hell.\n");
	return 0;
}
[root@localhost test]# gcc -m32 hello-world.c -o hello-world
[root@localhost test]# ./hello-world
[   35.306726] hello-world[1948]: segfault at ffa5288c ip 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
Segmentation fault (core dumped)

So the dynamic base should differ between 32-bit and 64-bit tasks, as it
did with TASK_SIZE.
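
To spell out the arithmetic, here is a small userspace sketch (not kernel
code). The helper names are made up for illustration, and the 32-bit limit
0xFFFFe000 is an assumed stand-in for IA32_PAGE_OFFSET; the 64-bit window
is the (1UL << 47) - PAGE_SIZE value from the patch.

```c
#include <stdint.h>

/* Illustrative constants; SKETCH_* names are not kernel identifiers. */
#define SKETCH_PAGE_SIZE   4096ULL
#define SKETCH_WINDOW_64   ((1ULL << 47) - SKETCH_PAGE_SIZE) /* from the patch */
#define SKETCH_TASK_32     0xFFFFe000ULL /* assumed 32-bit task size */

/* Same formula as ELF_ET_DYN_BASE: two thirds of the given limit. */
static inline uint64_t sketch_et_dyn_base(uint64_t limit)
{
	return limit / 3 * 2;
}

/* A 32-bit task can only use addresses below 4GB. */
static inline int sketch_fits_in_32bit(uint64_t addr)
{
	return addr < (1ULL << 32);
}
```

With the 64-bit window the base lands far above 4GB, which a 32-bit task
cannot address; with the 32-bit limit it stays below 4GB.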


>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..9f437aea7f57 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

ditto

>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, TASK_SIZE now depends on TIF_ADDR32, which is set during exec().
That means that for an ia32/x32 ELF, which has TASK_SIZE < 4GB because
TIF_ADDR32 is set but which can still make 64-bit syscalls, the
subtraction will go negative.
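
A userspace sketch of that wrap-around, assuming unsigned 64-bit
arithmetic as with unsigned long on x86-64. The function names and the
guarded variant are hypothetical and only illustrate the concern; they
are not a proposed fix.

```c
#include <stdint.h>

/*
 * Mirrors "info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW".
 * If task_size < window, the unsigned difference wraps modulo 2^64
 * and the resulting limit is meaningless.
 */
static inline uint64_t sketch_raise_limit(uint64_t mmap_base,
					  uint64_t task_size,
					  uint64_t window)
{
	return mmap_base + (task_size - window);
}

/* Hypothetical guard: only raise the limit when task_size is larger. */
static inline uint64_t sketch_raise_limit_checked(uint64_t mmap_base,
						  uint64_t task_size,
						  uint64_t window)
{
	if (task_size <= window)
		return mmap_base;
	return mmap_base + (task_size - window);
}
```

For a task whose size is below 4GB and a 47-bit window, the unchecked
version returns a value far above the window, while the guarded one
leaves the limit at mmap_base.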


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

ditto

> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;

ditto about 32-bits

>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 18:43   ` Dmitry Safonov
@ 2017-04-06 19:15     ` Dmitry Safonov
  2017-04-06 23:21       ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Dmitry Safonov @ 2017-04-06 19:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> Hi Kirill,
>
> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> Not all user space is ready to handle wide addresses. It's known that
>> at least some JIT compilers use higher bits in pointers to encode their
>> information. It collides with valid pointers with 5-level paging and
>> leads to crashes.
>>
>> To mitigate this, we are not going to allocate virtual address space
>> above 47-bit by default.
>>
>> But userspace can ask for allocation from full address space by
>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>
>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>> to look for unmapped area by specified address. If it's already
>> occupied, we look for unmapped area in *full* address space, rather than
>> from 47-bit window.
>
> Do you wish after the first over-47-bit mapping the following mmap()
> calls return also over-47-bits if there is free space?
> It so, you could simplify all this code by changing only mm->mmap_base
> on the first over-47-bit mmap() call.
> This will do simple trick.
>
>>
>> This approach helps to easily make application's memory allocator aware
>> about large address space without manually tracking allocated virtual
>> address space.
>>
>> One important case we need to handle here is interaction with MPX.
>> MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
>> need to make sure that MPX cannot be enabled if we already have a VMA above
>> the boundary, and forbid creating such VMAs once MPX is enabled.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>  arch/x86/include/asm/elf.h       |  2 +-
>>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>>  arch/x86/include/asm/processor.h |  9 ++++++---
>>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>>  arch/x86/mm/mmap.c               |  2 +-
>>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>>  7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
>> index d4d3ed456cb7..67260dbe1688 100644
>> --- a/arch/x86/include/asm/elf.h
>> +++ b/arch/x86/include/asm/elf.h
>> @@ -250,7 +250,7 @@ extern int force_personality32;
>>     the loader.  We need to make sure that it is out of the way of the
>> program
>>     that it will "exec", and that there is sufficient room for the
>> brk.  */
>>
>> -#define ELF_ET_DYN_BASE        (TASK_SIZE / 3 * 2)
>> +#define ELF_ET_DYN_BASE        (DEFAULT_MAP_WINDOW / 3 * 2)
>
> This will kill 32-bit userspace:
> As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
> not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.
>
> Here is the test:
> [root@localhost test]# cat hello-world.c
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     printf("Maybe this world is another planet's hell.\n");
>         return 0;
> }
> [root@localhost test]# gcc -m32 hello-world.c -o hello-world
> [root@localhost test]# ./hello-world
> [   35.306726] hello-world[1948]: segfault at ffa5288c ip
> 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
> Segmentation fault (core dumped)
>
> So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.

I just tried defining it like this:
-#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
+#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
+                               IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

And it seems to work better.
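
As a rough userspace model of what that definition changes, the sketch
below picks the window from a boolean standing in for TIF_ADDR32 and
derives the two values that the patch bases on it. All names are
hypothetical, and 0xFFFFe000 is an assumed value for IA32_PAGE_OFFSET.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE 4096ULL

/* Window chosen per task: 32-bit limit if addr32 is set, else 47-bit. */
static inline uint64_t sketch_map_window(bool addr32)
{
	return addr32 ? 0xFFFFe000ULL : ((1ULL << 47) - SKETCH_PAGE);
}

/* ELF_ET_DYN_BASE formula applied to the per-task window. */
static inline uint64_t sketch_dyn_base(bool addr32)
{
	return sketch_map_window(addr32) / 3 * 2;
}

/* TASK_UNMAPPED_BASE formula: a third of the window, page-aligned up. */
static inline uint64_t sketch_unmapped_base(bool addr32)
{
	uint64_t third = sketch_map_window(addr32) / 3;

	return (third + SKETCH_PAGE - 1) & ~(SKETCH_PAGE - 1);
}
```

With addr32 set, both derived bases stay below 4GB, which matches the
hello-world test no longer segfaulting.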

>
>
>>
>>  /* This yields a mask that user programs can use to figure out what
>>     instruction set this CPU supports.  This could be done in user space,
>> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
>> index a0d662be4c5b..7d7404756bb4 100644
>> --- a/arch/x86/include/asm/mpx.h
>> +++ b/arch/x86/include/asm/mpx.h
>> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>>  }
>>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>>                unsigned long start, unsigned long end);
>> +
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags);
>>  #else
>>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>>  {
>> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct
>> mm_struct *mm,
>>                      unsigned long start, unsigned long end)
>>  {
>>  }
>> +
>> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
>> +        unsigned long len, unsigned long flags)
>> +{
>> +    return addr;
>> +}
>>  #endif /* CONFIG_X86_INTEL_MPX */
>>
>>  #endif /* _ASM_X86_MPX_H */
>> diff --git a/arch/x86/include/asm/processor.h
>> b/arch/x86/include/asm/processor.h
>> index 3cada998a402..9f437aea7f57 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define IA32_PAGE_OFFSET    PAGE_OFFSET
>>  #define TASK_SIZE        PAGE_OFFSET
>>  #define TASK_SIZE_MAX        TASK_SIZE
>> +#define DEFAULT_MAP_WINDOW    TASK_SIZE
>>  #define STACK_TOP        TASK_SIZE
>>  #define STACK_TOP_MAX        STACK_TOP
>>
>> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>>   * particular problem by preventing anything from being mapped
>>   * at the maximum canonical address.
>>   */
>> -#define TASK_SIZE_MAX    ((1UL << 47) - PAGE_SIZE)
>> +#define TASK_SIZE_MAX    ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>> +
>> +#define DEFAULT_MAP_WINDOW    ((1UL << 47) - PAGE_SIZE)
>>
>>  /* This decides where the kernel will search for a free chunk of vm
>>   * space during mmap's.
>> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define TASK_SIZE_OF(child)    ((test_tsk_thread_flag(child,
>> TIF_ADDR32)) ? \
>>                      IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>>
>> -#define STACK_TOP        TASK_SIZE
>> +#define STACK_TOP        DEFAULT_MAP_WINDOW
>>  #define STACK_TOP_MAX        TASK_SIZE_MAX
>>
>>  #define INIT_THREAD  {                        \
>> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs,
>> unsigned long new_ip,
>>   * space during mmap's.
>>   */
>>  #define __TASK_UNMAPPED_BASE(task_size)    (PAGE_ALIGN(task_size / 3))
>> -#define TASK_UNMAPPED_BASE        __TASK_UNMAPPED_BASE(TASK_SIZE)
>> +#define TASK_UNMAPPED_BASE
>> __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> ditto
>
>>
>>  #define KSTK_EIP(task)        (task_pt_regs(task)->ip)
>>
>> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
>> index 207b8f2582c7..593a31e93812 100644
>> --- a/arch/x86/kernel/sys_x86_64.c
>> +++ b/arch/x86/kernel/sys_x86_64.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/compat.h>
>>  #include <asm/ia32.h>
>>  #include <asm/syscalls.h>
>> +#include <asm/mpx.h>
>>
>>  /*
>>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
>> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      struct vm_unmapped_area_info info;
>>      unsigned long begin, end;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (flags & MAP_FIXED)
>>          return addr;
>>
>> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      info.flags = 0;
>>      info.length = len;
>>      info.low_limit = begin;
>> -    info.high_limit = end;
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = min(end, TASK_SIZE);
>> +    else
>> +        info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      unsigned long addr = addr0;
>>      struct vm_unmapped_area_info info;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      /* requested length too big for entire address space */
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> is set, which can do 64-bit syscalls - the subtraction will be
> a negative..
>
>
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 302f43fd9c28..9a0b89252c52 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -18,6 +18,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/elf.h>
>> +#include <asm/mpx.h>
>>
>>  #if 0    /* This is just for testing */
>>  struct page *
>> @@ -87,23 +88,38 @@ static unsigned long
>> hugetlb_get_unmapped_area_bottomup(struct file *file,
>>      info.low_limit = get_mmap_base(1);
>>      info.high_limit = in_compat_syscall() ?
>>          tasksize_32bit() : tasksize_64bit();
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = TASK_SIZE;
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      return vm_unmapped_area(&info);
>>  }
>>
>>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file
>> *file,
>> -        unsigned long addr0, unsigned long len,
>> +        unsigned long addr, unsigned long len,
>>          unsigned long pgoff, unsigned long flags)
>>  {
>>      struct hstate *h = hstate_file(file);
>>      struct vm_unmapped_area_info info;
>> -    unsigned long addr;
>>
>>      info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> ditto
>
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      addr = vm_unmapped_area(&info);
>> @@ -118,7 +134,7 @@ static unsigned long
>> hugetlb_get_unmapped_area_topdown(struct file *file,
>>          VM_BUG_ON(addr != -ENOMEM);
>>          info.flags = 0;
>>          info.low_limit = TASK_UNMAPPED_BASE;
>> -        info.high_limit = TASK_SIZE;
>> +        info.high_limit = DEFAULT_MAP_WINDOW;
>
> ditto about 32-bits
>
>>          addr = vm_unmapped_area(&info);
>>      }
>>
>> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file,
>> unsigned long addr,
>>
>>      if (len & ~huge_page_mask(h))
>>          return -EINVAL;
>> +
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>>
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 19ad095b41df..d63232a31945 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>>
>>  unsigned long tasksize_64bit(void)
>>  {
>> -    return TASK_SIZE_MAX;
>> +    return DEFAULT_MAP_WINDOW;
>>  }
>>
>>  static unsigned long stack_maxrandom_size(unsigned long task_size)
>> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
>> index cd44ae727df7..a26a1b373fd0 100644
>> --- a/arch/x86/mm/mpx.c
>> +++ b/arch/x86/mm/mpx.c
>> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>>       */
>>      bd_base = mpx_get_bounds_dir();
>>      down_write(&mm->mmap_sem);
>> +
>> +    /* MPX doesn't support addresses above 47-bits yet. */
>> +    if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
>> +        pr_warn_once("%s (%d): MPX cannot handle addresses "
>> +                "above 47-bits. Disabling.",
>> +                current->comm, current->pid);
>> +        ret = -ENXIO;
>> +        goto out;
>> +    }
>>      mm->context.bd_addr = bd_base;
>>      if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>>          ret = -ENXIO;
>> -
>> +out:
>>      up_write(&mm->mmap_sem);
>>      return ret;
>>  }
>> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>>      if (ret)
>>          force_sig(SIGSEGV, current);
>>  }
>> +
>> +/* MPX cannot handle addresses above 47-bits yet. */
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags)
>> +{
>> +    if (!kernel_managing_mpx_tables(current->mm))
>> +        return addr;
>> +    if (addr + len <= DEFAULT_MAP_WINDOW)
>> +        return addr;
>> +    if (flags & MAP_FIXED)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Requested len is larger than whole area we're allowed to map in.
>> +     * Resetting hinting address wouldn't do much good -- fail early.
>> +     */
>> +    if (len > DEFAULT_MAP_WINDOW)
>> +        return -ENOMEM;
>> +
>> +    /* Look for unmap area within DEFAULT_MAP_WINDOW */
>> +    return 0;
>> +}
>>
>
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 19:15     ` Dmitry Safonov
@ 2017-04-06 23:21       ` Kirill A. Shutemov
  2017-04-06 23:24         ` [PATCHv2 " Kirill A. Shutemov
  2017-04-07 10:06         ` [PATCH 8/8] " Dmitry Safonov
  0 siblings, 2 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:21 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> > Hi Kirill,
> > 
> > On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> > > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > > Not all user space is ready to handle wide addresses. It's known that
> > > at least some JIT compilers use higher bits in pointers to encode their
> > > information. It collides with valid pointers with 5-level paging and
> > > leads to crashes.
> > > 
> > > To mitigate this, we are not going to allocate virtual address space
> > > above 47-bit by default.
> > > 
> > > But userspace can ask for allocation from full address space by
> > > specifying hint address (with or without MAP_FIXED) above 47-bits.
> > > 
> > > If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > > to look for unmapped area by specified address. If it's already
> > > occupied, we look for unmapped area in *full* address space, rather than
> > > from 47-bit window.
> > 
> > Do you wish after the first over-47-bit mapping the following mmap()
> > calls return also over-47-bits if there is free space?
> > If so, you could simplify all this code by changing only mm->mmap_base
> > on the first over-47-bit mmap() call.
> > This will do simple trick.

No.

I want every allocation to explicitly opt in to the large address space. It's
an additional fail-safe: if a library can't handle large addresses, it has a
better chance of surviving if its own allocations stay within 47 bits.
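
For illustration only, a minimal userspace sketch of what this per-allocation
opt-in could look like. The helper names map_with_hint() and
is_above_window() and the WINDOW_47BIT constant are made up for the example,
and a 64-bit unsigned long is assumed. On a kernel or CPU without 5-level
paging the high hint is simply not honoured and the mapping lands below
47 bits.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical boundary of the default window: 47 bits (64-bit long assumed). */
#define WINDOW_47BIT	(1UL << 47)

/*
 * Hypothetical helper: anonymous private mapping with an address hint.
 * Without MAP_FIXED the hint is only a request; the kernel may place the
 * mapping elsewhere.
 */
static void *map_with_hint(void *hint, size_t len)
{
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Nonzero if the address lies at or above the 47-bit boundary. */
static int is_above_window(const void *p)
{
	return (uintptr_t)p >= WINDOW_47BIT;
}
```

A caller that can cope with wide pointers would pass a hint such as
(void *)(1UL << 48) and check the result with is_above_window(); everything
else keeps passing NULL and is expected to stay within 47 bits.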

> I just tried to define it like this:
> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
> PAGE_SIZE))
> 
> And it looks working better.

Okay, thanks. I'll send v2.

> > > +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> > > +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> > 
> > Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> > That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> > is set, which can do 64-bit syscalls - the subtraction will be
> > a negative..

With your proposed change to the DEFAULT_MAP_WINDOW definition it should be
okay, right?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:21       ` Kirill A. Shutemov
@ 2017-04-06 23:24         ` Kirill A. Shutemov
  2017-04-07 11:32           ` Dmitry Safonov
  2017-04-07 10:06         ` [PATCH 8/8] " Dmitry Safonov
  1 sibling, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:24 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach makes it easy to teach an application's memory allocator
about the large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47 bits, so we
need to make sure that MPX cannot be enabled if we already have a VMA above
the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 10 +++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..a98395e89ac6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
+				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:21       ` Kirill A. Shutemov
  2017-04-06 23:24         ` [PATCHv2 " Kirill A. Shutemov
@ 2017-04-07 10:06         ` Dmitry Safonov
  1 sibling, 0 replies; 46+ messages in thread
From: Dmitry Safonov @ 2017-04-07 10:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:21 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
>> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
>>> Hi Kirill,
>>>
>>> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use higher bits in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> To mitigate this, we are not going to allocate virtual address space
>>>> above 47-bit by default.
>>>>
>>>> But userspace can ask for allocation from full address space by
>>>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>>>
>>>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>>>> to look for unmapped area by specified address. If it's already
>>>> occupied, we look for unmapped area in *full* address space, rather than
>>>> from 47-bit window.
>>>
>>> Do you wish after the first over-47-bit mapping the following mmap()
>>> calls return also over-47-bits if there is free space?
>>> If so, you could simplify all this code by changing only mm->mmap_base
>>> on the first over-47-bit mmap() call.
>>> This will do simple trick.
>
> No.
>
> I want every allocation to explicitly opt in to the large address space. It's
> an additional fail-safe: if a library can't handle large addresses, it has a
> better chance of surviving if its own allocations stay within 47 bits.

Ok

>
>> I just tried to define it like this:
>> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
>> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
>> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
>> PAGE_SIZE))
>>
>> And it looks working better.
>
> Okay, thanks. I'll send v2.
>
>>>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>>>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>>>
>>> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
>>> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
>>> is set, which can do 64-bit syscalls - the subtraction will be
>>> a negative..
>
> With your proposed change to the DEFAULT_MAP_WINDOW definition it should be
> okay, right?

I'll comment on v2 to keep everything in one place.


-- 
              Dmitry


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:24         ` [PATCHv2 " Kirill A. Shutemov
@ 2017-04-07 11:32           ` Dmitry Safonov
  2017-04-07 15:44             ` [PATCHv3 " Kirill A. Shutemov
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  0 siblings, 2 replies; 46+ messages in thread
From: Dmitry Safonov @ 2017-04-07 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:24 AM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach makes it easy to teach an application's memory allocator
> about the large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without the MAWA extension) cannot handle addresses above 47 bits, so we
> need to make sure that MPX cannot be enabled if we already have a VMA above
> the boundary, and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 10 +++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..a98395e89ac6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
> +				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

That fixes 32-bit, but AFAICS we need to adjust some other places; I'll
point them out below.

>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);

That doesn't look like it works.
`end' is chosen between tasksize_32bit() and tasksize_64bit(),
which is ~4Gb or 47-bit. So info.high_limit will never go
above DEFAULT_MAP_WINDOW with this min().
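
To make the problem concrete, here is a small userspace model of the hunk
above. It is only a sketch: PAGE_SIZE_4K, DEFAULT_WINDOW, FULL_TASK_SIZE,
min_ul() and v2_high_limit() are stand-ins invented for the example, with
4K pages, a 56-bit user address space and a 64-bit unsigned long assumed.

```c
/* Stand-in constants, not the kernel macros. */
#define PAGE_SIZE_4K	4096UL
#define DEFAULT_WINDOW	((1UL << 47) - PAGE_SIZE_4K)	/* 47-bit window */
#define FULL_TASK_SIZE	((1UL << 56) - PAGE_SIZE_4K)	/* 56-bit user space */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Models the v2 logic: `end` has already been limited by tasksize_64bit(),
 * which v2 makes return the 47-bit window.
 */
static unsigned long v2_high_limit(unsigned long hint, unsigned long end)
{
	if (hint > DEFAULT_WINDOW)
		return min_ul(end, FULL_TASK_SIZE);
	return min_ul(end, DEFAULT_WINDOW);
}
```

With end == DEFAULT_WINDOW, both branches return DEFAULT_WINDOW, so a hint
above the window changes nothing in this model.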

Can we move this logic into find_start_end()?

Maybe something like:
if (in_compat_syscall())
   *end = tasksize_32bit();
else if (addr > tasksize_64bit())
   *end = TASK_SIZE_MAX;
else
   *end = tasksize_64bit();

In my view, it could be even simpler if we add a parameter
to tasksize_64bit():

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE)

unsigned long tasksize_64bit(int full_addr_space)
{
    return (full_addr_space) ? TASK_SIZE_MAX : TASK_SIZE_47BIT;
}
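
A standalone model of the selection I have in mind, so the three cases can
be checked in userspace. Everything here is hypothetical: pick_end() and the
LIMIT_* constants only stand in for find_start_end(), tasksize_32bit(),
the 47-bit window and TASK_SIZE_MAX (4K pages, 56-bit user space and a
64-bit unsigned long assumed).

```c
#include <stdbool.h>

/* Stand-in limits, not the kernel macros. */
#define PAGE_SIZE_4K	4096UL
#define LIMIT_32BIT	0xFFFFe000UL			/* roughly 4Gb */
#define LIMIT_47BIT	((1UL << 47) - PAGE_SIZE_4K)	/* default window */
#define LIMIT_FULL	((1UL << 56) - PAGE_SIZE_4K)	/* full user space */

/*
 * Pick the upper search bound: compat syscalls stay below ~4Gb, a hint
 * above the 47-bit window opens the full address space, and everything
 * else stays inside the window.
 */
static unsigned long pick_end(bool compat, unsigned long hint)
{
	if (compat)
		return LIMIT_32BIT;
	if (hint > LIMIT_47BIT)
		return LIMIT_FULL;
	return LIMIT_47BIT;
}
```

The point is that the bound depends only on the syscall type and the hint,
not on TIF_ADDR32.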

> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, looks like we do need in_compat_syscall() as you did,
because the x32 mmap() syscall takes an 8-byte parameter.
Maybe worth a comment.

Anyway, maybe something like that:
if (addr > tasksize_64bit() && !in_compat_syscall())
    info.high_limit += TASK_SIZE_MAX - tasksize_64bit();

This way it's more readable and clear, because we don't
need to keep the TIF_ADDR32 flag in mind while reading.
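
A rough userspace model of why flag-independent constants are safer here.
This is a sketch under assumptions: flag_dependent_delta(),
extend_topdown_limit() and the LIMIT_* constants are invented for the
example (4K pages, 56-bit user space, 64-bit unsigned long), and the first
function models the subtraction when TASK_SIZE follows TIF_ADDR32 but the
window does not.

```c
#include <stdbool.h>

/* Stand-in limits, not the kernel macros. */
#define PAGE_SIZE_4K	4096UL
#define LIMIT_32BIT	0xFFFFe000UL			/* roughly 4Gb */
#define LIMIT_47BIT	((1UL << 47) - PAGE_SIZE_4K)	/* default window */
#define LIMIT_FULL	((1UL << 56) - PAGE_SIZE_4K)	/* full user space */

/*
 * Flag-dependent form: if task_size is the 32-bit limit, the unsigned
 * subtraction wraps around to a huge value instead of going negative.
 */
static unsigned long flag_dependent_delta(unsigned long task_size)
{
	return task_size - LIMIT_47BIT;
}

/*
 * Flag-independent form: the extension is a fixed, non-negative amount
 * and is applied only for non-compat syscalls with a hint above the window.
 */
static unsigned long extend_topdown_limit(unsigned long mmap_base,
					  unsigned long hint, bool compat)
{
	if (hint > LIMIT_47BIT && !compat)
		return mmap_base + (LIMIT_FULL - LIMIT_47BIT);
	return mmap_base;
}
```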


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;

My suggestion about a new parameter is above, but at least
we need to avoid depending on TIF_ADDR32 here and return
the 64-bit size independent of the flag value:

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE)
unsigned long tasksize_64bit(void)
{
    return TASK_SIZE_47BIT;
}

Because for 32-bit ELFs it would always be 4Gb in your
case, while 32-bit ELFs can do 64-bit syscalls.

>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
  2017-04-06 18:43   ` Dmitry Safonov
@ 2017-04-07 13:35   ` Anshuman Khandual
  2017-04-07 15:59     ` Kirill A. Shutemov
  1 sibling, 1 reply; 46+ messages in thread
From: Anshuman Khandual @ 2017-04-07 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov

On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
> 
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.

I am wondering if the commitment of the virtual address range to
user space is a kind of API which needs to be maintained
thereafter. If that is the case, then we need to have a plan for
increasing it from the current level.

Will those JIT compilers keep using the higher bit positions of
the pointer forever? Then it will limit the ability of the
kernel to expand the virtual address range later as well. I am
not saying we should not increase it to the extent that it does
not affect any *known* user, but then we should not increase twice
for now, create the hint mechanism to be passed from the user
to avail beyond that (which will settle in as an expectation
from the kernel later on). Do the same thing again while
expanding the address range next time around. I think we need
to have a plan for this, particularly around the 'hint' mechanism
and whether it should be decided per mmap() request or at the
task level.


* [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 11:32           ` Dmitry Safonov
@ 2017-04-07 15:44             ` Kirill A. Shutemov
  2017-04-07 16:37               ` Dmitry Safonov
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  1 sibling, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:44 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If the hint address is set above 47-bit, but MAP_FIXED is not specified,
we try to look for an unmapped area at the specified address. If it's
already occupied, we look for an unmapped area in the *full* address
space, rather than in the 47-bit window.

This approach makes it easy to make an application's memory allocator
aware of the large address space without manually tracking allocated
virtual address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled if we already have a VMA
above the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 v3:
   - Address Dmitry feedback;
   - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
     instead, which would take TIF_ADDR32 into account.
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..2501ef7970f9 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0


* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 13:35   ` Anshuman Khandual
@ 2017-04-07 15:59     ` Kirill A. Shutemov
  2017-04-07 16:09       ` hpa
  2017-04-12 10:41       ` Michael Ellerman
  0 siblings, 2 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:59 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use higher bits in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> > 
> > To mitigate this, we are not going to allocate virtual address space
> > above 47-bit by default.
> 
> I am wondering if the commitment of virtual space range to the
> user space is kind of an API which needs to be maintained there
> after. If that is the case then we need to have some plans when
> increasing it from the current level.

I don't think we should ever enable the full address space for all
applications. There's no point.

/bin/true doesn't need more than 64TB of virtual memory.
And I hope it never will.

By increasing the virtual address space for everybody we will pay
(assuming the current page table format) at least one extra page per
process for moving the stack to the very end of the address space.

Yes, you can gain something in security by having more bits for ASLR, but
I don't think it's worth the cost.

> Will those JIT compilers keep using the higher bit positions of
> the pointer for ever ? Then it will limit the ability of the
> kernel to expand the virtual address range later as well. I am
> not saying we should not increase till the extent it does not
> affect any *known* user but then we should not increase twice
> for now, create the hint mechanism to be passed from the user
> to avail beyond that (which will settle in as a expectation
> from the kernel later on). Do the same thing again while
> expanding the address range next time around. I think we need
> to have a plan for this and particularly around 'hint' mechanism
> and whether it should be decided per mmap() request or at the
> task level.

I think the reasonable way for an application to claim it's 63-bit clean
is to make allocations with (void *)-1 as hint address.
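
To make the proposed opt-in concrete, here is a rough userspace sketch,
assuming the semantics described in this patch (a hint above the 47-bit
window widens the search to the full address space). The helper names
map_anon() and above_47bit() are hypothetical and only for illustration;
the hint is advisory, so on kernels or hardware without 5-level paging
the call may simply return a low address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: anonymous private mapping with an optional
 * opt-in to the full address space.
 */
static void *map_anon(size_t len, int want_full_space)
{
	/*
	 * A NULL hint keeps the default (47-bit) window. A hint above the
	 * window, here (void *)-1, is assumed to ask for the full space.
	 */
	void *hint = want_full_space ? (void *)-1 : NULL;
	void *p;

	assert(len > 0);
	p = mmap(hint, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if the pointer needs more than 47 bits, 0 otherwise. */
static int above_47bit(const void *p)
{
	return (uint64_t)(uintptr_t)p >= (1ULL << 47);
}
```

An allocator that stores tags in the upper pointer bits would keep
passing a NULL hint and never see such addresses; one that is clean
would call map_anon(len, 1) and can use above_47bit() to check what it
actually got.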

-- 
 Kirill A. Shutemov


* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:59     ` Kirill A. Shutemov
@ 2017-04-07 16:09       ` hpa
  2017-04-07 16:20         ` Kirill A. Shutemov
  2017-04-12 10:41       ` Michael Ellerman
  1 sibling, 1 reply; 46+ messages in thread
From: hpa @ 2017-04-07 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Andi Kleen, Dave Hansen,
	Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On April 7, 2017 8:59:45 AM PDT, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
>I don't think we should ever enable full address space for all
>applications. There's no point.
>
>/bin/true doesn't need more than 64TB of virtual memory.
>And I hope never will.
>
>By increasing virtual address space for everybody we will pay (assuming
>current page table format) at least one extra page per process for moving
>stack at very end of address space.
>
>Yes, you can gain something in security by having more bits for ASLR, but
>I don't think it worth the cost.
>
>> Will those JIT compilers keep using the higher bit positions of
>> the pointer for ever ? Then it will limit the ability of the
>> kernel to expand the virtual address range later as well. I am
>> not saying we should not increase till the extent it does not
>> affect any *known* user but then we should not increase twice
>> for now, create the hint mechanism to be passed from the user
>> to avail beyond that (which will settle in as a expectation
>> from the kernel later on). Do the same thing again while
>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
>I think the reasonable way for an application to claim it's 63-bit clean
>is to make allocations with (void *)-1 as hint address.

You realize that people have said that about just about every memory threshold from 64K onward?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 16:09       ` hpa
@ 2017-04-07 16:20         ` Kirill A. Shutemov
  0 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 16:20 UTC (permalink / raw)
  To: hpa
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 09:09:27AM -0700, hpa@zytor.com wrote:
> >I think the reasonable way for an application to claim it's 63-bit
> >clean
> >is to make allocations with (void *)-1 as hint address.
> 
> You realize that people have said that about just about every memory
> threshold from 64K onward?

Any better solution?

-- 
 Kirill A. Shutemov


* Re: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:44             ` [PATCHv3 " Kirill A. Shutemov
@ 2017-04-07 16:37               ` Dmitry Safonov
  0 siblings, 0 replies; 46+ messages in thread
From: Dmitry Safonov @ 2017-04-07 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 04/07/2017 06:44 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>

LGTM,
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>

Though I'm not very excited about the TASK_SIZE_LOW naming, I'm not good
at naming either, so maybe tglx will help.
Anyway, I don't see any problems with the code's logic now.
I've run it through the CRIU ia32 tests, which cover
32/64-bit mmap(), 64-bit mmap() from a 32-bit binary, the same with
MAP_32BIT, and some other not very pleasant corner cases.
That doesn't prove that mmap() works in *all* possible cases, though.
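
For anyone following the logic under review, the limit selection for
non-compat 64-bit mmap() calls can be modeled in user space roughly as
below. This is only a sketch, not kernel code: the MODEL_* names are made
up for illustration, the 4096-byte page size and the 56-bit user address
space are assumptions matching the 5-level paging case, and the compat,
MAP_32BIT and MPX paths are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE			4096ULL
/* Assumption: 56-bit user address space (5-level paging enabled). */
#define MODEL_TASK_SIZE_MAX		((1ULL << 56) - MODEL_PAGE_SIZE)
#define MODEL_DEFAULT_MAP_WINDOW	((1ULL << 47) - MODEL_PAGE_SIZE)

/* Mirrors tasksize_64bit(int full_addr_space) from the patch. */
static uint64_t model_tasksize_64bit(int full_addr_space)
{
	return full_addr_space ? MODEL_TASK_SIZE_MAX
			       : MODEL_DEFAULT_MAP_WINDOW;
}

/* Bottom-up search: upper limit depends only on the hint address. */
static uint64_t model_bottomup_limit(uint64_t hint)
{
	return model_tasksize_64bit(hint > MODEL_DEFAULT_MAP_WINDOW);
}

/* Top-down search: mmap_base is raised by the size of the extra range. */
static uint64_t model_topdown_limit(uint64_t mmap_base, uint64_t hint)
{
	assert(mmap_base <= MODEL_DEFAULT_MAP_WINDOW);
	if (hint > MODEL_DEFAULT_MAP_WINDOW)
		mmap_base += MODEL_TASK_SIZE_MAX - MODEL_DEFAULT_MAP_WINDOW;
	return mmap_base;
}
```

With a hint at or below the window both limits stay where they were
before the patch; with a higher hint the top-down limit moves up by
2^56 - 2^47 bytes, since the PAGE_SIZE terms cancel.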

P.S.:
JFYI: there is a rule to send new patch versions in a new thread;
otherwise the patch can lose the maintainers' attention. So they may ask
you to resend it.

> ---
>  v3:
>    - Address Dmitry feedback;
>    - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
>      instead, which would task TIF_ADDR32 into account.
> ---
>  arch/x86/include/asm/elf.h       |  4 ++--
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 11 ++++++++---
>  arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
>  arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
>  arch/x86/mm/mmap.c               |  6 +++---
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..2501ef7970f9 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> @@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
>  }
>
>  extern unsigned long tasksize_32bit(void);
> -extern unsigned long tasksize_64bit(void);
> +extern unsigned long tasksize_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..aaed58b03ddb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
>  					0xc0000000 : 0xFFFFe000)
>
> +#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
> +					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
>  #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		TASK_SIZE_LOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..74d1587b181d 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
>  	return error;
>  }
>
> -static void find_start_end(unsigned long flags, unsigned long *begin,
> -			   unsigned long *end)
> +static void find_start_end(unsigned long addr, unsigned long flags,
> +		unsigned long *begin, unsigned long *end)
>  {
>  	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
>  		/* This is usually used needed to map code in small
> @@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
>  	}
>
>  	*begin	= get_mmap_base(1);
> -	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
> +	if (in_compat_syscall())
> +		*end = tasksize_32bit();
> +	else
> +		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
>  }
>
>  unsigned long
> @@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> -	find_start_end(flags, &begin, &end);
> +	find_start_end(addr, flags, &begin, &end);
>
>  	if (len > end)
>  		return -ENOMEM;
> @@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 *
> +	 * !in_compat_syscall() check to avoid high addresses for x32.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..730f00250acb 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = get_mmap_base(1);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
>  	info.high_limit = in_compat_syscall() ?
> -		tasksize_32bit() : tasksize_64bit();
> +		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = TASK_SIZE_LOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..199050249d60 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
>  	return IA32_PAGE_OFFSET;
>  }
>
> -unsigned long tasksize_64bit(void)
> +unsigned long tasksize_64bit(int full_addr_space)
>  {
> -	return TASK_SIZE_MAX;
> +	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> @@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
>
>  	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
> -			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
> +			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
>
>  #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
>  	/*
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>
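
The decision order in the quoted mpx_unmapped_area_check() can be modelled in user space. This is a minimal sketch, not the kernel code: `model_mpx_check`, `mpx_enabled` and `map_fixed` are hypothetical names standing in for the function itself, kernel_managing_mpx_tables(current->mm) and (flags & MAP_FIXED), and PAGE_SIZE is assumed to be 4096.

```c
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE          UINT64_C(4096)	/* assumption: 4 KiB pages */
#define DEFAULT_MAP_WINDOW ((UINT64_C(1) << 47) - PAGE_SIZE)

/*
 * Hypothetical user-space model of the check in the quoted patch.
 * Returns the (possibly reset) hint address, or -ENOMEM cast to an
 * unsigned value, mirroring the IS_ERR_VALUE() convention.
 */
static uint64_t model_mpx_check(int mpx_enabled, uint64_t addr,
				uint64_t len, int map_fixed)
{
	/* Tasks that do not use MPX are not restricted. */
	if (!mpx_enabled)
		return addr;
	/* Request fits below the 47-bit window: keep the hint. */
	if (addr + len <= DEFAULT_MAP_WINDOW)
		return addr;
	/* A fixed mapping above the window cannot be relocated. */
	if (map_fixed)
		return (uint64_t)-ENOMEM;
	/* Larger than the whole window: resetting the hint will not help. */
	if (len > DEFAULT_MAP_WINDOW)
		return (uint64_t)-ENOMEM;
	/* Drop the hint so the search stays within DEFAULT_MAP_WINDOW. */
	return 0;
}
```

With MPX enabled, a non-fixed request with a hint above the window comes back as 0, so the caller searches the default window instead of failing.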


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-06 14:01 ` [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
@ 2017-04-11  7:02   ` Ingo Molnar
  2017-04-11 10:51     ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2017-04-11  7:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> This patch adds support for 5-level paging during early boot.
> It generalizes boot for 4- and 5-level paging on 64-bit systems with
> compile-time switch between them.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
>  arch/x86/include/asm/pgtable_64.h           |  2 ++
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
>  arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
>  5 files changed, 85 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index d2ae1f821e0c..3ed26769810b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -122,9 +122,12 @@ ENTRY(startup_32)
>  	addl	%ebp, gdt+2(%ebp)
>  	lgdt	gdt(%ebp)
>  
> -	/* Enable PAE mode */
> +	/* Enable PAE and LA57 mode */
>  	movl	%cr4, %eax
>  	orl	$X86_CR4_PAE, %eax
> +#ifdef CONFIG_X86_5LEVEL
> +	orl	$X86_CR4_LA57, %eax
> +#endif
>  	movl	%eax, %cr4
>  
>   /*
> @@ -136,13 +139,24 @@ ENTRY(startup_32)
>  	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
>  	rep	stosl
>  
> +	xorl	%edx, %edx
> +
> +	/* Build Top Level */
> +	leal	pgtable(%ebx,%edx,1), %edi
> +	leal	0x1007 (%edi), %eax
> +	movl	%eax, 0(%edi)
> +
> +#ifdef CONFIG_X86_5LEVEL
>  	/* Build Level 4 */
> -	leal	pgtable + 0(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007 (%edi), %eax
>  	movl	%eax, 0(%edi)
> +#endif
>  
>  	/* Build Level 3 */
> -	leal	pgtable + 0x1000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007(%edi), %eax
>  	movl	$4, %ecx
>  1:	movl	%eax, 0x00(%edi)
> @@ -152,7 +166,8 @@ ENTRY(startup_32)
>  	jnz	1b
>  
>  	/* Build Level 2 */
> -	leal	pgtable + 0x2000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	movl	$0x00000183, %eax
>  	movl	$2048, %ecx
>  1:	movl	%eax, 0(%edi)

I realize that you had difficulties converting this to C, but it's not going to 
get any easier in the future either, with one more paging mode/level added!

If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
trivial .c, build and link it in and call it separately. Then once that works, 
move functionality from asm to C step by step and test it at every step.

I've applied the first two patches of this series, but we really should convert 
this assembly bit to C too.
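
As a rough illustration of what the quoted assembly computes, here is a user-space C model of the table layout. It is only a sketch under stated assumptions: `build_boot_tables`, `base` and the buffer layout are hypothetical, it does not address the 32-bit code generation problem, and the per-entry increments (0x1000 per level-3 entry, 2 MB per level-2 entry) are not visible in the quoted hunk and are assumed here.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 0x1000u	/* one page-table page */
#define MAX_TABLES 7		/* top + optional level 4 + level 3 + 4 x level 2 */

/*
 * Hypothetical model of the early identity mapping. pgt must hold
 * MAX_TABLES * TABLE_SIZE bytes; base is the pretend physical address
 * of pgt. Returns the number of table pages used.
 */
static unsigned build_boot_tables(uint64_t *pgt, uint64_t base, int five_level)
{
	uint64_t off = 0;	/* plays the role of %edx in the assembly */
	uint64_t *tbl;
	unsigned i;

	memset(pgt, 0, MAX_TABLES * TABLE_SIZE);

	/* Top level: entry 0 points at the next table, flags 0x7. */
	tbl = pgt + off / 8;
	tbl[0] = base + off + 0x1007;

	if (five_level) {
		/* Extra level 4 table, same single entry. */
		off += TABLE_SIZE;
		tbl = pgt + off / 8;
		tbl[0] = base + off + 0x1007;
	}

	/* Level 3: four entries, one per level-2 table. */
	off += TABLE_SIZE;
	tbl = pgt + off / 8;
	for (i = 0; i < 4; i++)
		tbl[i] = base + off + 0x1007 + (uint64_t)i * TABLE_SIZE;

	/* Level 2: 2048 large-page entries with flags 0x183 (assumed 2 MB step). */
	off += TABLE_SIZE;
	tbl = pgt + off / 8;
	for (i = 0; i < 2048; i++)
		tbl[i] = 0x183 + (uint64_t)i * 0x200000;

	return (unsigned)(off / TABLE_SIZE) + 4;
}
```

The only difference between the two modes in this model is one extra single-entry table, which matches the way the patch threads a running offset through the existing code.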

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11  7:02   ` Ingo Molnar
@ 2017-04-11 10:51     ` Kirill A. Shutemov
  2017-04-11 11:28       ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> I realize that you had difficulties converting this to C, but it's not going to 
> get any easier in the future either, with one more paging mode/level added!
> 
> If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> trivial .c, build and link it in and call it separately. Then once that works, 
> move functionality from asm to C step by step and test it at every step.

I've described the specific issue with converting this code to C in the cover
letter: how to make the compiler generate 32-bit code for a specific
function or translation unit without breaking linking afterwards (-m32
breaks it).

I would be glad to convert it, but I'm stuck.

Do you have an idea how to get around the issue?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 10:51     ` Kirill A. Shutemov
@ 2017-04-11 11:28       ` Ingo Molnar
  2017-04-11 11:46         ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2017-04-11 11:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > I realize that you had difficulties converting this to C, but it's not going to 
> > get any easier in the future either, with one more paging mode/level added!
> > 
> > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > trivial .c, build and link it in and call it separately. Then once that works, 
> > move functionality from asm to C step by step and test it at every step.
> 
> I've described the specific issue with converting this code to C in cover
> letter: how to make compiler to generate 32-bit code for a specific
> function or translation unit, without breaking linking afterwards (-m32
> break it).

Have you tried putting it into a separate .c file, and building it 32-bit?

I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
code even on 64-bit kernels.

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 11:28       ` Ingo Molnar
@ 2017-04-11 11:46         ` Kirill A. Shutemov
  2017-04-11 14:09           ` Andi Kleen
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 11:46 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 01:28:45PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > > I realize that you had difficulties converting this to C, but it's not going to 
> > > get any easier in the future either, with one more paging mode/level added!
> > > 
> > > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > > trivial .c, build and link it in and call it separately. Then once that works, 
> > > move functionality from asm to C step by step and test it at every step.
> > 
> > I've described the specific issue with converting this code to C in cover
> > letter: how to make compiler to generate 32-bit code for a specific
> > function or translation unit, without breaking linking afterwards (-m32
> > break it).
> 
> Have you tried putting it into a separate .c file, and building it 32-bit?

Yes, I have. The patch below fails linking:

ld: i386 architecture of input file `arch/x86/boot/compressed/head64.o' is incompatible with i386:x86-64 output

> 
> I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
> code even on 64-bit kernels.

I'll look closer (the build process is rather complicated), but my
understanding is that the VDSO is a stand-alone binary and isn't really linked
with the rest of the kernel, but rather included as a blob, no?

Andy, maybe you have an idea?

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8c3868..8c1acacf408e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -76,6 +76,8 @@ vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
 ifdef CONFIG_X86_64
 	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/pagetable.o
+	vmlinux-objs-y += $(obj)/head64.o
+$(obj)/head64.o: KBUILD_CFLAGS := -m32 -D__KERNEL__ -O2
 endif
 
 $(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
diff --git a/arch/x86/boot/compressed/head64.c b/arch/x86/boot/compressed/head64.c
new file mode 100644
index 000000000000..42e1d64a15f4
--- /dev/null
+++ b/arch/x86/boot/compressed/head64.c
@@ -0,0 +1,3 @@
+void __startup32(void)
+{
+}
-- 
 Kirill A. Shutemov


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 11:46         ` Kirill A. Shutemov
@ 2017-04-11 14:09           ` Andi Kleen
  2017-04-12 10:18             ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Andi Kleen @ 2017-04-11 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

> I'll look closer (the build process is rather complicated), but my
> understanding is that the VDSO is a stand-alone binary and isn't really linked
> with the rest of the kernel, but rather included as a blob, no?
> 
> Andy, maybe you have an idea?

There isn't any way I know of to directly link them together. The ELF 
format wasn't designed for that. You would need to merge blobs and then use
manual jump vectors, like the 16bit startup code does. It would be likely
complicated and ugly.

-Andi


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 14:09           ` Andi Kleen
@ 2017-04-12 10:18             ` Kirill A. Shutemov
  2017-04-17 10:32               ` Ingo Molnar
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > I'll look closer (the build process is rather complicated), but my
> > understanding is that the VDSO is a stand-alone binary and isn't really linked
> > with the rest of the kernel, but rather included as a blob, no?
> > 
> > Andy, maybe you have an idea?
> 
> There isn't any way I know of to directly link them together. The ELF 
> format wasn't designed for that. You would need to merge blobs and then use
> manual jump vectors, like the 16bit startup code does. It would be likely
> complicated and ugly.

Ingo, can we proceed without converting this assembly to C?

I'm committed to converting it to C later if we find a reasonable solution
to the issue.

We're pretty late in the release cycle. It would be nice to give the whole
thing time in tip/master and -next before the merge window.

Can I repost part 4?

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:59     ` Kirill A. Shutemov
  2017-04-07 16:09       ` hpa
@ 2017-04-12 10:41       ` Michael Ellerman
  2017-04-12 11:11         ` Kirill A. Shutemov
  1 sibling, 1 reply; 46+ messages in thread
From: Michael Ellerman @ 2017-04-12 10:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov, Aneesh Kumar K.V

Hi Kirill,

I'm interested in this because we're doing pretty much the same thing on
powerpc at the moment, and I want to make sure x86 & powerpc end up with
compatible behaviour.

"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
> I don't think we should ever enable full address space for all
> applications. There's no point.
>
> /bin/true doesn't need more than 64TB of virtual memory.
> And I hope never will.
>
> By increasing virtual address space for everybody we will pay (assuming
> current page table format) at least one extra page per process for moving
> stack at very end of address space.

That assumes the current layout though, it could be different.

> Yes, you can gain something in security by having more bits for ASLR, but
> I don't think it worth the cost.

It may not be worth the cost now, for you, but that trade off will be
different for other people and at other times.

So I think it's quite likely some folks will be interested in the full
address range for ASLR.

>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
> I think the reasonable way for an application to claim it's 63-bit clean
> is to make allocations with (void *)-1 as hint address.

I do like the simplicity of that.

But I wouldn't be surprised if some (crappy) code out there already
passes an address of -1. Probably it won't break if it starts getting
high addresses, but who knows.

An alternative would be to only interpret the hint as requesting a large
address if it's >= 64TB && < TASK_SIZE_MAX.

If we're really worried about breaking userspace then a new MMAP flag
seems like the safest option?

I don't feel particularly strongly about any option, but like I said my
main concern is that x86 & powerpc end up with the same behaviour.

And whatever we end up with someone will need to do an update to the man
page for mmap.

cheers


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-12 10:41       ` Michael Ellerman
@ 2017-04-12 11:11         ` Kirill A. Shutemov
  0 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 11:11 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov, Aneesh Kumar K.V

On Wed, Apr 12, 2017 at 08:41:29PM +1000, Michael Ellerman wrote:
> Hi Kirill,
> 
> I'm interested in this because we're doing pretty much the same thing on
> powerpc at the moment, and I want to make sure x86 & powerpc end up with
> compatible behaviour.
> 
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> > On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> >> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use higher bits in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> > 
> >> > To mitigate this, we are not going to allocate virtual address space
> >> > above 47-bit by default.
> >> 
> >> I am wondering if the commitment of virtual space range to the
> >> user space is kind of an API which needs to be maintained there
> >> after. If that is the case then we need to have some plans when
> >> increasing it from the current level.
> >
> > I don't think we should ever enable full address space for all
> > applications. There's no point.
> >
> > /bin/true doesn't need more than 64TB of virtual memory.
> > And I hope never will.
> >
> > By increasing virtual address space for everybody we will pay (assuming
> > current page table format) at least one extra page per process for moving
> > stack at very end of address space.
> 
> That assumes the current layout though, it could be different.

True.

> > Yes, you can gain something in security by having more bits for ASLR, but
> > I don't think it worth the cost.
> 
> It may not be worth the cost now, for you, but that trade off will be
> different for other people and at other times.
> 
> So I think it's quite likely some folks will be interested in the full
> address range for ASLR.

We can always extend the interface if/when userspace demand materializes.

Let's not invent interfaces unless we're sure there's demand.

> >> expanding the address range next time around. I think we need
> >> to have a plan for this and particularly around 'hint' mechanism
> >> and whether it should be decided per mmap() request or at the
> >> task level.
> >
> > I think the reasonable way for an application to claim it's 63-bit clean
> > is to make allocations with (void *)-1 as hint address.
> 
> I do like the simplicity of that.
> 
> But I wouldn't be surprised if some (crappy) code out there already
> passes an address of -1. Probably it won't break if it starts getting
> high addresses, but who knows.

For an application to break, we need two things:

 - it sets the hint address to -1 by mistake;
 - it uses the upper bits to encode its info;

I would be surprised if such a combination exists in the real world.

But let me know if you have any particular code in mind.
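
For reference, the opt-in discussed here amounts to an ordinary mmap() call with a high hint address. The following is a minimal user-space sketch; `map_anywhere` and `above_47bit` are hypothetical helper names, and whether the returned address actually lies above 47 bits depends on the kernel and hardware.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS on older libcs */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory, passing (void *)-1 as the hint.
 * Without MAP_FIXED the address is only a hint, so the kernel is free
 * to place the mapping wherever it finds room.
 */
static void *map_anywhere(size_t len)
{
	return mmap((void *)-1, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Non-zero if the pointer needs more than 47 bits of address. */
static int above_47bit(const void *p)
{
	return (uint64_t)(uintptr_t)p >= (UINT64_C(1) << 47);
}
```

An application that keeps metadata in the upper pointer bits would have to check above_47bit() on every such mapping, which is exactly the kind of code that should not pass a high hint in the first place.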

> An alternative would be to only interpret the hint as requesting a large
> address if it's >= 64TB && < TASK_SIZE_MAX.

Nope. That doesn't work if you take into account further extension of the
address space.

Consider extending x86 to 6-level page tables. User-space has a 63-bit
address space. TASK_SIZE_MAX is bumped to (1UL << 63) - PAGE_SIZE.

An application wants access to the full address space. It gets recompiled
using the new TASK_SIZE_MAX as the hint address. And everything works fine.

But only on a machine with 6-level paging enabled.

If we run the same application binary on a machine with an older kernel and
5-level paging, the application will get access to only a 47-bit address
space, not 56-bit, as the hint address is more than TASK_SIZE_MAX in this
configuration.
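
The difference between the two policies can be shown with a small model. This is a sketch only: the function names are hypothetical, PAGE_SIZE is assumed to be 4096, the bounded policy uses the x86 47-bit window as its lower bound in place of the 64TB figure, and the hint is taken one page below the 6-level TASK_SIZE_MAX so that the strict "<" comparison passes on the 6-level machine.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE UINT64_C(4096)	/* assumption: 4 KiB pages */

/* TASK_SIZE_MAX for a user address space of va_bits bits. */
static uint64_t task_size_max(unsigned va_bits)
{
	assert(va_bits < 64);
	return (UINT64_C(1) << va_bits) - PAGE_SIZE;
}

/* Policy in the patch: any hint above the 47-bit window opts in. */
static uint64_t limit_any_high_hint(uint64_t hint, unsigned va_bits)
{
	uint64_t window = task_size_max(47);

	return hint > window ? task_size_max(va_bits) : window;
}

/* Alternative: only hints inside [window, TASK_SIZE_MAX) opt in. */
static uint64_t limit_bounded_hint(uint64_t hint, unsigned va_bits)
{
	uint64_t window = task_size_max(47);
	uint64_t max = task_size_max(va_bits);

	return (hint >= window && hint < max) ? max : window;
}
```

With a hint of task_size_max(63) - PAGE_SIZE, the bounded policy yields the full range when va_bits is 63 but falls back to the 47-bit window when va_bits is 56, while the first policy yields the 56-bit limit there.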

> If we're really worried about breaking userspace then a new MMAP flag
> seems like the safest option?
> 
> I don't feel particularly strongly about any option, but like I said my
> main concern is that x86 & powerpc end up with the same behaviour.
> 
> And whatever we end up with someone will need to do an update to the man
> page for mmap.

Sure.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4
  2017-04-07 11:32           ` Dmitry Safonov
  2017-04-07 15:44             ` [PATCHv3 " Kirill A. Shutemov
@ 2017-04-13 11:30             ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64 Kirill A. Shutemov
                                 ` (8 more replies)
  1 sibling, 9 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's an updated version of the fourth and last batch of patches that brings
initial 5-level paging enabling.

Please review and consider applying.

The situation with the assembly hasn't changed much. I still don't see a way to
get it to work.

In this version I've included a patch to fix a comment in return_from_SYSCALL_64,
fixed a bug in converting startup_64 to C and updated the patch which allows
opting in to the full address space.

Kirill A. Shutemov (9):
  x86/asm: Fix comment in return_from_SYSCALL_64
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/entry/entry_64.S                   |   3 +-
 arch/x86/include/asm/elf.h                  |   4 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |  11 ++-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 134 +++++++--------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  30 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   6 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 25 files changed, 470 insertions(+), 188 deletions(-)

-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
                                 ` (7 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

On x86-64 __VIRTUAL_MASK_SHIFT depends on paging mode now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
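
For readers less familiar with the shl/sar idiom that the comment describes,
here is a minimal user-space sketch of the same sign extension. It is
illustrative only and not kernel code: canonicalize() and its mask_shift
parameter are hypothetical names, with mask_shift standing in for
__VIRTUAL_MASK_SHIFT (47 with 4-level paging, 56 with 5-level paging). The
sketch uses an XOR/subtract form rather than the shift pair, because
right-shifting a negative signed value is implementation-defined in C; the
result should be the same.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy bit 'mask_shift' of 'addr' into every bit
 * above it, which is what the shl/sar pair in return_from_SYSCALL_64
 * does to %rcx.
 */
static uint64_t canonicalize(uint64_t addr, unsigned int mask_shift)
{
	uint64_t sign, low;

	assert(mask_shift < 63);

	sign = 1ULL << mask_shift;		/* most significant VA bit */
	low = addr & ((sign << 1) - 1);		/* drop everything above it */

	/* Flip the sign bit, then subtract it: classic sign extension. */
	return (low ^ sign) - sign;
}
```

An address that is already canonical comes back unchanged, so comparing the
result with the input is enough to detect a non-canonical value.
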
 arch/x86/entry/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 607d72c4a485..edec30584eb8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -266,7 +266,8 @@ return_from_SYSCALL_64:
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 *
-	 * Change top 16 bits to be the sign-extension of 47th bit
+	 * Change top bits to match most significant bit (47th or 56th bit
+	 * depending on paging mode) in the address.
 	 */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64 Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
                                 ` (6 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch rewrites most of the startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
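
The address arithmetic in __startup_64() can be tried out in user space. The
sketch below is an assumption-laden model, not the kernel code: the SK_*
constants are example values chosen for illustration (a kernel linked to run
at 16M), and sk_fixup_addr(), sk_load_delta() and sk_is_2m_aligned() are
hypothetical names that mirror fixup_pointer(), the load_delta computation and
the 2M alignment check using plain integers.

```c
#include <assert.h>
#include <stdint.h>

/* Example values only; the real ones come from the kernel link. */
#define SK_START_KERNEL_MAP	0xffffffff80000000ULL	/* __START_KERNEL_map */
#define SK_LINK_TEXT		0xffffffff81000000ULL	/* link address of _text */
#define SK_PMD_SIZE		(1ULL << 21)		/* 2M large page */

/* Where a linked symbol really sits when _text was loaded at physaddr. */
static uint64_t sk_fixup_addr(uint64_t linked, uint64_t physaddr)
{
	assert(linked >= SK_LINK_TEXT);
	return linked - SK_LINK_TEXT + physaddr;
}

/* Actual physical address of _text minus the one it was linked for. */
static uint64_t sk_load_delta(uint64_t physaddr)
{
	return physaddr - (SK_LINK_TEXT - SK_START_KERNEL_MAP);
}

/* The kernel is mapped with 2M pages, so the delta must be 2M aligned. */
static int sk_is_2m_aligned(uint64_t delta)
{
	return (delta & (SK_PMD_SIZE - 1)) == 0;
}
```

With these example values, loading at 48M instead of 16M gives a delta of 32M,
which passes the alignment check; loading at 16M + 4K does not.
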
 arch/x86/kernel/head64.c  | 81 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 95 ++---------------------------------------------
 2 files changed, 83 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,11 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	pushq	%rsi
+	call	__startup_64
+	popq	%rsi
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64 Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
                                 ` (5 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer the top level of the page
tables.

Let's give these variables more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1432d530fa35..0ae0bad4d4d5 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -77,7 +77,7 @@ startup_64:
 	call	__startup_64
 	popq	%rsi
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -97,7 +97,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -328,7 +328,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -338,14 +338,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index bce6990b1d81..0470826d2bdc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (2 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
                                 ` (4 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
a compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
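
To see why the fixups touch p4d[511], pud[510] and pud[511], it may help to
compute the table indices of __START_KERNEL_map by hand. The sketch below is a
user-space illustration under stated assumptions: sk_table_index() is a
hypothetical helper modelled on the p4d_index()/pud_index() macros in
head_64.S, the shift values (48, 39, 30, 21) are the x86-64 values I believe
apply with CONFIG_X86_5LEVEL=y, and SK_KERNEL_MAP is the usual value of
__START_KERNEL_map.

```c
#include <assert.h>
#include <stdint.h>

#define SK_KERNEL_MAP	0xffffffff80000000ULL	/* assumed __START_KERNEL_map */
#define SK_PTRS		512u			/* entries per table, all levels */

/* Assumed shifts with 5-level paging; with 4 levels the PGD shift is 39. */
enum {
	SK_PGD_SHIFT = 48,
	SK_P4D_SHIFT = 39,
	SK_PUD_SHIFT = 30,
	SK_PMD_SHIFT = 21,
};

/* Index of 'addr' in the table whose entries each cover 2^shift bytes. */
static unsigned int sk_table_index(uint64_t addr, unsigned int shift)
{
	assert(shift < 64);
	return (unsigned int)((addr >> shift) & (SK_PTRS - 1));
}
```

Under these assumptions the kernel mapping lands in the last slot of the top
level and of the p4d level, and in slot 510 of the pud level, which matches
the indices used in __startup_64().
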
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 0ae0bad4d4d5..7b527fa47536 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -100,11 +104,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -330,7 +337,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -343,9 +354,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -359,6 +370,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (3 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
                                 ` (3 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores a slightly modified version of the original
sync_global_pgds(), which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.
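
For illustration only (not part of the patch): a minimal userspace sketch of
the loop bounds, assuming a 64-bit address type and using STEP as a stand-in
for PGDIR_SIZE with 5-level paging. The helper name guarded_steps() is
hypothetical. It shows why the extra "address >= start" test is needed when
'end' sits at the top of the address space.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PGDIR_SIZE with 5-level paging (assumption: 1 << 48). */
#define STEP (UINT64_C(1) << 48)

/* Count iterations of the guarded loop over [start, end]. */
static uint64_t guarded_steps(uint64_t start, uint64_t end)
{
	uint64_t address, n = 0;

	assert(start <= end);
	/*
	 * If end == UINT64_MAX, "address <= end" is always true, so it can
	 * never stop the loop. "address >= start" becomes false as soon as
	 * "address += STEP" wraps around to a small value.
	 */
	for (address = start; address <= end && address >= start; address += STEP)
		n++;
	return n;
}
```

With start = 0xff00000000000000 and end = UINT64_MAX the loop visits 256
STEP-sized slots and then stops on the wrap-around instead of spinning.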

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (4 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
                                 ` (2 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate the additional page table level if CONFIG_X86_5LEVEL is enabled.
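
For illustration only (not part of the patch): a userspace sketch of how a
virtual address splits into table indices once the extra level exists. It
assumes the x86-64 5-level layout (12-bit page offset, five 9-bit indices);
the names split_vaddr() and struct pt_indices are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 5-level layout: each level indexes 512 entries. */
enum {
	SK_PAGE_SHIFT  = 12,
	SK_PMD_SHIFT   = 21,
	SK_PUD_SHIFT   = 30,
	SK_P4D_SHIFT   = 39,
	SK_PGDIR_SHIFT = 48,
	SK_PTRS        = 512,
};

struct pt_indices {
	unsigned pgd, p4d, pud, pmd, pte;
};

/* Extract the 9-bit table index that starts at bit 'shift'. */
static unsigned table_index(uint64_t vaddr, unsigned shift)
{
	assert(shift < 64);
	return (unsigned)((vaddr >> shift) & (SK_PTRS - 1));
}

/* Split a virtual address into one index per page table level. */
static struct pt_indices split_vaddr(uint64_t vaddr)
{
	struct pt_indices ix = {
		.pgd = table_index(vaddr, SK_PGDIR_SHIFT),
		.p4d = table_index(vaddr, SK_P4D_SHIFT),
		.pud = table_index(vaddr, SK_PUD_SHIFT),
		.pmd = table_index(vaddr, SK_PMD_SHIFT),
		.pte = table_index(vaddr, SK_PAGE_SHIFT),
	};
	return ix;
}
```

For 0xffffffff80000000 (__START_KERNEL_map) this gives pgd 511, p4d 511 and
pud 510. With 4-level paging the p4d level is folded, which is why
phys_p4d_init() simply hands the page to phys_pud_init() in that case.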

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (5 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 8/9] x86: Enable 5-level paging support Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging, randomization happens on the P4D level instead of PUD.

The maximum amount of physical memory is also bumped to 52 bits for 5-level
paging.
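
For illustration only (not part of the patch): a userspace sketch of the two
calculations this change touches. It assumes TB_SHIFT is 40, PUD_SHIFT is 30,
P4D_SHIFT is 39, and a physical address width of 46 bits (4-level) or 52 bits
(5-level); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_TB_SHIFT  40	/* 1 TB = 2^40 bytes */
#define SK_PUD_SHIFT 30	/* 1 GB granule, 4-level randomization */
#define SK_P4D_SHIFT 39	/* 512 GB granule, 5-level randomization */

/* Maximum direct-mapping size in TB for a given physical address width. */
static uint64_t max_phys_tb(unsigned phys_mask_shift)
{
	assert(phys_mask_shift >= SK_TB_SHIFT && phys_mask_shift < 64);
	return UINT64_C(1) << (phys_mask_shift - SK_TB_SHIFT);
}

/* Pick an offset in [0, entropy] and round it down to the granule. */
static uint64_t align_entropy(uint64_t rand, uint64_t entropy, int five_level)
{
	unsigned shift = five_level ? SK_P4D_SHIFT : SK_PUD_SHIFT;
	uint64_t mask = ~((UINT64_C(1) << shift) - 1);

	return (rand % (entropy + 1)) & mask;
}
```

Under these assumptions max_phys_tb(46) is 64, matching the constant the
patch replaces, and max_phys_tb(52) is 4096.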

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 8/9] x86: Enable 5-level paging support
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (6 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  2017-04-13 11:30               ` [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most things are in place and we can enable support for 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
                                 ` (7 preceding siblings ...)
  2017-04-13 11:30               ` [PATCHv4 8/9] x86: Enable 5-level paging support Kirill A. Shutemov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  8 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, linux-api

On x86, 5-level paging enables a 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. This collides with valid pointers under 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47 bits by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47 bits.

If the hint address is set above 47 bits, but MAP_FIXED is not specified,
we try to look for an unmapped area at the specified address. If it's
already occupied, we look for an unmapped area in the *full* address
space, rather than in the 47-bit window.

A high hint address only affects the allocation in question, not any
future mmap()s.

Specifying a high hint address on an older kernel or on a machine without
5-level paging support is safe. The hint will be ignored and the kernel
will fall back to allocation from the 47-bit address space.

This approach makes it easy to make an application's memory allocator aware
of the large address space without manually tracking allocated virtual
address space.
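
For illustration only (not part of the patch): a userspace sketch of opting a
single allocation into the full address space. It assumes a 64-bit Linux
process; the hint value 1 << 48 is an arbitrary choice above the 47-bit
window, and the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* First address outside the default 47-bit window. */
#define WINDOW_47BIT (UINT64_C(1) << 47)

/* Map anonymous memory with a hint above 47 bits and no MAP_FIXED. */
static void *map_with_high_hint(size_t len)
{
	/* Arbitrary hint above the window; it affects only this call. */
	void *hint = (void *)(uintptr_t)(UINT64_C(1) << 48);

	assert(len > 0);
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Report whether the kernel actually placed the mapping above 47 bits. */
static int is_above_47bit(const void *p)
{
	return (uint64_t)(uintptr_t)p >= WINDOW_47BIT;
}
```

An allocator would call map_with_high_hint() and must accept either outcome:
is_above_47bit() is expected to be true only on a kernel and CPU with 5-level
paging enabled, and false wherever the hint is ignored.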

One important case we need to handle here is the interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47 bits, so
we need to make sure that MPX cannot be enabled if we already have a VMA
above the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..7a30513a4046 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b767c84c..8c8da27e8549 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1030,3 +1039,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-12 10:18             ` Kirill A. Shutemov
@ 2017-04-17 10:32               ` Ingo Molnar
  2017-04-18  8:59                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2017-04-17 10:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > I'll look closer (the build process is rather complicated), but my
> > > understanding is that the VDSO is a stand-alone binary and doesn't really
> > > link with the rest of the kernel, but is rather included as a blob, no?
> > > 
> > > Andy, maybe you have an idea?
> > 
> > There isn't any way I know of to directly link them together. The ELF 
> > format wasn't designed for that. You would need to merge blobs and then use
> > manual jump vectors, like the 16bit startup code does. It would be likely
> > complicated and ugly.
> 
> Ingo, can we proceed without coverting this assembly to C?
> 
> I'm committed to convert it to C later if we'll find reasonable solution
> to the issue.

So one way to do it would be to build it standalone as a .o, then add it not to 
the regular kernel objects link target (as you found out it's not possible to link 
32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.

But there would be other complications with this approach, such as we'd have to 
add a size field and there might be symbol linking problems ...

Another, pretty hacky way would be to generate a .S from the .c, then post-process 
the .S and essentially generate today's 32-bit .S from it.

Probably not worth the trouble.

Thanks,

	Ingo

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-17 10:32               ` Ingo Molnar
@ 2017-04-18  8:59                 ` Kirill A. Shutemov
  2017-04-18 10:15                   ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > I'll look closer (the build process is rather complicated), but my
> > > > understanding is that the VDSO is a stand-alone binary that doesn't really link
> > > > with the rest of the kernel, but rather is included as a blob, no?
> > > > 
> > > > Andy, maybe you have an idea?
> > > 
> > > There isn't any way I know of to directly link them together. The ELF 
> > > format wasn't designed for that. You would need to merge blobs and then use
> > > manual jump vectors, like the 16-bit startup code does. It would likely be
> > > complicated and ugly.
> > 
> > Ingo, can we proceed without converting this assembly to C?
> > 
> > I'm committed to converting it to C later if we find a reasonable solution
> > to the issue.
> 
> So one way to do it would be to build it standalone as a .o, then add it not to 
> the regular kernel objects link target (as you found out it's not possible to link 
> 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> 
> But there would be other complications with this approach, such as we'd have to 
> add a size field and there might be symbol linking problems ...
> 
> Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> the .S and essentially generate today's 32-bit .S from it.
> 
> Probably not worth the trouble.

So, do I need to do anything else to get part 4 applied?

-- 
 Kirill A. Shutemov

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-18  8:59                 ` Kirill A. Shutemov
@ 2017-04-18 10:15                   ` Kirill A. Shutemov
  2017-04-18 11:10                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > I'll look closer (the build process is rather complicated), but my
> > > > > understanding is that the VDSO is a stand-alone binary that doesn't really link
> > > > > with the rest of the kernel, but rather is included as a blob, no?
> > > > > 
> > > > > Andy, maybe you have an idea?
> > > > 
> > > > There isn't any way I know of to directly link them together. The ELF 
> > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > manual jump vectors, like the 16-bit startup code does. It would likely be
> > > > complicated and ugly.
> > > 
> > > Ingo, can we proceed without converting this assembly to C?
> > > 
> > > I'm committed to converting it to C later if we find a reasonable solution
> > > to the issue.
> > 
> > So one way to do it would be to build it standalone as a .o, then add it not to 
> > the regular kernel objects link target (as you found out it's not possible to link 
> > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > 
> > But there would be other complications with this approach, such as we'd have to 
> > add a size field and there might be symbol linking problems ...
> > 
> > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > the .S and essentially generate today's 32-bit .S from it.
> > 
> > Probably not worth the trouble.
> 
> So, do I need to do anything else to get part 4 applied?

Doh!

I've just realized we don't really need to enable 5-level paging in
decompression code. Leaving 4-level paging there works perfectly fine.

I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
patchset.

-- 
 Kirill A. Shutemov

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-18 10:15                   ` Kirill A. Shutemov
@ 2017-04-18 11:10                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 46+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 01:15:34PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > > 
> > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > > I'll look closer (the build process is rather complicated), but my
> > > > > > understanding is that the VDSO is a stand-alone binary that doesn't really link
> > > > > > with the rest of the kernel, but rather is included as a blob, no?
> > > > > > 
> > > > > > Andy, maybe you have an idea?
> > > > > 
> > > > > There isn't any way I know of to directly link them together. The ELF 
> > > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > > manual jump vectors, like the 16-bit startup code does. It would likely be
> > > > > complicated and ugly.
> > > > 
> > > > Ingo, can we proceed without converting this assembly to C?
> > > > 
> > > > I'm committed to converting it to C later if we find a reasonable solution
> > > > to the issue.
> > > 
> > > So one way to do it would be to build it standalone as a .o, then add it not to 
> > > the regular kernel objects link target (as you found out it's not possible to link 
> > > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > > 
> > > But there would be other complications with this approach, such as we'd have to 
> > > add a size field and there might be symbol linking problems ...
> > > 
> > > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > > the .S and essentially generate today's 32-bit .S from it.
> > > 
> > > Probably not worth the trouble.
> > 
> > So, do I need to do anything else to get part 4 applied?
> 
> Doh!
> 
> I've just realized we don't really need to enable 5-level paging in
> decompression code. Leaving 4-level paging there works perfectly fine.
> 
> I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
> patchset.

No. This breaks KASLR. Decompression code has to use 5-level paging to
keep KASLR working.

So, v4 of part 4 is up-to-date.

Sorry for the noise.

-- 
 Kirill A. Shutemov

Thread overview: 46+ messages
2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
2017-04-06 14:00 ` [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
2017-04-11  7:02   ` Ingo Molnar
2017-04-11 10:51     ` Kirill A. Shutemov
2017-04-11 11:28       ` Ingo Molnar
2017-04-11 11:46         ` Kirill A. Shutemov
2017-04-11 14:09           ` Andi Kleen
2017-04-12 10:18             ` Kirill A. Shutemov
2017-04-17 10:32               ` Ingo Molnar
2017-04-18  8:59                 ` Kirill A. Shutemov
2017-04-18 10:15                   ` Kirill A. Shutemov
2017-04-18 11:10                     ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 7/8] x86: Enable 5-level paging support Kirill A. Shutemov
2017-04-06 14:52   ` Juergen Gross
2017-04-06 15:24     ` Kirill A. Shutemov
2017-04-06 15:56       ` Juergen Gross
2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
2017-04-06 18:43   ` Dmitry Safonov
2017-04-06 19:15     ` Dmitry Safonov
2017-04-06 23:21       ` Kirill A. Shutemov
2017-04-06 23:24         ` [PATCHv2 " Kirill A. Shutemov
2017-04-07 11:32           ` Dmitry Safonov
2017-04-07 15:44             ` [PATCHv3 " Kirill A. Shutemov
2017-04-07 16:37               ` Dmitry Safonov
2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64 Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 8/9] x86: Enable 5-level paging support Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
2017-04-07 10:06         ` [PATCH 8/8] " Dmitry Safonov
2017-04-07 13:35   ` Anshuman Khandual
2017-04-07 15:59     ` Kirill A. Shutemov
2017-04-07 16:09       ` hpa
2017-04-07 16:20         ` Kirill A. Shutemov
2017-04-12 10:41       ` Michael Ellerman
2017-04-12 11:11         ` Kirill A. Shutemov
