All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4
@ 2017-04-06 14:00 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's the fourth and the last bunch of of patches that brings initial
5-level paging enabling.

Please review and consider applying.

As Ingo requested I've tried to rewrite assembly parts of boot process
into C before bringing 5-level paging support. The only part where I
succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is
now in C.

I failed to rewrite startup_32 in arch/x86/boot/compressed/head_64.S in C.
The code I need to modify in still in 32-bit mode, but if I would move it
to C it will be compiled as 64-bit. I've tried to move it into separate
translation unit and compile it with -m32, but then linking phase fails
due to type mismatch of object files.

I also have trouble with rewriting secondary_startup_64. Stack breaks as
soon as we switch to new page tables when onlining secondary CPUs. I don't
know how to get around this.

I hope it's not show-stopper.

If you know how to get around these issues, let me know.

Kirill A. Shutemov (8):
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/include/asm/elf.h                  |   2 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |   9 +-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 132 ++++++---------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  28 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   2 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 24 files changed, 463 insertions(+), 180 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4
@ 2017-04-06 14:00 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's the fourth and the last bunch of of patches that brings initial
5-level paging enabling.

Please review and consider applying.

As Ingo requested I've tried to rewrite assembly parts of boot process
into C before bringing 5-level paging support. The only part where I
succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is
now in C.

I failed to rewrite startup_32 in arch/x86/boot/compressed/head_64.S in C.
The code I need to modify in still in 32-bit mode, but if I would move it
to C it will be compiled as 64-bit. I've tried to move it into separate
translation unit and compile it with -m32, but then linking phase fails
due to type mismatch of object files.

I also have trouble with rewriting secondary_startup_64. Stack breaks as
soon as we switch to new page tables when onlining secondary CPUs. I don't
know how to get around this.

I hope it's not show-stopper.

If you know how to get around these issues, let me know.

Kirill A. Shutemov (8):
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/include/asm/elf.h                  |   2 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |   9 +-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 132 ++++++---------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  28 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   2 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 24 files changed, 463 insertions(+), 180 deletions(-)

-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:00   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

The patch write most of startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 93 +----------------------------------------------
 2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..9656c5951b98 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	call	__startup_64
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C
@ 2017-04-06 14:00   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:00 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

The patch write most of startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 93 +----------------------------------------------
 2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..9656c5951b98 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	call	__startup_64
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..d44c350797bf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
 	leaq	_text(%rip), %rdi
 	call	__startup_64
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be71a72..6680cefc062e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..d44c350797bf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
 	leaq	_text(%rip), %rdi
 	call	__startup_64
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be71a72..6680cefc062e 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d44c350797bf..b24fc575a6da 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -98,11 +102,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -328,7 +335,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -341,9 +352,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -357,6 +368,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d44c350797bf..b24fc575a6da 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -98,11 +102,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -328,7 +335,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -341,9 +352,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -357,6 +368,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores slightly modified version of original
sync_global_pgds() which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores slightly modified version of original
sync_global_pgds() which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate additional page table level if CONFIG_X86_5LEVEL is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate additional page table level if CONFIG_X86_5LEVEL is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging randomization happens on P4D level instead of PUD.

Maximum amount of physical memory also bumped to 52-bits for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging randomization happens on P4D level instead of PUD.

Maximum amount of physical memory also bumped to 52-bits for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most of things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 7/8] x86: Enable 5-level paging support
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most of things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:00 ` Kirill A. Shutemov
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h |  9 ++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..9f437aea7f57 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 14:01   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 14:01 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h |  9 ++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..9f437aea7f57 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:01   ` Kirill A. Shutemov
@ 2017-04-06 14:52     ` Juergen Gross
  -1 siblings, 0 replies; 101+ messages in thread
From: Juergen Gross @ 2017-04-06 14:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 06/04/17 16:01, Kirill A. Shutemov wrote:
> Most of things are in place and we can enable support of 5-level paging.
> 
> Enabling XEN with 5-level paging requires more work. The patch makes XEN
> dependent on !X86_5LEVEL.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     | 5 +++++
>  arch/x86/xen/Kconfig | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4e153e93273f..7a76dcac357e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>  
>  config PGTABLE_LEVELS
>  	int
> +	default 5 if X86_5LEVEL
>  	default 4 if X86_64
>  	default 3 if X86_PAE
>  	default 2
> @@ -1390,6 +1391,10 @@ config X86_PAE
>  	  has the cost of more pagetable lookup overhead, and also
>  	  consumes more pagetable space per process.
>  
> +config X86_5LEVEL
> +	bool "Enable 5-level page tables support"
> +	depends on X86_64
> +
>  config ARCH_PHYS_ADDR_T_64BIT
>  	def_bool y
>  	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> index 76b6dbd627df..b90d481ce5a1 100644
> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -5,6 +5,7 @@
>  config XEN
>  	bool "Xen guest support"
>  	depends on PARAVIRT
> +	depends on !X86_5LEVEL
>  	select PARAVIRT_CLOCK
>  	select XEN_HAVE_PVMMU
>  	select XEN_HAVE_VPMU
> 

Just a heads up: this last change will conflict with the Xen tree.

Can't we just ignore the additional level in Xen pv mode and run with
4 levels instead?


Juergen

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
@ 2017-04-06 14:52     ` Juergen Gross
  0 siblings, 0 replies; 101+ messages in thread
From: Juergen Gross @ 2017-04-06 14:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 06/04/17 16:01, Kirill A. Shutemov wrote:
> Most of things are in place and we can enable support of 5-level paging.
> 
> Enabling XEN with 5-level paging requires more work. The patch makes XEN
> dependent on !X86_5LEVEL.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/Kconfig     | 5 +++++
>  arch/x86/xen/Kconfig | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4e153e93273f..7a76dcac357e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>  
>  config PGTABLE_LEVELS
>  	int
> +	default 5 if X86_5LEVEL
>  	default 4 if X86_64
>  	default 3 if X86_PAE
>  	default 2
> @@ -1390,6 +1391,10 @@ config X86_PAE
>  	  has the cost of more pagetable lookup overhead, and also
>  	  consumes more pagetable space per process.
>  
> +config X86_5LEVEL
> +	bool "Enable 5-level page tables support"
> +	depends on X86_64
> +
>  config ARCH_PHYS_ADDR_T_64BIT
>  	def_bool y
>  	depends on X86_64 || X86_PAE
> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> index 76b6dbd627df..b90d481ce5a1 100644
> --- a/arch/x86/xen/Kconfig
> +++ b/arch/x86/xen/Kconfig
> @@ -5,6 +5,7 @@
>  config XEN
>  	bool "Xen guest support"
>  	depends on PARAVIRT
> +	depends on !X86_5LEVEL
>  	select PARAVIRT_CLOCK
>  	select XEN_HAVE_PVMMU
>  	select XEN_HAVE_VPMU
> 

Just a heads up: this last change will conflict with the Xen tree.

Can't we just ignore the additional level in Xen pv mode and run with
4 levels instead?


Juergen

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 14:52     ` Juergen Gross
@ 2017-04-06 15:24       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 15:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
> On 06/04/17 16:01, Kirill A. Shutemov wrote:
> > Most of things are in place and we can enable support of 5-level paging.
> > 
> > Enabling XEN with 5-level paging requires more work. The patch makes XEN
> > dependent on !X86_5LEVEL.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig     | 5 +++++
> >  arch/x86/xen/Kconfig | 1 +
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 4e153e93273f..7a76dcac357e 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
> >  
> >  config PGTABLE_LEVELS
> >  	int
> > +	default 5 if X86_5LEVEL
> >  	default 4 if X86_64
> >  	default 3 if X86_PAE
> >  	default 2
> > @@ -1390,6 +1391,10 @@ config X86_PAE
> >  	  has the cost of more pagetable lookup overhead, and also
> >  	  consumes more pagetable space per process.
> >  
> > +config X86_5LEVEL
> > +	bool "Enable 5-level page tables support"
> > +	depends on X86_64
> > +
> >  config ARCH_PHYS_ADDR_T_64BIT
> >  	def_bool y
> >  	depends on X86_64 || X86_PAE
> > diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> > index 76b6dbd627df..b90d481ce5a1 100644
> > --- a/arch/x86/xen/Kconfig
> > +++ b/arch/x86/xen/Kconfig
> > @@ -5,6 +5,7 @@
> >  config XEN
> >  	bool "Xen guest support"
> >  	depends on PARAVIRT
> > +	depends on !X86_5LEVEL
> >  	select PARAVIRT_CLOCK
> >  	select XEN_HAVE_PVMMU
> >  	select XEN_HAVE_VPMU
> > 
> 
> Just a heads up: this last change will conflict with the Xen tree.

It should be trivial to fix, right? It's one-liner after all.

> Can't we just ignore the additional level in Xen pv mode and run with
> 4 levels instead?

We don't have yet boot-time switching between paging modes yet. It will
come later. So the answer is no.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
@ 2017-04-06 15:24       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 15:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
> On 06/04/17 16:01, Kirill A. Shutemov wrote:
> > Most of things are in place and we can enable support of 5-level paging.
> > 
> > Enabling XEN with 5-level paging requires more work. The patch makes XEN
> > dependent on !X86_5LEVEL.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/Kconfig     | 5 +++++
> >  arch/x86/xen/Kconfig | 1 +
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 4e153e93273f..7a76dcac357e 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
> >  
> >  config PGTABLE_LEVELS
> >  	int
> > +	default 5 if X86_5LEVEL
> >  	default 4 if X86_64
> >  	default 3 if X86_PAE
> >  	default 2
> > @@ -1390,6 +1391,10 @@ config X86_PAE
> >  	  has the cost of more pagetable lookup overhead, and also
> >  	  consumes more pagetable space per process.
> >  
> > +config X86_5LEVEL
> > +	bool "Enable 5-level page tables support"
> > +	depends on X86_64
> > +
> >  config ARCH_PHYS_ADDR_T_64BIT
> >  	def_bool y
> >  	depends on X86_64 || X86_PAE
> > diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
> > index 76b6dbd627df..b90d481ce5a1 100644
> > --- a/arch/x86/xen/Kconfig
> > +++ b/arch/x86/xen/Kconfig
> > @@ -5,6 +5,7 @@
> >  config XEN
> >  	bool "Xen guest support"
> >  	depends on PARAVIRT
> > +	depends on !X86_5LEVEL
> >  	select PARAVIRT_CLOCK
> >  	select XEN_HAVE_PVMMU
> >  	select XEN_HAVE_VPMU
> > 
> 
> Just a heads up: this last change will conflict with the Xen tree.

It should be trivial to fix, right? It's one-liner after all.

> Can't we just ignore the additional level in Xen pv mode and run with
> 4 levels instead?

We don't have yet boot-time switching between paging modes yet. It will
come later. So the answer is no.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
  2017-04-06 15:24       ` Kirill A. Shutemov
@ 2017-04-06 15:56         ` Juergen Gross
  -1 siblings, 0 replies; 101+ messages in thread
From: Juergen Gross @ 2017-04-06 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On 06/04/17 17:24, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
>> On 06/04/17 16:01, Kirill A. Shutemov wrote:
>>> Most of things are in place and we can enable support of 5-level paging.
>>>
>>> Enabling XEN with 5-level paging requires more work. The patch makes XEN
>>> dependent on !X86_5LEVEL.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  arch/x86/Kconfig     | 5 +++++
>>>  arch/x86/xen/Kconfig | 1 +
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 4e153e93273f..7a76dcac357e 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>>>  
>>>  config PGTABLE_LEVELS
>>>  	int
>>> +	default 5 if X86_5LEVEL
>>>  	default 4 if X86_64
>>>  	default 3 if X86_PAE
>>>  	default 2
>>> @@ -1390,6 +1391,10 @@ config X86_PAE
>>>  	  has the cost of more pagetable lookup overhead, and also
>>>  	  consumes more pagetable space per process.
>>>  
>>> +config X86_5LEVEL
>>> +	bool "Enable 5-level page tables support"
>>> +	depends on X86_64
>>> +
>>>  config ARCH_PHYS_ADDR_T_64BIT
>>>  	def_bool y
>>>  	depends on X86_64 || X86_PAE
>>> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>>> index 76b6dbd627df..b90d481ce5a1 100644
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -5,6 +5,7 @@
>>>  config XEN
>>>  	bool "Xen guest support"
>>>  	depends on PARAVIRT
>>> +	depends on !X86_5LEVEL
>>>  	select PARAVIRT_CLOCK
>>>  	select XEN_HAVE_PVMMU
>>>  	select XEN_HAVE_VPMU
>>>
>>
>> Just a heads up: this last change will conflict with the Xen tree.
> 
> It should be trivial to fix, right? It's one-liner after all.

Right. Just wanted to mention it.


Juergen

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 7/8] x86: Enable 5-level paging support
@ 2017-04-06 15:56         ` Juergen Gross
  0 siblings, 0 replies; 101+ messages in thread
From: Juergen Gross @ 2017-04-06 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On 06/04/17 17:24, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 04:52:11PM +0200, Juergen Gross wrote:
>> On 06/04/17 16:01, Kirill A. Shutemov wrote:
>>> Most of things are in place and we can enable support of 5-level paging.
>>>
>>> Enabling XEN with 5-level paging requires more work. The patch makes XEN
>>> dependent on !X86_5LEVEL.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  arch/x86/Kconfig     | 5 +++++
>>>  arch/x86/xen/Kconfig | 1 +
>>>  2 files changed, 6 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 4e153e93273f..7a76dcac357e 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
>>>  
>>>  config PGTABLE_LEVELS
>>>  	int
>>> +	default 5 if X86_5LEVEL
>>>  	default 4 if X86_64
>>>  	default 3 if X86_PAE
>>>  	default 2
>>> @@ -1390,6 +1391,10 @@ config X86_PAE
>>>  	  has the cost of more pagetable lookup overhead, and also
>>>  	  consumes more pagetable space per process.
>>>  
>>> +config X86_5LEVEL
>>> +	bool "Enable 5-level page tables support"
>>> +	depends on X86_64
>>> +
>>>  config ARCH_PHYS_ADDR_T_64BIT
>>>  	def_bool y
>>>  	depends on X86_64 || X86_PAE
>>> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>>> index 76b6dbd627df..b90d481ce5a1 100644
>>> --- a/arch/x86/xen/Kconfig
>>> +++ b/arch/x86/xen/Kconfig
>>> @@ -5,6 +5,7 @@
>>>  config XEN
>>>  	bool "Xen guest support"
>>>  	depends on PARAVIRT
>>> +	depends on !X86_5LEVEL
>>>  	select PARAVIRT_CLOCK
>>>  	select XEN_HAVE_PVMMU
>>>  	select XEN_HAVE_VPMU
>>>
>>
>> Just a heads up: this last change will conflict with the Xen tree.
> 
> It should be trivial to fix, right? It's one-liner after all.

Right. Just wanted to mention it.


Juergen

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:01   ` Kirill A. Shutemov
  (?)
@ 2017-04-06 18:43     ` Dmitry Safonov
  -1 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 18:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

Hi Kirill,

On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.

Do you wish after the first over-47-bit mapping the following mmap()
calls return also over-47-bits if there is free space?
It so, you could simplify all this code by changing only mm->mmap_base
on the first over-47-bit mmap() call.
This will do simple trick.

>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h |  9 ++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)

This will kill 32-bit userspace:
As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.

Here is the test:
[root@localhost test]# cat hello-world.c
#include <stdio.h>

int main(int argc, char **argv)
{
	printf("Maybe this world is another planet's hell.\n");
         return 0;
}
[root@localhost test]# gcc -m32 hello-world.c -o hello-world
[root@localhost test]# ./hello-world
[   35.306726] hello-world[1948]: segfault at ffa5288c ip 
00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
Segmentation fault (core dumped)

So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.


>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..9f437aea7f57 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

ditto

>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
is set, which can do 64-bit syscalls - the subtraction will be
a negative..


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

ditto

> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;

ditto about 32-bits

>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 18:43     ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 18:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

Hi Kirill,

On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.

Do you wish after the first over-47-bit mapping the following mmap()
calls return also over-47-bits if there is free space?
It so, you could simplify all this code by changing only mm->mmap_base
on the first over-47-bit mmap() call.
This will do simple trick.

>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h |  9 ++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)

This will kill 32-bit userspace:
As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.

Here is the test:
[root@localhost test]# cat hello-world.c
#include <stdio.h>

int main(int argc, char **argv)
{
	printf("Maybe this world is another planet's hell.\n");
         return 0;
}
[root@localhost test]# gcc -m32 hello-world.c -o hello-world
[root@localhost test]# ./hello-world
[   35.306726] hello-world[1948]: segfault at ffa5288c ip 
00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
Segmentation fault (core dumped)

So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.


>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..9f437aea7f57 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

ditto

>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
is set, which can do 64-bit syscalls - the subtraction will be
a negative..


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

ditto

> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;

ditto about 32-bits

>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 18:43     ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 18:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

Hi Kirill,

On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.

Do you wish after the first over-47-bit mapping the following mmap()
calls return also over-47-bits if there is free space?
It so, you could simplify all this code by changing only mm->mmap_base
on the first over-47-bit mmap() call.
This will do simple trick.

>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h |  9 ++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 100 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)

This will kill 32-bit userspace:
As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.

Here is the test:
[root@localhost test]# cat hello-world.c
#include <stdio.h>

int main(int argc, char **argv)
{
	printf("Maybe this world is another planet's hell.\n");
         return 0;
}
[root@localhost test]# gcc -m32 hello-world.c -o hello-world
[root@localhost test]# ./hello-world
[   35.306726] hello-world[1948]: segfault at ffa5288c ip 
00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
Segmentation fault (core dumped)

So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.


>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..9f437aea7f57 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)

ditto

>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
is set, which can do 64-bit syscalls - the subtraction will be
a negative..


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

ditto

> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;

ditto about 32-bits

>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 18:43     ` Dmitry Safonov
  (?)
@ 2017-04-06 19:15       ` Dmitry Safonov
  -1 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 19:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> Hi Kirill,
>
> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> Not all user space is ready to handle wide addresses. It's known that
>> at least some JIT compilers use higher bits in pointers to encode their
>> information. It collides with valid pointers with 5-level paging and
>> leads to crashes.
>>
>> To mitigate this, we are not going to allocate virtual address space
>> above 47-bit by default.
>>
>> But userspace can ask for allocation from full address space by
>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>
>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>> to look for unmapped area by specified address. If it's already
>> occupied, we look for unmapped area in *full* address space, rather than
>> from 47-bit window.
>
> Do you wish after the first over-47-bit mapping the following mmap()
> calls return also over-47-bits if there is free space?
> It so, you could simplify all this code by changing only mm->mmap_base
> on the first over-47-bit mmap() call.
> This will do simple trick.
>
>>
>> This approach helps to easily make application's memory allocator aware
>> about large address space without manually tracking allocated virtual
>> address space.
>>
>> One important case we need to handle here is interaction with MPX.
>> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
>> need to make sure that MPX cannot be enabled we already have VMA above
>> the boundary and forbid creating such VMAs once MPX is enabled.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>  arch/x86/include/asm/elf.h       |  2 +-
>>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>>  arch/x86/include/asm/processor.h |  9 ++++++---
>>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>>  arch/x86/mm/mmap.c               |  2 +-
>>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>>  7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
>> index d4d3ed456cb7..67260dbe1688 100644
>> --- a/arch/x86/include/asm/elf.h
>> +++ b/arch/x86/include/asm/elf.h
>> @@ -250,7 +250,7 @@ extern int force_personality32;
>>     the loader.  We need to make sure that it is out of the way of the
>> program
>>     that it will "exec", and that there is sufficient room for the
>> brk.  */
>>
>> -#define ELF_ET_DYN_BASE        (TASK_SIZE / 3 * 2)
>> +#define ELF_ET_DYN_BASE        (DEFAULT_MAP_WINDOW / 3 * 2)
>
> This will kill 32-bit userspace:
> As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
> not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.
>
> Here is the test:
> [root@localhost test]# cat hello-world.c
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     printf("Maybe this world is another planet's hell.\n");
>         return 0;
> }
> [root@localhost test]# gcc -m32 hello-world.c -o hello-world
> [root@localhost test]# ./hello-world
> [   35.306726] hello-world[1948]: segfault at ffa5288c ip
> 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
> Segmentation fault (core dumped)
>
> So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.

I just tried to define it like this:
-#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
+#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
+                               IA32_PAGE_OFFSET : ((1UL << 47) - 
PAGE_SIZE))

And it looks working better.

>
>
>>
>>  /* This yields a mask that user programs can use to figure out what
>>     instruction set this CPU supports.  This could be done in user space,
>> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
>> index a0d662be4c5b..7d7404756bb4 100644
>> --- a/arch/x86/include/asm/mpx.h
>> +++ b/arch/x86/include/asm/mpx.h
>> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>>  }
>>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>>                unsigned long start, unsigned long end);
>> +
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags);
>>  #else
>>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>>  {
>> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct
>> mm_struct *mm,
>>                      unsigned long start, unsigned long end)
>>  {
>>  }
>> +
>> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
>> +        unsigned long len, unsigned long flags)
>> +{
>> +    return addr;
>> +}
>>  #endif /* CONFIG_X86_INTEL_MPX */
>>
>>  #endif /* _ASM_X86_MPX_H */
>> diff --git a/arch/x86/include/asm/processor.h
>> b/arch/x86/include/asm/processor.h
>> index 3cada998a402..9f437aea7f57 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define IA32_PAGE_OFFSET    PAGE_OFFSET
>>  #define TASK_SIZE        PAGE_OFFSET
>>  #define TASK_SIZE_MAX        TASK_SIZE
>> +#define DEFAULT_MAP_WINDOW    TASK_SIZE
>>  #define STACK_TOP        TASK_SIZE
>>  #define STACK_TOP_MAX        STACK_TOP
>>
>> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>>   * particular problem by preventing anything from being mapped
>>   * at the maximum canonical address.
>>   */
>> -#define TASK_SIZE_MAX    ((1UL << 47) - PAGE_SIZE)
>> +#define TASK_SIZE_MAX    ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>> +
>> +#define DEFAULT_MAP_WINDOW    ((1UL << 47) - PAGE_SIZE)
>>
>>  /* This decides where the kernel will search for a free chunk of vm
>>   * space during mmap's.
>> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define TASK_SIZE_OF(child)    ((test_tsk_thread_flag(child,
>> TIF_ADDR32)) ? \
>>                      IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>>
>> -#define STACK_TOP        TASK_SIZE
>> +#define STACK_TOP        DEFAULT_MAP_WINDOW
>>  #define STACK_TOP_MAX        TASK_SIZE_MAX
>>
>>  #define INIT_THREAD  {                        \
>> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs,
>> unsigned long new_ip,
>>   * space during mmap's.
>>   */
>>  #define __TASK_UNMAPPED_BASE(task_size)    (PAGE_ALIGN(task_size / 3))
>> -#define TASK_UNMAPPED_BASE        __TASK_UNMAPPED_BASE(TASK_SIZE)
>> +#define TASK_UNMAPPED_BASE
>> __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> ditto
>
>>
>>  #define KSTK_EIP(task)        (task_pt_regs(task)->ip)
>>
>> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
>> index 207b8f2582c7..593a31e93812 100644
>> --- a/arch/x86/kernel/sys_x86_64.c
>> +++ b/arch/x86/kernel/sys_x86_64.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/compat.h>
>>  #include <asm/ia32.h>
>>  #include <asm/syscalls.h>
>> +#include <asm/mpx.h>
>>
>>  /*
>>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
>> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      struct vm_unmapped_area_info info;
>>      unsigned long begin, end;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (flags & MAP_FIXED)
>>          return addr;
>>
>> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      info.flags = 0;
>>      info.length = len;
>>      info.low_limit = begin;
>> -    info.high_limit = end;
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = min(end, TASK_SIZE);
>> +    else
>> +        info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      unsigned long addr = addr0;
>>      struct vm_unmapped_area_info info;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      /* requested length too big for entire address space */
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> is set, which can do 64-bit syscalls - the subtraction will be
> a negative..
>
>
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 302f43fd9c28..9a0b89252c52 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -18,6 +18,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/elf.h>
>> +#include <asm/mpx.h>
>>
>>  #if 0    /* This is just for testing */
>>  struct page *
>> @@ -87,23 +88,38 @@ static unsigned long
>> hugetlb_get_unmapped_area_bottomup(struct file *file,
>>      info.low_limit = get_mmap_base(1);
>>      info.high_limit = in_compat_syscall() ?
>>          tasksize_32bit() : tasksize_64bit();
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = TASK_SIZE;
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      return vm_unmapped_area(&info);
>>  }
>>
>>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file
>> *file,
>> -        unsigned long addr0, unsigned long len,
>> +        unsigned long addr, unsigned long len,
>>          unsigned long pgoff, unsigned long flags)
>>  {
>>      struct hstate *h = hstate_file(file);
>>      struct vm_unmapped_area_info info;
>> -    unsigned long addr;
>>
>>      info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> ditto
>
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      addr = vm_unmapped_area(&info);
>> @@ -118,7 +134,7 @@ static unsigned long
>> hugetlb_get_unmapped_area_topdown(struct file *file,
>>          VM_BUG_ON(addr != -ENOMEM);
>>          info.flags = 0;
>>          info.low_limit = TASK_UNMAPPED_BASE;
>> -        info.high_limit = TASK_SIZE;
>> +        info.high_limit = DEFAULT_MAP_WINDOW;
>
> ditto about 32-bits
>
>>          addr = vm_unmapped_area(&info);
>>      }
>>
>> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file,
>> unsigned long addr,
>>
>>      if (len & ~huge_page_mask(h))
>>          return -EINVAL;
>> +
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>>
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 19ad095b41df..d63232a31945 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>>
>>  unsigned long tasksize_64bit(void)
>>  {
>> -    return TASK_SIZE_MAX;
>> +    return DEFAULT_MAP_WINDOW;
>>  }
>>
>>  static unsigned long stack_maxrandom_size(unsigned long task_size)
>> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
>> index cd44ae727df7..a26a1b373fd0 100644
>> --- a/arch/x86/mm/mpx.c
>> +++ b/arch/x86/mm/mpx.c
>> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>>       */
>>      bd_base = mpx_get_bounds_dir();
>>      down_write(&mm->mmap_sem);
>> +
>> +    /* MPX doesn't support addresses above 47-bits yet. */
>> +    if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
>> +        pr_warn_once("%s (%d): MPX cannot handle addresses "
>> +                "above 47-bits. Disabling.",
>> +                current->comm, current->pid);
>> +        ret = -ENXIO;
>> +        goto out;
>> +    }
>>      mm->context.bd_addr = bd_base;
>>      if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>>          ret = -ENXIO;
>> -
>> +out:
>>      up_write(&mm->mmap_sem);
>>      return ret;
>>  }
>> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>>      if (ret)
>>          force_sig(SIGSEGV, current);
>>  }
>> +
>> +/* MPX cannot handle addresses above 47-bits yet. */
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags)
>> +{
>> +    if (!kernel_managing_mpx_tables(current->mm))
>> +        return addr;
>> +    if (addr + len <= DEFAULT_MAP_WINDOW)
>> +        return addr;
>> +    if (flags & MAP_FIXED)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Requested len is larger than whole area we're allowed to map in.
>> +     * Resetting hinting address wouldn't do much good -- fail early.
>> +     */
>> +    if (len > DEFAULT_MAP_WINDOW)
>> +        return -ENOMEM;
>> +
>> +    /* Look for unmap area within DEFAULT_MAP_WINDOW */
>> +    return 0;
>> +}
>>
>
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 19:15       ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 19:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> Hi Kirill,
>
> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> Not all user space is ready to handle wide addresses. It's known that
>> at least some JIT compilers use higher bits in pointers to encode their
>> information. It collides with valid pointers with 5-level paging and
>> leads to crashes.
>>
>> To mitigate this, we are not going to allocate virtual address space
>> above 47-bit by default.
>>
>> But userspace can ask for allocation from full address space by
>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>
>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>> to look for unmapped area by specified address. If it's already
>> occupied, we look for unmapped area in *full* address space, rather than
>> from 47-bit window.
>
> Do you wish after the first over-47-bit mapping the following mmap()
> calls return also over-47-bits if there is free space?
> It so, you could simplify all this code by changing only mm->mmap_base
> on the first over-47-bit mmap() call.
> This will do simple trick.
>
>>
>> This approach helps to easily make application's memory allocator aware
>> about large address space without manually tracking allocated virtual
>> address space.
>>
>> One important case we need to handle here is interaction with MPX.
>> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
>> need to make sure that MPX cannot be enabled we already have VMA above
>> the boundary and forbid creating such VMAs once MPX is enabled.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>  arch/x86/include/asm/elf.h       |  2 +-
>>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>>  arch/x86/include/asm/processor.h |  9 ++++++---
>>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>>  arch/x86/mm/mmap.c               |  2 +-
>>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>>  7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
>> index d4d3ed456cb7..67260dbe1688 100644
>> --- a/arch/x86/include/asm/elf.h
>> +++ b/arch/x86/include/asm/elf.h
>> @@ -250,7 +250,7 @@ extern int force_personality32;
>>     the loader.  We need to make sure that it is out of the way of the
>> program
>>     that it will "exec", and that there is sufficient room for the
>> brk.  */
>>
>> -#define ELF_ET_DYN_BASE        (TASK_SIZE / 3 * 2)
>> +#define ELF_ET_DYN_BASE        (DEFAULT_MAP_WINDOW / 3 * 2)
>
> This will kill 32-bit userspace:
> As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
> not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.
>
> Here is the test:
> [root@localhost test]# cat hello-world.c
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     printf("Maybe this world is another planet's hell.\n");
>         return 0;
> }
> [root@localhost test]# gcc -m32 hello-world.c -o hello-world
> [root@localhost test]# ./hello-world
> [   35.306726] hello-world[1948]: segfault at ffa5288c ip
> 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
> Segmentation fault (core dumped)
>
> So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.

I just tried to define it like this:
-#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
+#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
+                               IA32_PAGE_OFFSET : ((1UL << 47) - 
PAGE_SIZE))

And it looks working better.

>
>
>>
>>  /* This yields a mask that user programs can use to figure out what
>>     instruction set this CPU supports.  This could be done in user space,
>> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
>> index a0d662be4c5b..7d7404756bb4 100644
>> --- a/arch/x86/include/asm/mpx.h
>> +++ b/arch/x86/include/asm/mpx.h
>> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>>  }
>>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>>                unsigned long start, unsigned long end);
>> +
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags);
>>  #else
>>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>>  {
>> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct
>> mm_struct *mm,
>>                      unsigned long start, unsigned long end)
>>  {
>>  }
>> +
>> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
>> +        unsigned long len, unsigned long flags)
>> +{
>> +    return addr;
>> +}
>>  #endif /* CONFIG_X86_INTEL_MPX */
>>
>>  #endif /* _ASM_X86_MPX_H */
>> diff --git a/arch/x86/include/asm/processor.h
>> b/arch/x86/include/asm/processor.h
>> index 3cada998a402..9f437aea7f57 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define IA32_PAGE_OFFSET    PAGE_OFFSET
>>  #define TASK_SIZE        PAGE_OFFSET
>>  #define TASK_SIZE_MAX        TASK_SIZE
>> +#define DEFAULT_MAP_WINDOW    TASK_SIZE
>>  #define STACK_TOP        TASK_SIZE
>>  #define STACK_TOP_MAX        STACK_TOP
>>
>> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>>   * particular problem by preventing anything from being mapped
>>   * at the maximum canonical address.
>>   */
>> -#define TASK_SIZE_MAX    ((1UL << 47) - PAGE_SIZE)
>> +#define TASK_SIZE_MAX    ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>> +
>> +#define DEFAULT_MAP_WINDOW    ((1UL << 47) - PAGE_SIZE)
>>
>>  /* This decides where the kernel will search for a free chunk of vm
>>   * space during mmap's.
>> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define TASK_SIZE_OF(child)    ((test_tsk_thread_flag(child,
>> TIF_ADDR32)) ? \
>>                      IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>>
>> -#define STACK_TOP        TASK_SIZE
>> +#define STACK_TOP        DEFAULT_MAP_WINDOW
>>  #define STACK_TOP_MAX        TASK_SIZE_MAX
>>
>>  #define INIT_THREAD  {                        \
>> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs,
>> unsigned long new_ip,
>>   * space during mmap's.
>>   */
>>  #define __TASK_UNMAPPED_BASE(task_size)    (PAGE_ALIGN(task_size / 3))
>> -#define TASK_UNMAPPED_BASE        __TASK_UNMAPPED_BASE(TASK_SIZE)
>> +#define TASK_UNMAPPED_BASE
>> __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> ditto
>
>>
>>  #define KSTK_EIP(task)        (task_pt_regs(task)->ip)
>>
>> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
>> index 207b8f2582c7..593a31e93812 100644
>> --- a/arch/x86/kernel/sys_x86_64.c
>> +++ b/arch/x86/kernel/sys_x86_64.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/compat.h>
>>  #include <asm/ia32.h>
>>  #include <asm/syscalls.h>
>> +#include <asm/mpx.h>
>>
>>  /*
>>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
>> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      struct vm_unmapped_area_info info;
>>      unsigned long begin, end;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (flags & MAP_FIXED)
>>          return addr;
>>
>> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      info.flags = 0;
>>      info.length = len;
>>      info.low_limit = begin;
>> -    info.high_limit = end;
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = min(end, TASK_SIZE);
>> +    else
>> +        info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      unsigned long addr = addr0;
>>      struct vm_unmapped_area_info info;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      /* requested length too big for entire address space */
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> is set, which can do 64-bit syscalls - the subtraction will be
> a negative..
>
>
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 302f43fd9c28..9a0b89252c52 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -18,6 +18,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/elf.h>
>> +#include <asm/mpx.h>
>>
>>  #if 0    /* This is just for testing */
>>  struct page *
>> @@ -87,23 +88,38 @@ static unsigned long
>> hugetlb_get_unmapped_area_bottomup(struct file *file,
>>      info.low_limit = get_mmap_base(1);
>>      info.high_limit = in_compat_syscall() ?
>>          tasksize_32bit() : tasksize_64bit();
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = TASK_SIZE;
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      return vm_unmapped_area(&info);
>>  }
>>
>>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file
>> *file,
>> -        unsigned long addr0, unsigned long len,
>> +        unsigned long addr, unsigned long len,
>>          unsigned long pgoff, unsigned long flags)
>>  {
>>      struct hstate *h = hstate_file(file);
>>      struct vm_unmapped_area_info info;
>> -    unsigned long addr;
>>
>>      info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> ditto
>
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      addr = vm_unmapped_area(&info);
>> @@ -118,7 +134,7 @@ static unsigned long
>> hugetlb_get_unmapped_area_topdown(struct file *file,
>>          VM_BUG_ON(addr != -ENOMEM);
>>          info.flags = 0;
>>          info.low_limit = TASK_UNMAPPED_BASE;
>> -        info.high_limit = TASK_SIZE;
>> +        info.high_limit = DEFAULT_MAP_WINDOW;
>
> ditto about 32-bits
>
>>          addr = vm_unmapped_area(&info);
>>      }
>>
>> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file,
>> unsigned long addr,
>>
>>      if (len & ~huge_page_mask(h))
>>          return -EINVAL;
>> +
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>>
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 19ad095b41df..d63232a31945 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>>
>>  unsigned long tasksize_64bit(void)
>>  {
>> -    return TASK_SIZE_MAX;
>> +    return DEFAULT_MAP_WINDOW;
>>  }
>>
>>  static unsigned long stack_maxrandom_size(unsigned long task_size)
>> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
>> index cd44ae727df7..a26a1b373fd0 100644
>> --- a/arch/x86/mm/mpx.c
>> +++ b/arch/x86/mm/mpx.c
>> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>>       */
>>      bd_base = mpx_get_bounds_dir();
>>      down_write(&mm->mmap_sem);
>> +
>> +    /* MPX doesn't support addresses above 47-bits yet. */
>> +    if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
>> +        pr_warn_once("%s (%d): MPX cannot handle addresses "
>> +                "above 47-bits. Disabling.",
>> +                current->comm, current->pid);
>> +        ret = -ENXIO;
>> +        goto out;
>> +    }
>>      mm->context.bd_addr = bd_base;
>>      if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>>          ret = -ENXIO;
>> -
>> +out:
>>      up_write(&mm->mmap_sem);
>>      return ret;
>>  }
>> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>>      if (ret)
>>          force_sig(SIGSEGV, current);
>>  }
>> +
>> +/* MPX cannot handle addresses above 47-bits yet. */
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags)
>> +{
>> +    if (!kernel_managing_mpx_tables(current->mm))
>> +        return addr;
>> +    if (addr + len <= DEFAULT_MAP_WINDOW)
>> +        return addr;
>> +    if (flags & MAP_FIXED)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Requested len is larger than whole area we're allowed to map in.
>> +     * Resetting hinting address wouldn't do much good -- fail early.
>> +     */
>> +    if (len > DEFAULT_MAP_WINDOW)
>> +        return -ENOMEM;
>> +
>> +    /* Look for unmap area within DEFAULT_MAP_WINDOW */
>> +    return 0;
>> +}
>>
>
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 19:15       ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-06 19:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> Hi Kirill,
>
> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> Not all user space is ready to handle wide addresses. It's known that
>> at least some JIT compilers use higher bits in pointers to encode their
>> information. It collides with valid pointers with 5-level paging and
>> leads to crashes.
>>
>> To mitigate this, we are not going to allocate virtual address space
>> above 47-bit by default.
>>
>> But userspace can ask for allocation from full address space by
>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>
>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>> to look for unmapped area by specified address. If it's already
>> occupied, we look for unmapped area in *full* address space, rather than
>> from 47-bit window.
>
> Do you wish after the first over-47-bit mapping the following mmap()
> calls return also over-47-bits if there is free space?
> It so, you could simplify all this code by changing only mm->mmap_base
> on the first over-47-bit mmap() call.
> This will do simple trick.
>
>>
>> This approach helps to easily make application's memory allocator aware
>> about large address space without manually tracking allocated virtual
>> address space.
>>
>> One important case we need to handle here is interaction with MPX.
>> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
>> need to make sure that MPX cannot be enabled we already have VMA above
>> the boundary and forbid creating such VMAs once MPX is enabled.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
>> ---
>>  arch/x86/include/asm/elf.h       |  2 +-
>>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>>  arch/x86/include/asm/processor.h |  9 ++++++---
>>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>>  arch/x86/mm/mmap.c               |  2 +-
>>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>>  7 files changed, 100 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
>> index d4d3ed456cb7..67260dbe1688 100644
>> --- a/arch/x86/include/asm/elf.h
>> +++ b/arch/x86/include/asm/elf.h
>> @@ -250,7 +250,7 @@ extern int force_personality32;
>>     the loader.  We need to make sure that it is out of the way of the
>> program
>>     that it will "exec", and that there is sufficient room for the
>> brk.  */
>>
>> -#define ELF_ET_DYN_BASE        (TASK_SIZE / 3 * 2)
>> +#define ELF_ET_DYN_BASE        (DEFAULT_MAP_WINDOW / 3 * 2)
>
> This will kill 32-bit userspace:
> As DEFAULT_MAP_WINDOW is defined as what previously was TASK_SIZE_MAX,
> not TASK_SIZE, for ia32/x32 ELF_ET_DYN_BASE will be over 4Gb.
>
> Here is the test:
> [root@localhost test]# cat hello-world.c
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     printf("Maybe this world is another planet's hell.\n");
>         return 0;
> }
> [root@localhost test]# gcc -m32 hello-world.c -o hello-world
> [root@localhost test]# ./hello-world
> [   35.306726] hello-world[1948]: segfault at ffa5288c ip
> 00000000f77b5a82 sp 00000000ffa52890 error 6 in ld-2.23.so[f77b5000+23000]
> Segmentation fault (core dumped)
>
> So, dynamic base should differ between 32/64-bits as it was with TASK_SIZE.

I just tried to define it like this:
-#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
+#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
+                               IA32_PAGE_OFFSET : ((1UL << 47) - 
PAGE_SIZE))

And it looks working better.

>
>
>>
>>  /* This yields a mask that user programs can use to figure out what
>>     instruction set this CPU supports.  This could be done in user space,
>> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
>> index a0d662be4c5b..7d7404756bb4 100644
>> --- a/arch/x86/include/asm/mpx.h
>> +++ b/arch/x86/include/asm/mpx.h
>> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>>  }
>>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>>                unsigned long start, unsigned long end);
>> +
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags);
>>  #else
>>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>>  {
>> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct
>> mm_struct *mm,
>>                      unsigned long start, unsigned long end)
>>  {
>>  }
>> +
>> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
>> +        unsigned long len, unsigned long flags)
>> +{
>> +    return addr;
>> +}
>>  #endif /* CONFIG_X86_INTEL_MPX */
>>
>>  #endif /* _ASM_X86_MPX_H */
>> diff --git a/arch/x86/include/asm/processor.h
>> b/arch/x86/include/asm/processor.h
>> index 3cada998a402..9f437aea7f57 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define IA32_PAGE_OFFSET    PAGE_OFFSET
>>  #define TASK_SIZE        PAGE_OFFSET
>>  #define TASK_SIZE_MAX        TASK_SIZE
>> +#define DEFAULT_MAP_WINDOW    TASK_SIZE
>>  #define STACK_TOP        TASK_SIZE
>>  #define STACK_TOP_MAX        STACK_TOP
>>
>> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>>   * particular problem by preventing anything from being mapped
>>   * at the maximum canonical address.
>>   */
>> -#define TASK_SIZE_MAX    ((1UL << 47) - PAGE_SIZE)
>> +#define TASK_SIZE_MAX    ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>> +
>> +#define DEFAULT_MAP_WINDOW    ((1UL << 47) - PAGE_SIZE)
>>
>>  /* This decides where the kernel will search for a free chunk of vm
>>   * space during mmap's.
>> @@ -847,7 +850,7 @@ static inline void spin_lock_prefetch(const void *x)
>>  #define TASK_SIZE_OF(child)    ((test_tsk_thread_flag(child,
>> TIF_ADDR32)) ? \
>>                      IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>>
>> -#define STACK_TOP        TASK_SIZE
>> +#define STACK_TOP        DEFAULT_MAP_WINDOW
>>  #define STACK_TOP_MAX        TASK_SIZE_MAX
>>
>>  #define INIT_THREAD  {                        \
>> @@ -870,7 +873,7 @@ extern void start_thread(struct pt_regs *regs,
>> unsigned long new_ip,
>>   * space during mmap's.
>>   */
>>  #define __TASK_UNMAPPED_BASE(task_size)    (PAGE_ALIGN(task_size / 3))
>> -#define TASK_UNMAPPED_BASE        __TASK_UNMAPPED_BASE(TASK_SIZE)
>> +#define TASK_UNMAPPED_BASE
>> __TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
> ditto
>
>>
>>  #define KSTK_EIP(task)        (task_pt_regs(task)->ip)
>>
>> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
>> index 207b8f2582c7..593a31e93812 100644
>> --- a/arch/x86/kernel/sys_x86_64.c
>> +++ b/arch/x86/kernel/sys_x86_64.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/compat.h>
>>  #include <asm/ia32.h>
>>  #include <asm/syscalls.h>
>> +#include <asm/mpx.h>
>>
>>  /*
>>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
>> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      struct vm_unmapped_area_info info;
>>      unsigned long begin, end;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (flags & MAP_FIXED)
>>          return addr;
>>
>> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp,
>> unsigned long addr,
>>      info.flags = 0;
>>      info.length = len;
>>      info.low_limit = begin;
>> -    info.high_limit = end;
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = min(end, TASK_SIZE);
>> +    else
>> +        info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      unsigned long addr = addr0;
>>      struct vm_unmapped_area_info info;
>>
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      /* requested length too big for entire address space */
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp,
>> const unsigned long addr0,
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> is set, which can do 64-bit syscalls - the subtraction will be
> a negative..
>
>
>> +
>>      info.align_mask = 0;
>>      info.align_offset = pgoff << PAGE_SHIFT;
>>      if (filp) {
>> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
>> index 302f43fd9c28..9a0b89252c52 100644
>> --- a/arch/x86/mm/hugetlbpage.c
>> +++ b/arch/x86/mm/hugetlbpage.c
>> @@ -18,6 +18,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/elf.h>
>> +#include <asm/mpx.h>
>>
>>  #if 0    /* This is just for testing */
>>  struct page *
>> @@ -87,23 +88,38 @@ static unsigned long
>> hugetlb_get_unmapped_area_bottomup(struct file *file,
>>      info.low_limit = get_mmap_base(1);
>>      info.high_limit = in_compat_syscall() ?
>>          tasksize_32bit() : tasksize_64bit();
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW)
>> +        info.high_limit = TASK_SIZE;
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      return vm_unmapped_area(&info);
>>  }
>>
>>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file
>> *file,
>> -        unsigned long addr0, unsigned long len,
>> +        unsigned long addr, unsigned long len,
>>          unsigned long pgoff, unsigned long flags)
>>  {
>>      struct hstate *h = hstate_file(file);
>>      struct vm_unmapped_area_info info;
>> -    unsigned long addr;
>>
>>      info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>      info.length = len;
>>      info.low_limit = PAGE_SIZE;
>>      info.high_limit = get_mmap_base(0);
>> +
>> +    /*
>> +     * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped
>> area
>> +     * in the full address space.
>> +     */
>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>
> ditto
>
>> +
>>      info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>>      info.align_offset = 0;
>>      addr = vm_unmapped_area(&info);
>> @@ -118,7 +134,7 @@ static unsigned long
>> hugetlb_get_unmapped_area_topdown(struct file *file,
>>          VM_BUG_ON(addr != -ENOMEM);
>>          info.flags = 0;
>>          info.low_limit = TASK_UNMAPPED_BASE;
>> -        info.high_limit = TASK_SIZE;
>> +        info.high_limit = DEFAULT_MAP_WINDOW;
>
> ditto about 32-bits
>
>>          addr = vm_unmapped_area(&info);
>>      }
>>
>> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file,
>> unsigned long addr,
>>
>>      if (len & ~huge_page_mask(h))
>>          return -EINVAL;
>> +
>> +    addr = mpx_unmapped_area_check(addr, len, flags);
>> +    if (IS_ERR_VALUE(addr))
>> +        return addr;
>> +
>>      if (len > TASK_SIZE)
>>          return -ENOMEM;
>>
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 19ad095b41df..d63232a31945 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>>
>>  unsigned long tasksize_64bit(void)
>>  {
>> -    return TASK_SIZE_MAX;
>> +    return DEFAULT_MAP_WINDOW;
>>  }
>>
>>  static unsigned long stack_maxrandom_size(unsigned long task_size)
>> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
>> index cd44ae727df7..a26a1b373fd0 100644
>> --- a/arch/x86/mm/mpx.c
>> +++ b/arch/x86/mm/mpx.c
>> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>>       */
>>      bd_base = mpx_get_bounds_dir();
>>      down_write(&mm->mmap_sem);
>> +
>> +    /* MPX doesn't support addresses above 47-bits yet. */
>> +    if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
>> +        pr_warn_once("%s (%d): MPX cannot handle addresses "
>> +                "above 47-bits. Disabling.",
>> +                current->comm, current->pid);
>> +        ret = -ENXIO;
>> +        goto out;
>> +    }
>>      mm->context.bd_addr = bd_base;
>>      if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>>          ret = -ENXIO;
>> -
>> +out:
>>      up_write(&mm->mmap_sem);
>>      return ret;
>>  }
>> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>>      if (ret)
>>          force_sig(SIGSEGV, current);
>>  }
>> +
>> +/* MPX cannot handle addresses above 47-bits yet. */
>> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned
>> long len,
>> +        unsigned long flags)
>> +{
>> +    if (!kernel_managing_mpx_tables(current->mm))
>> +        return addr;
>> +    if (addr + len <= DEFAULT_MAP_WINDOW)
>> +        return addr;
>> +    if (flags & MAP_FIXED)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Requested len is larger than whole area we're allowed to map in.
>> +     * Resetting hinting address wouldn't do much good -- fail early.
>> +     */
>> +    if (len > DEFAULT_MAP_WINDOW)
>> +        return -ENOMEM;
>> +
>> +    /* Look for unmap area within DEFAULT_MAP_WINDOW */
>> +    return 0;
>> +}
>>
>
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 19:15       ` Dmitry Safonov
@ 2017-04-06 23:21         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:21 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> > Hi Kirill,
> > 
> > On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> > > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > > Not all user space is ready to handle wide addresses. It's known that
> > > at least some JIT compilers use higher bits in pointers to encode their
> > > information. It collides with valid pointers with 5-level paging and
> > > leads to crashes.
> > > 
> > > To mitigate this, we are not going to allocate virtual address space
> > > above 47-bit by default.
> > > 
> > > But userspace can ask for allocation from full address space by
> > > specifying hint address (with or without MAP_FIXED) above 47-bits.
> > > 
> > > If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > > to look for unmapped area by specified address. If it's already
> > > occupied, we look for unmapped area in *full* address space, rather than
> > > from 47-bit window.
> > 
> > Do you wish after the first over-47-bit mapping the following mmap()
> > calls return also over-47-bits if there is free space?
> > It so, you could simplify all this code by changing only mm->mmap_base
> > on the first over-47-bit mmap() call.
> > This will do simple trick.

No.

I want every allocation to explicitely opt-in large address space. It's
additional fail-safe: if a library can't handle large addresses it has
better chance to survive if its own allocation will stay within 47-bits.

> I just tried to define it like this:
> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
> PAGE_SIZE))
> 
> And it looks working better.

Okay, thanks. I'll send v2.

> > > +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> > > +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> > 
> > Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> > That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> > is set, which can do 64-bit syscalls - the subtraction will be
> > a negative..

With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
okay, right?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 23:21         ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:21 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
> > Hi Kirill,
> > 
> > On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
> > > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > > Not all user space is ready to handle wide addresses. It's known that
> > > at least some JIT compilers use higher bits in pointers to encode their
> > > information. It collides with valid pointers with 5-level paging and
> > > leads to crashes.
> > > 
> > > To mitigate this, we are not going to allocate virtual address space
> > > above 47-bit by default.
> > > 
> > > But userspace can ask for allocation from full address space by
> > > specifying hint address (with or without MAP_FIXED) above 47-bits.
> > > 
> > > If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> > > to look for unmapped area by specified address. If it's already
> > > occupied, we look for unmapped area in *full* address space, rather than
> > > from 47-bit window.
> > 
> > Do you wish after the first over-47-bit mapping the following mmap()
> > calls return also over-47-bits if there is free space?
> > It so, you could simplify all this code by changing only mm->mmap_base
> > on the first over-47-bit mmap() call.
> > This will do simple trick.

No.

I want every allocation to explicitely opt-in large address space. It's
additional fail-safe: if a library can't handle large addresses it has
better chance to survive if its own allocation will stay within 47-bits.

> I just tried to define it like this:
> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
> PAGE_SIZE))
> 
> And it looks working better.

Okay, thanks. I'll send v2.

> > > +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> > > +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> > 
> > Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
> > That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
> > is set, which can do 64-bit syscalls - the subtraction will be
> > a negative..

With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
okay, right?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:21         ` Kirill A. Shutemov
@ 2017-04-06 23:24           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:24 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 10 +++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..a98395e89ac6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
+				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-06 23:24           ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-06 23:24 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 arch/x86/include/asm/elf.h       |  2 +-
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 10 +++++++---
 arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
 arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
 arch/x86/mm/mmap.c               |  2 +-
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..67260dbe1688 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..a98395e89ac6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
+				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..593a31e93812 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
@@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = begin;
-	info.high_limit = end;
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = min(end, TASK_SIZE);
+	else
+		info.high_limit = min(end, DEFAULT_MAP_WINDOW);
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
@@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..9a0b89252c52 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.low_limit = get_mmap_base(1);
 	info.high_limit = in_compat_syscall() ?
 		tasksize_32bit() : tasksize_64bit();
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW)
+		info.high_limit = TASK_SIZE;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = DEFAULT_MAP_WINDOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..d63232a31945 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
 
 unsigned long tasksize_64bit(void)
 {
-	return TASK_SIZE_MAX;
+	return DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:21         ` Kirill A. Shutemov
  (?)
@ 2017-04-07 10:06           ` Dmitry Safonov
  -1 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 10:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:21 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
>> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
>>> Hi Kirill,
>>>
>>> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use higher bits in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> To mitigate this, we are not going to allocate virtual address space
>>>> above 47-bit by default.
>>>>
>>>> But userspace can ask for allocation from full address space by
>>>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>>>
>>>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>>>> to look for unmapped area by specified address. If it's already
>>>> occupied, we look for unmapped area in *full* address space, rather than
>>>> from 47-bit window.
>>>
>>> Do you wish after the first over-47-bit mapping the following mmap()
>>> calls return also over-47-bits if there is free space?
>>> It so, you could simplify all this code by changing only mm->mmap_base
>>> on the first over-47-bit mmap() call.
>>> This will do simple trick.
>
> No.
>
> I want every allocation to explicitely opt-in large address space. It's
> additional fail-safe: if a library can't handle large addresses it has
> better chance to survive if its own allocation will stay within 47-bits.

Ok

>
>> I just tried to define it like this:
>> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
>> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
>> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
>> PAGE_SIZE))
>>
>> And it looks working better.
>
> Okay, thanks. I'll send v2.
>
>>>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>>>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>>>
>>> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
>>> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
>>> is set, which can do 64-bit syscalls - the subtraction will be
>>> a negative..
>
> With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
> okay, right?

I'll comment to v2 to keep all in one place.


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 10:06           ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 10:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:21 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
>> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
>>> Hi Kirill,
>>>
>>> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use higher bits in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> To mitigate this, we are not going to allocate virtual address space
>>>> above 47-bit by default.
>>>>
>>>> But userspace can ask for allocation from full address space by
>>>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>>>
>>>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>>>> to look for unmapped area by specified address. If it's already
>>>> occupied, we look for unmapped area in *full* address space, rather than
>>>> from 47-bit window.
>>>
>>> Do you wish after the first over-47-bit mapping the following mmap()
>>> calls return also over-47-bits if there is free space?
>>> It so, you could simplify all this code by changing only mm->mmap_base
>>> on the first over-47-bit mmap() call.
>>> This will do simple trick.
>
> No.
>
> I want every allocation to explicitely opt-in large address space. It's
> additional fail-safe: if a library can't handle large addresses it has
> better chance to survive if its own allocation will stay within 47-bits.

Ok

>
>> I just tried to define it like this:
>> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
>> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
>> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
>> PAGE_SIZE))
>>
>> And it looks working better.
>
> Okay, thanks. I'll send v2.
>
>>>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>>>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>>>
>>> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
>>> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
>>> is set, which can do 64-bit syscalls - the subtraction will be
>>> a negative..
>
> With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
> okay, right?

I'll comment to v2 to keep all in one place.


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 10:06           ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 10:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:21 AM, Kirill A. Shutemov wrote:
> On Thu, Apr 06, 2017 at 10:15:47PM +0300, Dmitry Safonov wrote:
>> On 04/06/2017 09:43 PM, Dmitry Safonov wrote:
>>> Hi Kirill,
>>>
>>> On 04/06/2017 05:01 PM, Kirill A. Shutemov wrote:
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use higher bits in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> To mitigate this, we are not going to allocate virtual address space
>>>> above 47-bit by default.
>>>>
>>>> But userspace can ask for allocation from full address space by
>>>> specifying hint address (with or without MAP_FIXED) above 47-bits.
>>>>
>>>> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
>>>> to look for unmapped area by specified address. If it's already
>>>> occupied, we look for unmapped area in *full* address space, rather than
>>>> from 47-bit window.
>>>
>>> Do you wish after the first over-47-bit mapping the following mmap()
>>> calls return also over-47-bits if there is free space?
>>> It so, you could simplify all this code by changing only mm->mmap_base
>>> on the first over-47-bit mmap() call.
>>> This will do simple trick.
>
> No.
>
> I want every allocation to explicitely opt-in large address space. It's
> additional fail-safe: if a library can't handle large addresses it has
> better chance to survive if its own allocation will stay within 47-bits.

Ok

>
>> I just tried to define it like this:
>> -#define DEFAULT_MAP_WINDOW     ((1UL << 47) - PAGE_SIZE)
>> +#define DEFAULT_MAP_WINDOW     (test_thread_flag(TIF_ADDR32) ?         \
>> +                               IA32_PAGE_OFFSET : ((1UL << 47) -
>> PAGE_SIZE))
>>
>> And it looks working better.
>
> Okay, thanks. I'll send v2.
>
>>>> +    if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
>>>> +        info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
>>>
>>> Hmm, TASK_SIZE depends now on TIF_ADDR32, which is set during exec().
>>> That means for ia32/x32 ELF which has TASK_SIZE < 4Gb as TIF_ADDR32
>>> is set, which can do 64-bit syscalls - the subtraction will be
>>> a negative..
>
> With your proposed change to DEFAULT_MAP_WINDOW difinition it should be
> okay, right?

I'll comment to v2 to keep all in one place.


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 23:24           ` Kirill A. Shutemov
  (?)
@ 2017-04-07 11:32             ` Dmitry Safonov
  -1 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:24 AM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 10 +++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..a98395e89ac6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
> +				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

That fixes 32-bit, but we need to adjust some places, AFAICS, I'll
point them below.

>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);

That looks not working.
`end' is choosed between tasksize_32bit() and tasksize_64bit().
Which is ~4Gb or 47-bit. So, info.high_limit will never go
above DEFAULT_MAP_WINDOW with this min().

Can we move this logic into find_start_end()?

May it be something like:
if (in_compat_syscall())
   *end = tasksize_32bit();
else if (addr > task_size_64bit())
   *end = TASK_SIZE_MAX;
else
   *end = tasksize_64bit();

In my point of view, it could be even simpler if we add a parameter
to task_size_64bit():

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))

unsigned long task_size_64bit(int full_addr_space)
{
    return (full_addr_space) ? TASK_SIZE_MAX : TASK_SIZE_47BIT;
}

> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, looks like we do need in_compat_syscall() as you did
because x32 mmap() syscall has 8 byte parameter.
Maybe worth a comment.

Anyway, maybe something like that:
if (addr > tasksize_64bit() && !in_compat_syscall())
    info.high_limit += TASK_SIZE_MAX - tasksize_64bit();

This way it's more readable and clear because we don't
need to keep in mind that TIF_ADDR32 flag, while reading.


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;

My suggestion about new parameter is above, but at least
we need to omit depending on TIF_ADDR32 here and return
64-bit size independent of flag value:

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))
unsigned long task_size_64bit(void)
{
    return TASK_SIZE_47BIT;
}

Because for 32-bit ELFs it would be always 4Gb in your
case, while 32-bit ELFs can do 64-bit syscalls.

>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 11:32             ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:24 AM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 10 +++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..a98395e89ac6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
> +				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

That fixes 32-bit, but we need to adjust some places, AFAICS, I'll
point them below.

>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);

That looks not working.
`end' is choosed between tasksize_32bit() and tasksize_64bit().
Which is ~4Gb or 47-bit. So, info.high_limit will never go
above DEFAULT_MAP_WINDOW with this min().

Can we move this logic into find_start_end()?

May it be something like:
if (in_compat_syscall())
   *end = tasksize_32bit();
else if (addr > task_size_64bit())
   *end = TASK_SIZE_MAX;
else
   *end = tasksize_64bit();

In my point of view, it could be even simpler if we add a parameter
to task_size_64bit():

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))

unsigned long task_size_64bit(int full_addr_space)
{
    return (full_addr_space) ? TASK_SIZE_MAX : TASK_SIZE_47BIT;
}

> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, looks like we do need in_compat_syscall() as you did
because x32 mmap() syscall has 8 byte parameter.
Maybe worth a comment.

Anyway, maybe something like that:
if (addr > tasksize_64bit() && !in_compat_syscall())
    info.high_limit += TASK_SIZE_MAX - tasksize_64bit();

This way it's more readable and clear because we don't
need to keep in mind that TIF_ADDR32 flag, while reading.


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;

My suggestion about new parameter is above, but at least
we need to omit depending on TIF_ADDR32 here and return
64-bit size independent of flag value:

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))
unsigned long task_size_64bit(void)
{
    return TASK_SIZE_47BIT;
}

Because for 32-bit ELFs it would be always 4Gb in your
case, while 32-bit ELFs can do 64-bit syscalls.

>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv2 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 11:32             ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel

On 04/07/2017 02:24 AM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
> ---
>  arch/x86/include/asm/elf.h       |  2 +-
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 10 +++++++---
>  arch/x86/kernel/sys_x86_64.c     | 28 +++++++++++++++++++++++++++-
>  arch/x86/mm/hugetlbpage.c        | 27 ++++++++++++++++++++++++---
>  arch/x86/mm/mmap.c               |  2 +-
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..67260dbe1688 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(DEFAULT_MAP_WINDOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..a98395e89ac6 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,10 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	(test_thread_flag(TIF_ADDR32) ? \
> +				IA32_PAGE_OFFSET : ((1UL << 47) - PAGE_SIZE))

That fixes 32-bit, but we need to adjust some places, AFAICS, I'll
point them below.

>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -847,7 +851,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		DEFAULT_MAP_WINDOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +874,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(DEFAULT_MAP_WINDOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..593a31e93812 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -132,6 +133,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> @@ -151,7 +156,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = begin;
> -	info.high_limit = end;
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = min(end, TASK_SIZE);
> +	else
> +		info.high_limit = min(end, DEFAULT_MAP_WINDOW);

That looks not working.
`end' is choosed between tasksize_32bit() and tasksize_64bit().
Which is ~4Gb or 47-bit. So, info.high_limit will never go
above DEFAULT_MAP_WINDOW with this min().

Can we move this logic into find_start_end()?

May it be something like:
if (in_compat_syscall())
   *end = tasksize_32bit();
else if (addr > task_size_64bit())
   *end = TASK_SIZE_MAX;
else
   *end = tasksize_64bit();

In my point of view, it could be even simpler if we add a parameter
to task_size_64bit():

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))

unsigned long task_size_64bit(int full_addr_space)
{
    return (full_addr_space) ? TASK_SIZE_MAX : TASK_SIZE_47BIT;
}

> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> @@ -171,6 +185,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +213,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;

Hmm, looks like we do need in_compat_syscall() as you did
because x32 mmap() syscall has 8 byte parameter.
Maybe worth a comment.

Anyway, maybe something like that:
if (addr > tasksize_64bit() && !in_compat_syscall())
    info.high_limit += TASK_SIZE_MAX - tasksize_64bit();

This way it's more readable and clear because we don't
need to keep in mind that TIF_ADDR32 flag, while reading.


> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..9a0b89252c52 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -87,23 +88,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.low_limit = get_mmap_base(1);
>  	info.high_limit = in_compat_syscall() ?
>  		tasksize_32bit() : tasksize_64bit();
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW)
> +		info.high_limit = TASK_SIZE;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +134,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = DEFAULT_MAP_WINDOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +151,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..d63232a31945 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -44,7 +44,7 @@ unsigned long tasksize_32bit(void)
>
>  unsigned long tasksize_64bit(void)
>  {
> -	return TASK_SIZE_MAX;
> +	return DEFAULT_MAP_WINDOW;

My suggestion about new parameter is above, but at least
we need to omit depending on TIF_ADDR32 here and return
64-bit size independent of flag value:

#define TASK_SIZE_47BIT ((1UL << 47) - PAGE_SIZE))
unsigned long task_size_64bit(void)
{
    return TASK_SIZE_47BIT;
}

Because for 32-bit ELFs it would be always 4Gb in your
case, while 32-bit ELFs can do 64-bit syscalls.

>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-06 14:01   ` Kirill A. Shutemov
@ 2017-04-07 13:35     ` Anshuman Khandual
  -1 siblings, 0 replies; 101+ messages in thread
From: Anshuman Khandual @ 2017-04-07 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov

On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
> 
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.

I am wondering if the commitment of virtual space range to the
user space is kind of an API which needs to be maintained there
after. If that is the case then we need to have some plans when
increasing it from the current level.

Will those JIT compilers keep using the higher bit positions of
the pointer for ever ? Then it will limit the ability of the
kernel to expand the virtual address range later as well. I am
not saying we should not increase till the extent it does not
affect any *known* user but then we should not increase twice
for now, create the hint mechanism to be passed from the user
to avail beyond that (which will settle in as a expectation
from the kernel later on). Do the same thing again while
expanding the address range next time around. I think we need
to have a plan for this and particularly around 'hint' mechanism
and whether it should be decided per mmap() request or at the
task level.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 13:35     ` Anshuman Khandual
  0 siblings, 0 replies; 101+ messages in thread
From: Anshuman Khandual @ 2017-04-07 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov

On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
> 
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.

I am wondering if the commitment of virtual space range to the
user space is kind of an API which needs to be maintained there
after. If that is the case then we need to have some plans when
increasing it from the current level.

Will those JIT compilers keep using the higher bit positions of
the pointer for ever ? Then it will limit the ability of the
kernel to expand the virtual address range later as well. I am
not saying we should not increase till the extent it does not
affect any *known* user but then we should not increase twice
for now, create the hint mechanism to be passed from the user
to avail beyond that (which will settle in as a expectation
from the kernel later on). Do the same thing again while
expanding the address range next time around. I think we need
to have a plan for this and particularly around 'hint' mechanism
and whether it should be decided per mmap() request or at the
task level.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 11:32             ` Dmitry Safonov
@ 2017-04-07 15:44               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:44 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 v3:
   - Address Dmitry feedback;
   - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
     instead, which would task TIF_ADDR32 into account.
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..2501ef7970f9 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 15:44               ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:44 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Dmitry Safonov

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
---
 v3:
   - Address Dmitry feedback;
   - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
     instead, which would task TIF_ADDR32 into account.
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index d4d3ed456cb7..2501ef7970f9 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index cd44ae727df7..a26a1b373fd0 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 13:35     ` Anshuman Khandual
@ 2017-04-07 15:59       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:59 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use higher bits in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> > 
> > To mitigate this, we are not going to allocate virtual address space
> > above 47-bit by default.
> 
> I am wondering if the commitment of virtual space range to the
> user space is kind of an API which needs to be maintained there
> after. If that is the case then we need to have some plans when
> increasing it from the current level.

I don't think we should ever enable full address space for all
applications. There's no point.

/bin/true doesn't need more than 64TB of virtual memory.
And I hope never will.

By increasing virtual address space for everybody we will pay (assuming
current page table format) at least one extra page per process for moving
stack at very end of address space.

Yes, you can gain something in security by having more bits for ASLR, but
I don't think it worth the cost.

> Will those JIT compilers keep using the higher bit positions of
> the pointer for ever ? Then it will limit the ability of the
> kernel to expand the virtual address range later as well. I am
> not saying we should not increase till the extent it does not
> affect any *known* user but then we should not increase twice
> for now, create the hint mechanism to be passed from the user
> to avail beyond that (which will settle in as a expectation
> from the kernel later on). Do the same thing again while
> expanding the address range next time around. I think we need
> to have a plan for this and particularly around 'hint' mechanism
> and whether it should be decided per mmap() request or at the
> task level.

I think the reasonable way for an application to claim it's 63-bit clean
is to make allocations with (void *)-1 as hint address.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 15:59       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 15:59 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use higher bits in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> > 
> > To mitigate this, we are not going to allocate virtual address space
> > above 47-bit by default.
> 
> I am wondering if the commitment of virtual space range to the
> user space is kind of an API which needs to be maintained there
> after. If that is the case then we need to have some plans when
> increasing it from the current level.

I don't think we should ever enable full address space for all
applications. There's no point.

/bin/true doesn't need more than 64TB of virtual memory.
And I hope never will.

By increasing virtual address space for everybody we will pay (assuming
current page table format) at least one extra page per process for moving
stack at very end of address space.

Yes, you can gain something in security by having more bits for ASLR, but
I don't think it worth the cost.

> Will those JIT compilers keep using the higher bit positions of
> the pointer for ever ? Then it will limit the ability of the
> kernel to expand the virtual address range later as well. I am
> not saying we should not increase till the extent it does not
> affect any *known* user but then we should not increase twice
> for now, create the hint mechanism to be passed from the user
> to avail beyond that (which will settle in as a expectation
> from the kernel later on). Do the same thing again while
> expanding the address range next time around. I think we need
> to have a plan for this and particularly around 'hint' mechanism
> and whether it should be decided per mmap() request or at the
> task level.

I think the reasonable way for an application to claim it's 63-bit clean
is to make allocations with (void *)-1 as hint address.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:59       ` Kirill A. Shutemov
@ 2017-04-07 16:09         ` hpa
  -1 siblings, 0 replies; 101+ messages in thread
From: hpa @ 2017-04-07 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Andi Kleen, Dave Hansen,
	Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On April 7, 2017 8:59:45 AM PDT, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address
>space.
>> > Not all user space is ready to handle wide addresses. It's known
>that
>> > at least some JIT compilers use higher bits in pointers to encode
>their
>> > information. It collides with valid pointers with 5-level paging
>and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address
>space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
>I don't think we should ever enable full address space for all
>applications. There's no point.
>
>/bin/true doesn't need more than 64TB of virtual memory.
>And I hope never will.
>
>By increasing virtual address space for everybody we will pay (assuming
>current page table format) at least one extra page per process for
>moving
>stack at very end of address space.
>
>Yes, you can gain something in security by having more bits for ASLR,
>but
>I don't think it worth the cost.
>
>> Will those JIT compilers keep using the higher bit positions of
>> the pointer for ever ? Then it will limit the ability of the
>> kernel to expand the virtual address range later as well. I am
>> not saying we should not increase till the extent it does not
>> affect any *known* user but then we should not increase twice
>> for now, create the hint mechanism to be passed from the user
>> to avail beyond that (which will settle in as a expectation
>> from the kernel later on). Do the same thing again while
>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
>I think the reasonable way for an application to claim it's 63-bit
>clean
>is to make allocations with (void *)-1 as hint address.

You realize that people have said that about just about every memory threshold from 64K onward?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 16:09         ` hpa
  0 siblings, 0 replies; 101+ messages in thread
From: hpa @ 2017-04-07 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Andi Kleen, Dave Hansen,
	Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On April 7, 2017 8:59:45 AM PDT, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address
>space.
>> > Not all user space is ready to handle wide addresses. It's known
>that
>> > at least some JIT compilers use higher bits in pointers to encode
>their
>> > information. It collides with valid pointers with 5-level paging
>and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address
>space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
>I don't think we should ever enable full address space for all
>applications. There's no point.
>
>/bin/true doesn't need more than 64TB of virtual memory.
>And I hope never will.
>
>By increasing virtual address space for everybody we will pay (assuming
>current page table format) at least one extra page per process for
>moving
>stack at very end of address space.
>
>Yes, you can gain something in security by having more bits for ASLR,
>but
>I don't think it worth the cost.
>
>> Will those JIT compilers keep using the higher bit positions of
>> the pointer for ever ? Then it will limit the ability of the
>> kernel to expand the virtual address range later as well. I am
>> not saying we should not increase till the extent it does not
>> affect any *known* user but then we should not increase twice
>> for now, create the hint mechanism to be passed from the user
>> to avail beyond that (which will settle in as a expectation
>> from the kernel later on). Do the same thing again while
>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
>I think the reasonable way for an application to claim it's 63-bit
>clean
>is to make allocations with (void *)-1 as hint address.

You realize that people have said that about just about every memory threshold from 64K onward?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 16:09         ` hpa
@ 2017-04-07 16:20           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 16:20 UTC (permalink / raw)
  To: hpa
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 09:09:27AM -0700, hpa@zytor.com wrote:
> >I think the reasonable way for an application to claim it's 63-bit
> >clean
> >is to make allocations with (void *)-1 as hint address.
> 
> You realize that people have said that about just about every memory

Any better solution?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 16:20           ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-07 16:20 UTC (permalink / raw)
  To: hpa
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov

On Fri, Apr 07, 2017 at 09:09:27AM -0700, hpa@zytor.com wrote:
> >I think the reasonable way for an application to claim it's 63-bit
> >clean
> >is to make allocations with (void *)-1 as hint address.
> 
> You realize that people have said that about just about every memory

Any better solution?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:44               ` Kirill A. Shutemov
  (?)
@ 2017-04-07 16:37                 ` Dmitry Safonov
  -1 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 04/07/2017 06:44 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>

LGTM,
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>

Thou, I'm not very excited about TASK_SIZE_LOW naming, but I'm not good
at naming either, so maybe tglx will help.
Anyway, I don't see any problems with code's logic now.
I've run it through CRIU ia32 tests, where there is
32/64-bit mmap(), 64-bit mmap() from 32-bit binary, the same with
MAP_32BIT and some other not very pleasant corner-cases.
That doesn't prove that mmap() works in *all* possible cases, thou.

P.S.:
JFYI: there is a rule to send new patch versions in a new thread -
otherwise the patch can lose maintainers attention. So, they may ask
you to resend it.

> ---
>  v3:
>    - Address Dmitry feedback;
>    - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
>      instead, which would task TIF_ADDR32 into account.
> ---
>  arch/x86/include/asm/elf.h       |  4 ++--
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 11 ++++++++---
>  arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
>  arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
>  arch/x86/mm/mmap.c               |  6 +++---
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..2501ef7970f9 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> @@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
>  }
>
>  extern unsigned long tasksize_32bit(void);
> -extern unsigned long tasksize_64bit(void);
> +extern unsigned long tasksize_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..aaed58b03ddb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
>  					0xc0000000 : 0xFFFFe000)
>
> +#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
> +					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
>  #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		TASK_SIZE_LOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..74d1587b181d 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
>  	return error;
>  }
>
> -static void find_start_end(unsigned long flags, unsigned long *begin,
> -			   unsigned long *end)
> +static void find_start_end(unsigned long addr, unsigned long flags,
> +		unsigned long *begin, unsigned long *end)
>  {
>  	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
>  		/* This is usually used needed to map code in small
> @@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
>  	}
>
>  	*begin	= get_mmap_base(1);
> -	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
> +	if (in_compat_syscall())
> +		*end = tasksize_32bit();
> +	else
> +		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
>  }
>
>  unsigned long
> @@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> -	find_start_end(flags, &begin, &end);
> +	find_start_end(addr, flags, &begin, &end);
>
>  	if (len > end)
>  		return -ENOMEM;
> @@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 *
> +	 * !in_compat_syscall() check to avoid high addresses for x32.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..730f00250acb 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = get_mmap_base(1);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
>  	info.high_limit = in_compat_syscall() ?
> -		tasksize_32bit() : tasksize_64bit();
> +		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = TASK_SIZE_LOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..199050249d60 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
>  	return IA32_PAGE_OFFSET;
>  }
>
> -unsigned long tasksize_64bit(void)
> +unsigned long tasksize_64bit(int full_addr_space)
>  {
> -	return TASK_SIZE_MAX;
> +	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> @@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
>
>  	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
> -			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
> +			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
>
>  #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
>  	/*
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 16:37                 ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 04/07/2017 06:44 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>

LGTM,
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>

Thou, I'm not very excited about TASK_SIZE_LOW naming, but I'm not good
at naming either, so maybe tglx will help.
Anyway, I don't see any problems with code's logic now.
I've run it through CRIU ia32 tests, where there is
32/64-bit mmap(), 64-bit mmap() from 32-bit binary, the same with
MAP_32BIT and some other not very pleasant corner-cases.
That doesn't prove that mmap() works in *all* possible cases, thou.

P.S.:
JFYI: there is a rule to send new patch versions in a new thread -
otherwise the patch can lose maintainers attention. So, they may ask
you to resend it.

> ---
>  v3:
>    - Address Dmitry feedback;
>    - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
>      instead, which would task TIF_ADDR32 into account.
> ---
>  arch/x86/include/asm/elf.h       |  4 ++--
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 11 ++++++++---
>  arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
>  arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
>  arch/x86/mm/mmap.c               |  6 +++---
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..2501ef7970f9 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> @@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
>  }
>
>  extern unsigned long tasksize_32bit(void);
> -extern unsigned long tasksize_64bit(void);
> +extern unsigned long tasksize_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..aaed58b03ddb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
>  					0xc0000000 : 0xFFFFe000)
>
> +#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
> +					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
>  #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		TASK_SIZE_LOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..74d1587b181d 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
>  	return error;
>  }
>
> -static void find_start_end(unsigned long flags, unsigned long *begin,
> -			   unsigned long *end)
> +static void find_start_end(unsigned long addr, unsigned long flags,
> +		unsigned long *begin, unsigned long *end)
>  {
>  	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
>  		/* This is usually used needed to map code in small
> @@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
>  	}
>
>  	*begin	= get_mmap_base(1);
> -	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
> +	if (in_compat_syscall())
> +		*end = tasksize_32bit();
> +	else
> +		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
>  }
>
>  unsigned long
> @@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> -	find_start_end(flags, &begin, &end);
> +	find_start_end(addr, flags, &begin, &end);
>
>  	if (len > end)
>  		return -ENOMEM;
> @@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 *
> +	 * !in_compat_syscall() check to avoid high addresses for x32.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..730f00250acb 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = get_mmap_base(1);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
>  	info.high_limit = in_compat_syscall() ?
> -		tasksize_32bit() : tasksize_64bit();
> +		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = TASK_SIZE_LOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..199050249d60 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
>  	return IA32_PAGE_OFFSET;
>  }
>
> -unsigned long tasksize_64bit(void)
> +unsigned long tasksize_64bit(int full_addr_space)
>  {
> -	return TASK_SIZE_MAX;
> +	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> @@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
>
>  	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
> -			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
> +			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
>
>  #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
>  	/*
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCHv3 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-07 16:37                 ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2017-04-07 16:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 04/07/2017 06:44 PM, Kirill A. Shutemov wrote:
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use higher bits in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> To mitigate this, we are not going to allocate virtual address space
> above 47-bit by default.
>
> But userspace can ask for allocation from full address space by
> specifying hint address (with or without MAP_FIXED) above 47-bits.
>
> If hint address set above 47-bit, but MAP_FIXED is not specified, we try
> to look for unmapped area by specified address. If it's already
> occupied, we look for unmapped area in *full* address space, rather than
> from 47-bit window.
>
> This approach helps to easily make application's memory allocator aware
> about large address space without manually tracking allocated virtual
> address space.
>
> One important case we need to handle here is interaction with MPX.
> MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
> need to make sure that MPX cannot be enabled we already have VMA above
> the boundary and forbid creating such VMAs once MPX is enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dmitry Safonov <dsafonov@virtuozzo.com>

LGTM,
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>

Thou, I'm not very excited about TASK_SIZE_LOW naming, but I'm not good
at naming either, so maybe tglx will help.
Anyway, I don't see any problems with code's logic now.
I've run it through CRIU ia32 tests, where there is
32/64-bit mmap(), 64-bit mmap() from 32-bit binary, the same with
MAP_32BIT and some other not very pleasant corner-cases.
That doesn't prove that mmap() works in *all* possible cases, thou.

P.S.:
JFYI: there is a rule to send new patch versions in a new thread -
otherwise the patch can lose maintainers attention. So, they may ask
you to resend it.

> ---
>  v3:
>    - Address Dmitry feedback;
>    - Make DEFAULT_MAP_WINDOW constant again, introduce TASK_SIZE_LOW
>      instead, which would task TIF_ADDR32 into account.
> ---
>  arch/x86/include/asm/elf.h       |  4 ++--
>  arch/x86/include/asm/mpx.h       |  9 +++++++++
>  arch/x86/include/asm/processor.h | 11 ++++++++---
>  arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
>  arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
>  arch/x86/mm/mmap.c               |  6 +++---
>  arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
>  7 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index d4d3ed456cb7..2501ef7970f9 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -250,7 +250,7 @@ extern int force_personality32;
>     the loader.  We need to make sure that it is out of the way of the program
>     that it will "exec", and that there is sufficient room for the brk.  */
>
> -#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
> +#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
>
>  /* This yields a mask that user programs can use to figure out what
>     instruction set this CPU supports.  This could be done in user space,
> @@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
>  }
>
>  extern unsigned long tasksize_32bit(void);
> -extern unsigned long tasksize_64bit(void);
> +extern unsigned long tasksize_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
>
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
> index a0d662be4c5b..7d7404756bb4 100644
> --- a/arch/x86/include/asm/mpx.h
> +++ b/arch/x86/include/asm/mpx.h
> @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
>  }
>  void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long start, unsigned long end);
> +
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags);
>  #else
>  static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
>  {
> @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
>  				    unsigned long start, unsigned long end)
>  {
>  }
> +
> +static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
> +		unsigned long len, unsigned long flags)
> +{
> +	return addr;
> +}
>  #endif /* CONFIG_X86_INTEL_MPX */
>
>  #endif /* _ASM_X86_MPX_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 3cada998a402..aaed58b03ddb 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	PAGE_OFFSET
>  #define TASK_SIZE		PAGE_OFFSET
>  #define TASK_SIZE_MAX		TASK_SIZE
> +#define DEFAULT_MAP_WINDOW	TASK_SIZE
>  #define STACK_TOP		TASK_SIZE
>  #define STACK_TOP_MAX		STACK_TOP
>
> @@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
>   * particular problem by preventing anything from being mapped
>   * at the maximum canonical address.
>   */
> -#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
> +#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
> +
> +#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
>
>  /* This decides where the kernel will search for a free chunk of vm
>   * space during mmap's.
> @@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
>  #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
>  					0xc0000000 : 0xFFFFe000)
>
> +#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
> +					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
>  #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>  #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
>  					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
>
> -#define STACK_TOP		TASK_SIZE
> +#define STACK_TOP		TASK_SIZE_LOW
>  #define STACK_TOP_MAX		TASK_SIZE_MAX
>
>  #define INIT_THREAD  {						\
> @@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
>   * space during mmap's.
>   */
>  #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> -#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> +#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
>
>  #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
>
> diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
> index 207b8f2582c7..74d1587b181d 100644
> --- a/arch/x86/kernel/sys_x86_64.c
> +++ b/arch/x86/kernel/sys_x86_64.c
> @@ -21,6 +21,7 @@
>  #include <asm/compat.h>
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
> +#include <asm/mpx.h>
>
>  /*
>   * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
> @@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
>  	return error;
>  }
>
> -static void find_start_end(unsigned long flags, unsigned long *begin,
> -			   unsigned long *end)
> +static void find_start_end(unsigned long addr, unsigned long flags,
> +		unsigned long *begin, unsigned long *end)
>  {
>  	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
>  		/* This is usually used needed to map code in small
> @@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
>  	}
>
>  	*begin	= get_mmap_base(1);
> -	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
> +	if (in_compat_syscall())
> +		*end = tasksize_32bit();
> +	else
> +		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
>  }
>
>  unsigned long
> @@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	struct vm_unmapped_area_info info;
>  	unsigned long begin, end;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (flags & MAP_FIXED)
>  		return addr;
>
> -	find_start_end(flags, &begin, &end);
> +	find_start_end(addr, flags, &begin, &end);
>
>  	if (len > end)
>  		return -ENOMEM;
> @@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	unsigned long addr = addr0;
>  	struct vm_unmapped_area_info info;
>
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	/* requested length too big for entire address space */
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
> @@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 *
> +	 * !in_compat_syscall() check to avoid high addresses for x32.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	if (filp) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 302f43fd9c28..730f00250acb 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/pgalloc.h>
>  #include <asm/elf.h>
> +#include <asm/mpx.h>
>
>  #if 0	/* This is just for testing */
>  struct page *
> @@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = get_mmap_base(1);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
>  	info.high_limit = in_compat_syscall() ?
> -		tasksize_32bit() : tasksize_64bit();
> +		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	return vm_unmapped_area(&info);
>  }
>
>  static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
> -		unsigned long addr0, unsigned long len,
> +		unsigned long addr, unsigned long len,
>  		unsigned long pgoff, unsigned long flags)
>  {
>  	struct hstate *h = hstate_file(file);
>  	struct vm_unmapped_area_info info;
> -	unsigned long addr;
>
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = get_mmap_base(0);
> +
> +	/*
> +	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
> +	 * in the full address space.
> +	 */
> +	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
> +		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
> +
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>  	addr = vm_unmapped_area(&info);
> @@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
>  		VM_BUG_ON(addr != -ENOMEM);
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
> -		info.high_limit = TASK_SIZE;
> +		info.high_limit = TASK_SIZE_LOW;
>  		addr = vm_unmapped_area(&info);
>  	}
>
> @@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> +
> +	addr = mpx_unmapped_area_check(addr, len, flags);
> +	if (IS_ERR_VALUE(addr))
> +		return addr;
> +
>  	if (len > TASK_SIZE)
>  		return -ENOMEM;
>
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 19ad095b41df..199050249d60 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
>  	return IA32_PAGE_OFFSET;
>  }
>
> -unsigned long tasksize_64bit(void)
> +unsigned long tasksize_64bit(int full_addr_space)
>  {
> -	return TASK_SIZE_MAX;
> +	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
>  }
>
>  static unsigned long stack_maxrandom_size(unsigned long task_size)
> @@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
>
>  	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
> -			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
> +			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
>
>  #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
>  	/*
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae727df7..a26a1b373fd0 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -355,10 +355,19 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/* MPX doesn't support addresses above 47-bits yet. */
> +	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
> +		pr_warn_once("%s (%d): MPX cannot handle addresses "
> +				"above 47-bits. Disabling.",
> +				current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}
>  	mm->context.bd_addr = bd_base;
>  	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
>  		ret = -ENXIO;
> -
> +out:
>  	up_write(&mm->mmap_sem);
>  	return ret;
>  }
> @@ -1038,3 +1047,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ret)
>  		force_sig(SIGSEGV, current);
>  }
> +
> +/* MPX cannot handle addresses above 47-bits yet. */
> +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
> +		unsigned long flags)
> +{
> +	if (!kernel_managing_mpx_tables(current->mm))
> +		return addr;
> +	if (addr + len <= DEFAULT_MAP_WINDOW)
> +		return addr;
> +	if (flags & MAP_FIXED)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Requested len is larger than whole area we're allowed to map in.
> +	 * Resetting hinting address wouldn't do much good -- fail early.
> +	 */
> +	if (len > DEFAULT_MAP_WINDOW)
> +		return -ENOMEM;
> +
> +	/* Look for unmap area within DEFAULT_MAP_WINDOW */
> +	return 0;
> +}
>


-- 
              Dmitry

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-06 14:01   ` Kirill A. Shutemov
@ 2017-04-11  7:02     ` Ingo Molnar
  -1 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-11  7:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> This patch adds support for 5-level paging during early boot.
> It generalizes boot for 4- and 5-level paging on 64-bit systems with
> compile-time switch between them.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
>  arch/x86/include/asm/pgtable_64.h           |  2 ++
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
>  arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
>  5 files changed, 85 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index d2ae1f821e0c..3ed26769810b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -122,9 +122,12 @@ ENTRY(startup_32)
>  	addl	%ebp, gdt+2(%ebp)
>  	lgdt	gdt(%ebp)
>  
> -	/* Enable PAE mode */
> +	/* Enable PAE and LA57 mode */
>  	movl	%cr4, %eax
>  	orl	$X86_CR4_PAE, %eax
> +#ifdef CONFIG_X86_5LEVEL
> +	orl	$X86_CR4_LA57, %eax
> +#endif
>  	movl	%eax, %cr4
>  
>   /*
> @@ -136,13 +139,24 @@ ENTRY(startup_32)
>  	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
>  	rep	stosl
>  
> +	xorl	%edx, %edx
> +
> +	/* Build Top Level */
> +	leal	pgtable(%ebx,%edx,1), %edi
> +	leal	0x1007 (%edi), %eax
> +	movl	%eax, 0(%edi)
> +
> +#ifdef CONFIG_X86_5LEVEL
>  	/* Build Level 4 */
> -	leal	pgtable + 0(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007 (%edi), %eax
>  	movl	%eax, 0(%edi)
> +#endif
>  
>  	/* Build Level 3 */
> -	leal	pgtable + 0x1000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007(%edi), %eax
>  	movl	$4, %ecx
>  1:	movl	%eax, 0x00(%edi)
> @@ -152,7 +166,8 @@ ENTRY(startup_32)
>  	jnz	1b
>  
>  	/* Build Level 2 */
> -	leal	pgtable + 0x2000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	movl	$0x00000183, %eax
>  	movl	$2048, %ecx
>  1:	movl	%eax, 0(%edi)

I realize that you had difficulties converting this to C, but it's not going to 
get any easier in the future either, with one more paging mode/level added!

If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
trivial .c, build and link it in and call it separately. Then once that works, 
move functionality from asm to C step by step and test it at every step.

I've applied the first two patches of this series, but we really should convert 
this assembly bit to C too.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-11  7:02     ` Ingo Molnar
  0 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-11  7:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> This patch adds support for 5-level paging during early boot.
> It generalizes boot for 4- and 5-level paging on 64-bit systems with
> compile-time switch between them.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
>  arch/x86/include/asm/pgtable_64.h           |  2 ++
>  arch/x86/include/uapi/asm/processor-flags.h |  2 ++
>  arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
>  arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
>  5 files changed, 85 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index d2ae1f821e0c..3ed26769810b 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -122,9 +122,12 @@ ENTRY(startup_32)
>  	addl	%ebp, gdt+2(%ebp)
>  	lgdt	gdt(%ebp)
>  
> -	/* Enable PAE mode */
> +	/* Enable PAE and LA57 mode */
>  	movl	%cr4, %eax
>  	orl	$X86_CR4_PAE, %eax
> +#ifdef CONFIG_X86_5LEVEL
> +	orl	$X86_CR4_LA57, %eax
> +#endif
>  	movl	%eax, %cr4
>  
>   /*
> @@ -136,13 +139,24 @@ ENTRY(startup_32)
>  	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
>  	rep	stosl
>  
> +	xorl	%edx, %edx
> +
> +	/* Build Top Level */
> +	leal	pgtable(%ebx,%edx,1), %edi
> +	leal	0x1007 (%edi), %eax
> +	movl	%eax, 0(%edi)
> +
> +#ifdef CONFIG_X86_5LEVEL
>  	/* Build Level 4 */
> -	leal	pgtable + 0(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007 (%edi), %eax
>  	movl	%eax, 0(%edi)
> +#endif
>  
>  	/* Build Level 3 */
> -	leal	pgtable + 0x1000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	leal	0x1007(%edi), %eax
>  	movl	$4, %ecx
>  1:	movl	%eax, 0x00(%edi)
> @@ -152,7 +166,8 @@ ENTRY(startup_32)
>  	jnz	1b
>  
>  	/* Build Level 2 */
> -	leal	pgtable + 0x2000(%ebx), %edi
> +	addl	$0x1000, %edx
> +	leal	pgtable(%ebx,%edx), %edi
>  	movl	$0x00000183, %eax
>  	movl	$2048, %ecx
>  1:	movl	%eax, 0(%edi)

I realize that you had difficulties converting this to C, but it's not going to 
get any easier in the future either, with one more paging mode/level added!

If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
trivial .c, build and link it in and call it separately. Then once that works, 
move functionality from asm to C step by step and test it at every step.

I've applied the first two patches of this series, but we really should convert 
this assembly bit to C too.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C
  2017-04-06 14:00   ` Kirill A. Shutemov
  (?)
@ 2017-04-11  7:58   ` tip-bot for Kirill A. Shutemov
  2017-04-11  8:54     ` Ingo Molnar
  -1 siblings, 1 reply; 101+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2017-04-11  7:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dave.hansen, tglx, luto, kirill.shutemov, hpa, peterz, torvalds,
	akpm, mingo, linux-kernel

Commit-ID:  10ee822ec8c53f3987d969ce98b2686323a7fc2a
Gitweb:     http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 11 Apr 2017 08:57:37 +0200

x86/boot/64: Rewrite startup_64() in C

The patch converts most of the startup_64 logic from assembly to C.

This is preparation for 5-level paging enabling.

No change in functionality.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170406140106.78087-2-kirill.shutemov@linux.intel.com
[ Small typo fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 93 +----------------------------------------------
 2 files changed, 81 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002..079b382 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fix up the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fix up phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327..9656c59 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,9 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	call	__startup_64
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip:x86/mm] x86/boot/64: Rename init_level4_pgt() and early_level4_pgt[]
  2017-04-06 14:01   ` Kirill A. Shutemov
  (?)
@ 2017-04-11  7:59   ` tip-bot for Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2017-04-11  7:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kirill.shutemov, torvalds, akpm, hpa, peterz, linux-kernel, luto,
	mingo, dave.hansen, tglx

Commit-ID:  8c86769544abd97bb9b55ae0eaa91d65b56f2272
Gitweb:     http://git.kernel.org/tip/8c86769544abd97bb9b55ae0eaa91d65b56f2272
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Thu, 6 Apr 2017 17:01:00 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 11 Apr 2017 08:57:37 +0200

x86/boot/64: Rename init_level4_pgt() and early_level4_pgt[]

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt() and
early_top_pgt[].

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170406140106.78087-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482a..77037b6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea312..affcb2a 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1..6b91e2e 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 079b382..9b759f8 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c59..d44c350 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -75,7 +75,7 @@ startup_64:
 	leaq	_text(%rip), %rdi
 	call	__startup_64
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -95,7 +95,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -326,7 +326,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -336,14 +336,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b3..42f502b 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 9f305be..6680cef 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d812..88215ac 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f1..dc0836d 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038..7c2081f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e24671..e1a5fbe 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C
  2017-04-11  7:58   ` [tip:x86/mm] x86/boot/64: Rewrite startup_64() " tip-bot for Kirill A. Shutemov
@ 2017-04-11  8:54     ` Ingo Molnar
  2017-04-11 12:29       ` Kirill A. Shutemov
  0 siblings, 1 reply; 101+ messages in thread
From: Ingo Molnar @ 2017-04-11  8:54 UTC (permalink / raw)
  To: kirill.shutemov, dave.hansen, luto, tglx, linux-kernel, akpm,
	torvalds, hpa, peterz
  Cc: linux-tip-commits

[-- Attachment #1: Type: text/plain, Size: 1652 bytes --]


* tip-bot for Kirill A. Shutemov <tipbot@zytor.com> wrote:

> Commit-ID:  10ee822ec8c53f3987d969ce98b2686323a7fc2a
> Gitweb:     http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Tue, 11 Apr 2017 08:57:37 +0200
> 
> x86/boot/64: Rewrite startup_64() in C
> 
> The patch converts most of the startup_64 logic from assembly to C.
> 
> This is preparation for 5-level paging enabling.
> 
> No change in functionality.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-mm@kvack.org
> Link: http://lkml.kernel.org/r/20170406140106.78087-2-kirill.shutemov@linux.intel.com
> [ Small typo fixes. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kernel/head_64.S | 93 +----------------------------------------------
>  2 files changed, 81 insertions(+), 93 deletions(-)

Hm, so I had to zap this commit as it broke booting on a 64-bit Intel and an AMD 
system as well, with defconfig-ish kernels.

I've attached the failing config. There's no serial console output at all, the 
boot just hangs after the Grub messages.

Thanks,

	Ingo

[-- Attachment #2: .config --]
[-- Type: text/plain, Size: 112277 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.11.0-rc6 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_FORCE=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_SYSFS_DEPRECATED_V2 is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_INITRAMFS_COMPRESSION=".gz"
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_POSIX_TIMERS=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
# CONFIG_BPF_SYSCALL is not set
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_ADVISE_SYSCALLS=y
# CONFIG_USERFAULTFD is not set
CONFIG_PCI_QUIRKS=y
CONFIG_MEMBARRIER=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SYSTEM_DATA_VERIFICATION is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_KEXEC_CORE=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_NMI=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_GCC_PLUGINS=y
# CONFIG_GCC_PLUGINS is not set
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
CONFIG_CC_STACKPROTECTOR_STRONG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_COPY_THREAD_TLS=y
CONFIG_HAVE_STACK_VALIDATION=y
# CONFIG_HAVE_ARCH_HASH is not set
# CONFIG_ISA_BUS_API is not set
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
# CONFIG_CPU_NO_EFFICIENT_FFS is not set
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX is not set
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_SCSI_REQUEST=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
# CONFIG_BLK_DEV_ZONED is not set
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_BLK_WBT is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_FAST_FEATURE_TESTS=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
# CONFIG_INTEL_RDT_A is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=128
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_THERMAL_VECTOR=y

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
# CONFIG_VM86 is not set
CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_X86_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=9
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_HAVE_GENERIC_GUP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_MOVABLE_NODE is not set
# CONFIG_HAVE_BOOTMEM_INFO_NODE is not set
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
# CONFIG_CMA is not set
# CONFIG_ZSWAP is not set
# CONFIG_ZPOOL is not set
# CONFIG_ZBUD is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT=y
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
# CONFIG_X86_INTEL_MPX is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# CONFIG_EFI_MIXED is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
# CONFIG_KEXEC_FILE is not set
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
# CONFIG_LEGACY_VSYSCALL_NATIVE is not set
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
# CONFIG_ACPI_NFIT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_DPTF_POWER is not set
# CONFIG_ACPI_EXTLOG is not set
# CONFIG_PMIC_OPREGION is not set
# CONFIG_ACPI_CONFIGFS is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_P4_CLOCKMOD=y

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_BUS_ADDR_T_64BIT=y
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT is not set

#
# PCI host controller drivers
#
# CONFIG_VMD is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_RAPIDIO is not set
# CONFIG_X86_SYSFB is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
# CONFIG_HAVE_AOUT is not set
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y
CONFIG_NET_INGRESS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_NET_IP_TUNNEL is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_NET_UDP_TUNNEL is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
CONFIG_IPV6_MIP6=y
# CONFIG_IPV6_ILA is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_FOU is not set
# CONFIG_IPV6_FOU_TUNNEL is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
# CONFIG_NF_LOG_NETDEV is not set
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=y
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV6 is not set
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
# CONFIG_NF_REJECT_IPV6 is not set
# CONFIG_NF_LOG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_SAMPLE is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_VLAN is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_NET_NCSI is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_NET_DROP_MONITOR=y
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_BREDR=y
# CONFIG_BT_RFCOMM is not set
CONFIG_BT_BNEP=y
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_HIDP is not set
CONFIG_BT_HS=y
CONFIG_BT_LE=y
# CONFIG_BT_LEDS is not set
# CONFIG_BT_SELFTEST is not set
CONFIG_BT_DEBUGFS=y

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_BT_MRVL is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_STREAM_PARSER is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
# CONFIG_DST_CACHE is not set
CONFIG_GRO_CELLS=y
# CONFIG_NET_DEVLINK is not set
CONFIG_MAY_USE_DEVLINK=y
CONFIG_HAVE_EBPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
# CONFIG_DMA_SHARED_BUFFER is not set

#
# Bus devices
#
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_VIRTIO_BLK_SCSI is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TARGET is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_SRAM is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC Bus Driver
#
# CONFIG_INTEL_MIC_BUS is not set

#
# SCIF Bus Driver
#
# CONFIG_SCIF_BUS is not set

#
# VOP Bus Driver
#
# CONFIG_VOP_BUS is not set

#
# Intel MIC Host Driver
#

#
# Intel MIC Card Driver
#

#
# SCIF Driver
#

#
# Intel MIC Coprocessor State Management (COSM) Drivers
#

#
# VOP Driver
#
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_CXL_BASE is not set
# CONFIG_CXL_AFU_DRIVER_OPS is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_MQ_DEFAULT is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
CONFIG_SCSI_SAS_ATTRS=y
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
CONFIG_SCSI_MPT3SAS=y
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=y
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_VIRTIO is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=y
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=y
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
CONFIG_PATA_VIA=y
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=y
CONFIG_ATA_GENERIC=y
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_MQ_DEFAULT is not set
CONFIG_DM_DEBUG=y
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_ERA is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_MACSEC is not set
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_ETHERNET=y
CONFIG_MDIO=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
CONFIG_VORTEX=y
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_AMD_XGBE_HAVE_ECC is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_NET_VENDOR_AURORA is not set
CONFIG_NET_CADENCE=y
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=y
# CONFIG_BNX2X is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_CX_ECAT is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
# CONFIG_NET_VENDOR_FUJITSU is not set
# CONFIG_NET_VENDOR_HP is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
# CONFIG_E1000 is not set
CONFIG_E1000E=y
CONFIG_E1000E_HWTS=y
CONFIG_IGB=y
CONFIG_IGB_HWMON=y
CONFIG_IGB_DCA=y
# CONFIG_IGBVF is not set
CONFIG_IXGB=y
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBE_DCA=y
# CONFIG_IXGBE_DCB is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_FM10K is not set
# CONFIG_NET_VENDOR_I825XX is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
CONFIG_SKGE=y
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKGE_GENESIS=y
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=y
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
# CONFIG_NET_VENDOR_SEEQ is not set
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_ALE is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set

#
# MDIO bus device drivers
#
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_OCTEON is not set
# CONFIG_MDIO_THUNDER is not set

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AT803X_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_ICPLUS_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
CONFIG_WLAN_VENDOR_ADMTEK=y
CONFIG_WLAN_VENDOR_ATH=y
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K_PCI is not set
CONFIG_WLAN_VENDOR_ATMEL=y
CONFIG_WLAN_VENDOR_BROADCOM=y
CONFIG_WLAN_VENDOR_CISCO=y
CONFIG_WLAN_VENDOR_INTEL=y
CONFIG_WLAN_VENDOR_INTERSIL=y
# CONFIG_HOSTAP is not set
# CONFIG_PRISM54 is not set
CONFIG_WLAN_VENDOR_MARVELL=y
CONFIG_WLAN_VENDOR_MEDIATEK=y
CONFIG_WLAN_VENDOR_RALINK=y
CONFIG_WLAN_VENDOR_REALTEK=y
CONFIG_WLAN_VENDOR_RSI=y
CONFIG_WLAN_VENDOR_ST=y
CONFIG_WLAN_VENDOR_TI=y
CONFIG_WLAN_VENDOR_ZYDAS=y
# CONFIG_PCMCIA_RAYCS is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set
# CONFIG_NVM is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_PROPERTIES=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_EKTF2127 is not set
# CONFIG_TOUCHSCREEN_ELAN is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MELFAS_MIP4 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WDT87XX_I2C is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2004 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_SILEAD is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_SX8654 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZET6223 is not set
# CONFIG_TOUCHSCREEN_ROHM_BU21023 is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVMEM=y
# CONFIG_DEVKMEM is not set

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_FSL is not set
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
# CONFIG_SERIAL_8250_MOXA is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_DEV_BUS is not set
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
# CONFIG_HW_RANDOM_VIRTIO is not set
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_SCR24X is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
# CONFIG_XILLYBUS is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
CONFIG_I2C_PIIX4=y
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set

#
# PPS support
#
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_AVS is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
CONFIG_SENSORS_K10TEMP=y
CONFIG_SENSORS_FAM15H_POWER=y
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_EMULATION is not set
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_PKG_TEMP_THERMAL=y
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# CONFIG_INTEL_PCH_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
# CONFIG_WATCHDOG_SYSFS is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
CONFIG_SP5100_TCO=y
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CROS_EC is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RTSX_PCI is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RTSX_USB is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SMSC is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS65218 is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_INTEL_GTT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
# CONFIG_DRM is not set

#
# ACP (Audio CoProcessor) Configuration
#
# CONFIG_DRM_LIB_RANDOM is not set

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_PROVIDE_GET_FB_UNMAPPED_AREA is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_PM8941_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_VGASTATE is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
# CONFIG_FRAMEBUFFER_CONSOLE is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_CMEDIA is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
CONFIG_LOGIWHEELS_FF=y
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MAYFLASH is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# I2C HID support
#
# CONFIG_I2C_HID is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set
# CONFIG_UCSI is not set

#
# USB Physical Layer drivers
#
# CONFIG_USB_PHY is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_LP5562 is not set
# CONFIG_LEDS_LP8501 is not set
# CONFIG_LEDS_LP8860 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_GHES=y
# CONFIG_EDAC_AMD64 is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_PND2 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV8803 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_HID_SENSOR_TIME is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_INTEL_IDMA64 is not set
CONFIG_INTEL_IOATDMA=y
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
CONFIG_HSU_DMA=y

#
# DMA Clients
#
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DMA_ENGINE_RAID=y

#
# DMABUF options
#
# CONFIG_SYNC_FILE is not set
CONFIG_DCA=y
CONFIG_AUXDISPLAY=y
# CONFIG_IMG_ASCII_LCD is not set
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO=y

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_BALLOON is not set
# CONFIG_VIRTIO_INPUT is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV_TSCPAGE is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_LAPTOP is not set
# CONFIG_DELL_SMO8800 is not set
# CONFIG_DELL_RBTN is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_AMILO_RFKILL is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_HP_WIRELESS is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_IDEAPAD_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_PMC_CORE is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_INTEL_OAKTRAIL is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_PVPANIC is not set
# CONFIG_INTEL_PMC_IPC is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_MLX_CPLD_PLATFORM is not set
# CONFIG_INTEL_TURBO_MAX_3 is not set
# CONFIG_SILEAD_DMI is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
CONFIG_CLKDEV_LOOKUP=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y

#
# Common Clock Framework
#
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_COMMON_CLK_NXP is not set
# CONFIG_COMMON_CLK_PXA is not set
# CONFIG_COMMON_CLK_PIC32 is not set

#
# Hardware Spinlock drivers
#

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# CONFIG_ATMEL_PIT is not set
# CONFIG_SH_TIMER_CMT is not set
# CONFIG_SH_TIMER_MTU2 is not set
# CONFIG_SH_TIMER_TMU is not set
# CONFIG_EM_TIMER_STI is not set
CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set

#
# Rpmsg drivers
#

#
# SOC (System On Chip) specific Drivers
#

#
# Broadcom SoC drivers
#
# CONFIG_SUNXI_SRAM is not set
# CONFIG_SOC_TI is not set
# CONFIG_SOC_ZTE is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set
CONFIG_ARM_GIC_MAX_NR=1
# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set
# CONFIG_FMC is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
CONFIG_RAS=y
# CONFIG_MCE_AMD_INJ is not set
# CONFIG_RAS_CEC is not set
# CONFIG_THUNDERBOLT is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_LIBNVDIMM is not set
# CONFIG_DEV_DAX is not set
# CONFIG_NVMEM is not set
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set

#
# FPGA Configuration Support
#
# CONFIG_FPGA is not set

#
# FSI support
#
# CONFIG_FSI is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=y
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
CONFIG_UEFI_CPER=y
# CONFIG_EFI_DEV_PATH_PARSER is not set

#
# Tegra firmware driver
#

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_FS_DAX is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
# CONFIG_FS_ENCRYPTION is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
# CONFIG_EFIVAR_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_ZLIB_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
# CONFIG_NFS_V3 is not set
# CONFIG_NFS_V4 is not set
# CONFIG_NFS_SWAP is not set
# CONFIG_ROOT_NFS is not set
CONFIG_NFS_DEBUG=y
# CONFIG_NFSD is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_DEBUG=y
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=y
# CONFIG_9P_FS_POSIX_ACL is not set
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_BOOT_PRINTK_DELAY=y
CONFIG_DYNAMIC_DEBUG=y

#
# Compile-time checks and compiler options
#
# CONFIG_DEBUG_INFO is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
# CONFIG_PAGE_OWNER is not set
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_STACK_VALIDATION is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
CONFIG_DEBUG_RODATA_TEST=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_HAVE_ARCH_KASAN=y
# CONFIG_KASAN is not set
CONFIG_ARCH_HAS_KCOV=y
# CONFIG_KCOV is not set
CONFIG_DEBUG_SHIRQ=y

#
# Debug Lockups and Hangs
#
# CONFIG_LOCKUP_DETECTOR is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# CONFIG_SCHED_STACK_END_CHECK is not set
# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PI_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_PROVE_RCU is not set
CONFIG_SPARSE_RCU_POINTER=y
# CONFIG_TORTURE_TEST is not set
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
# CONFIG_HWLAT_TRACER is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_TRACE_ENUM_MAP_FILE is not set

#
# Runtime Testing
#
# CONFIG_LKDTM is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_TEST_SORT is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
CONFIG_ATOMIC64_SELFTEST=y
# CONFIG_ASYNC_RAID6_TEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_TEST_STRING_HELPERS is not set
CONFIG_TEST_KSTRTOX=y
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_HASH is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_MEMTEST is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_DEFAULT_ENABLE=0x1
CONFIG_KDB_KEYBOARD=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_ARCH_WANTS_UBSAN_NO_NULL is not set
# CONFIG_UBSAN is not set
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set
CONFIG_EARLY_PRINTK_USB=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_EFI is not set
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_X86_PTDUMP_CORE is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_WX is not set
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
CONFIG_DEBUG_ENTRY=y
CONFIG_DEBUG_NMI_SELFTEST=y
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_BIG_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
CONFIG_HAVE_ARCH_HARDENED_USERCOPY=y
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_ASYNC_PQ=y
CONFIG_ASYNC_RAID6_RECOV=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
# CONFIG_CRYPTO_RSA is not set
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=y
# CONFIG_CRYPTO_MCRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=y
CONFIG_CRYPTO_GLUE_HELPER_X86=y
CONFIG_CRYPTO_ENGINE=m

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# CONFIG_CRYPTO_KEYWRAP is not set

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
CONFIG_CRYPTO_CRCT10DIF=y
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SHA1_MB is not set
# CONFIG_CRYPTO_SHA256_MB is not set
# CONFIG_CRYPTO_SHA512_MB is not set
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=y

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
CONFIG_CRYPTO_AES_X86_64=y
CONFIG_CRYPTO_AES_NI_INTEL=y
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
CONFIG_CRYPTO_DEV_VIRTIO=m
# CONFIG_ASYMMETRIC_KEY_TYPE is not set

#
# Certificates for signature checking
#
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_KVM_MMU_AUDIT=y
CONFIG_KVM_DEVICE_ASSIGNMENT=y
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=y
CONFIG_BITREVERSE=y
# CONFIG_HAVE_ARCH_BITREVERSE is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
# CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_RADIX_TREE_MULTIORDER=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
# CONFIG_DMA_NOOP_OPS is not set
# CONFIG_DMA_VIRT_OPS is not set
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
# CONFIG_IRQ_POLL is not set
CONFIG_UCS2_STRING=y
# CONFIG_SG_SPLIT is not set
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_SG_CHAIN=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_MMIO_FLUSH=y
CONFIG_SBITMAP=y

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11  7:02     ` Ingo Molnar
@ 2017-04-11 10:51       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> I realize that you had difficulties converting this to C, but it's not going to 
> get any easier in the future either, with one more paging mode/level added!
> 
> If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> trivial .c, build and link it in and call it separately. Then once that works, 
> move functionality from asm to C step by step and test it at every step.

I've described the specific issue with converting this code to C in cover
letter: how to make compiler to generate 32-bit code for a specific
function or translation unit, without breaking linking afterwards (-m32
break it).

I would be glad to convert it, but I'm stuck.

Do you have an idea how to get around the issue.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-11 10:51       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> I realize that you had difficulties converting this to C, but it's not going to 
> get any easier in the future either, with one more paging mode/level added!
> 
> If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> trivial .c, build and link it in and call it separately. Then once that works, 
> move functionality from asm to C step by step and test it at every step.

I've described the specific issue with converting this code to C in cover
letter: how to make compiler to generate 32-bit code for a specific
function or translation unit, without breaking linking afterwards (-m32
break it).

I would be glad to convert it, but I'm stuck.

Do you have an idea how to get around the issue.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 10:51       ` Kirill A. Shutemov
@ 2017-04-11 11:28         ` Ingo Molnar
  -1 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-11 11:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > I realize that you had difficulties converting this to C, but it's not going to 
> > get any easier in the future either, with one more paging mode/level added!
> > 
> > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > trivial .c, build and link it in and call it separately. Then once that works, 
> > move functionality from asm to C step by step and test it at every step.
> 
> I've described the specific issue with converting this code to C in cover
> letter: how to make compiler to generate 32-bit code for a specific
> function or translation unit, without breaking linking afterwards (-m32
> break it).

Have you tried putting it into a separate .c file, and building it 32-bit?

I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
code even on 64-bit kernels.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-11 11:28         ` Ingo Molnar
  0 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-11 11:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > I realize that you had difficulties converting this to C, but it's not going to 
> > get any easier in the future either, with one more paging mode/level added!
> > 
> > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > trivial .c, build and link it in and call it separately. Then once that works, 
> > move functionality from asm to C step by step and test it at every step.
> 
> I've described the specific issue with converting this code to C in cover
> letter: how to make compiler to generate 32-bit code for a specific
> function or translation unit, without breaking linking afterwards (-m32
> break it).

Have you tried putting it into a separate .c file, and building it 32-bit?

I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
code even on 64-bit kernels.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 11:28         ` Ingo Molnar
@ 2017-04-11 11:46           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 11:46 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 01:28:45PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > > I realize that you had difficulties converting this to C, but it's not going to 
> > > get any easier in the future either, with one more paging mode/level added!
> > > 
> > > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > > trivial .c, build and link it in and call it separately. Then once that works, 
> > > move functionality from asm to C step by step and test it at every step.
> > 
> > I've described the specific issue with converting this code to C in cover
> > letter: how to make compiler to generate 32-bit code for a specific
> > function or translation unit, without breaking linking afterwards (-m32
> > break it).
> 
> Have you tried putting it into a separate .c file, and building it 32-bit?

Yes, I have. The patch below fails linking:

ld: i386 architecture of input file `arch/x86/boot/compressed/head64.o' is incompatible with i386:x86-64 output

> 
> I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
> code even on 64-bit kernels.

I'll look closer (building proccess it's rather complicated), but my
understanding is that VDSO is stand-alone binary and doesn't really links
with the rest of the kernel, rather included as blob, no?

Andy, may be you have an idea?

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8c3868..8c1acacf408e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -76,6 +76,8 @@ vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
 ifdef CONFIG_X86_64
 	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/pagetable.o
+	vmlinux-objs-y += $(obj)/head64.o
+$(obj)/head64.o: KBUILD_CFLAGS := -m32 -D__KERNEL__ -O2
 endif
 
 $(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
diff --git a/arch/x86/boot/compressed/head64.c b/arch/x86/boot/compressed/head64.c
new file mode 100644
index 000000000000..42e1d64a15f4
--- /dev/null
+++ b/arch/x86/boot/compressed/head64.c
@@ -0,0 +1,3 @@
+void __startup32(void)
+{
+}
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-11 11:46           ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 11:46 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 01:28:45PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 09:02:03AM +0200, Ingo Molnar wrote:
> > > I realize that you had difficulties converting this to C, but it's not going to 
> > > get any easier in the future either, with one more paging mode/level added!
> > > 
> > > If you are stuck on where it breaks I'd suggest doing it gradually: first add a 
> > > trivial .c, build and link it in and call it separately. Then once that works, 
> > > move functionality from asm to C step by step and test it at every step.
> > 
> > I've described the specific issue with converting this code to C in cover
> > letter: how to make compiler to generate 32-bit code for a specific
> > function or translation unit, without breaking linking afterwards (-m32
> > break it).
> 
> Have you tried putting it into a separate .c file, and building it 32-bit?

Yes, I have. The patch below fails linking:

ld: i386 architecture of input file `arch/x86/boot/compressed/head64.o' is incompatible with i386:x86-64 output

> 
> I think arch/x86/entry/vdso/Makefile contains an example of how to build 32-bit 
> code even on 64-bit kernels.

I'll look closer (building proccess it's rather complicated), but my
understanding is that VDSO is stand-alone binary and doesn't really links
with the rest of the kernel, rather included as blob, no?

Andy, may be you have an idea?

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8c3868..8c1acacf408e 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -76,6 +76,8 @@ vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
 vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
 ifdef CONFIG_X86_64
 	vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/pagetable.o
+	vmlinux-objs-y += $(obj)/head64.o
+$(obj)/head64.o: KBUILD_CFLAGS := -m32 -D__KERNEL__ -O2
 endif
 
 $(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
diff --git a/arch/x86/boot/compressed/head64.c b/arch/x86/boot/compressed/head64.c
new file mode 100644
index 000000000000..42e1d64a15f4
--- /dev/null
+++ b/arch/x86/boot/compressed/head64.c
@@ -0,0 +1,3 @@
+void __startup32(void)
+{
+}
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip:x86/mm] x86/boot/64: Rewrite startup_64() in C
  2017-04-11  8:54     ` Ingo Molnar
@ 2017-04-11 12:29       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-11 12:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dave.hansen, luto, tglx, linux-kernel, akpm, torvalds, hpa,
	peterz, linux-tip-commits

On Tue, Apr 11, 2017 at 10:54:41AM +0200, Ingo Molnar wrote:
> 
> * tip-bot for Kirill A. Shutemov <tipbot@zytor.com> wrote:
> 
> > Commit-ID:  10ee822ec8c53f3987d969ce98b2686323a7fc2a
> > Gitweb:     http://git.kernel.org/tip/10ee822ec8c53f3987d969ce98b2686323a7fc2a
> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > AuthorDate: Thu, 6 Apr 2017 17:00:59 +0300
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Tue, 11 Apr 2017 08:57:37 +0200
> > 
> > x86/boot/64: Rewrite startup_64() in C
> > 
> > The patch converts most of the startup_64 logic from assembly to C.
> > 
> > This is preparation for 5-level paging enabling.
> > 
> > No change in functionality.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Link: http://lkml.kernel.org/r/20170406140106.78087-2-kirill.shutemov@linux.intel.com
> > [ Small typo fixes. ]
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > ---
> >  arch/x86/kernel/head64.c  | 81 ++++++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kernel/head_64.S | 93 +----------------------------------------------
> >  2 files changed, 81 insertions(+), 93 deletions(-)
> 
> Hm, so I had to zap this commit as it broke booting on a 64-bit Intel and an AMD 
> system as well, with defconfig-ish kernels.

The fixup is below. Maybe there's better more idiomatic way, but I'm not
really into assembly.

Basically, we need to preserve %rsi across call __startup_64.

I will post whole thing again once we sort out what to do with the reset
of assembly code I've touched. Or you can apply the fixup, if you feel to.

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9656c5951b98..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -73,7 +73,9 @@ startup_64:
 	call verify_cpu
 
 	leaq	_text(%rip), %rdi
+	pushq	%rsi
 	call	__startup_64
+	popq	%rsi
 
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 11:46           ` Kirill A. Shutemov
@ 2017-04-11 14:09             ` Andi Kleen
  -1 siblings, 0 replies; 101+ messages in thread
From: Andi Kleen @ 2017-04-11 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

> I'll look closer (building proccess it's rather complicated), but my
> understanding is that VDSO is stand-alone binary and doesn't really links
> with the rest of the kernel, rather included as blob, no?
> 
> Andy, may be you have an idea?

There isn't any way I know of to directly link them together. The ELF 
format wasn't designed for that. You would need to merge blobs and then use
manual jump vectors, like the 16bit startup code does. It would be likely
complicated and ugly.

-Andi

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-11 14:09             ` Andi Kleen
  0 siblings, 0 replies; 101+ messages in thread
From: Andi Kleen @ 2017-04-11 14:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

> I'll look closer (building proccess it's rather complicated), but my
> understanding is that VDSO is stand-alone binary and doesn't really links
> with the rest of the kernel, rather included as blob, no?
> 
> Andy, may be you have an idea?

There isn't any way I know of to directly link them together. The ELF 
format wasn't designed for that. You would need to merge blobs and then use
manual jump vectors, like the 16bit startup code does. It would be likely
complicated and ugly.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-11 14:09             ` Andi Kleen
@ 2017-04-12 10:18               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > I'll look closer (building proccess it's rather complicated), but my
> > understanding is that VDSO is stand-alone binary and doesn't really links
> > with the rest of the kernel, rather included as blob, no?
> > 
> > Andy, may be you have an idea?
> 
> There isn't any way I know of to directly link them together. The ELF 
> format wasn't designed for that. You would need to merge blobs and then use
> manual jump vectors, like the 16bit startup code does. It would be likely
> complicated and ugly.

Ingo, can we proceed without coverting this assembly to C?

I'm committed to convert it to C later if we'll find reasonable solution
to the issue.

We're pretty late into release cycle. It would be nice to give the whole
thing time in tip/master and -next before the merge window.

Can I repost part 4?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-12 10:18               ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > I'll look closer (building proccess it's rather complicated), but my
> > understanding is that VDSO is stand-alone binary and doesn't really links
> > with the rest of the kernel, rather included as blob, no?
> > 
> > Andy, may be you have an idea?
> 
> There isn't any way I know of to directly link them together. The ELF 
> format wasn't designed for that. You would need to merge blobs and then use
> manual jump vectors, like the 16bit startup code does. It would be likely
> complicated and ugly.

Ingo, can we proceed without coverting this assembly to C?

I'm committed to convert it to C later if we'll find reasonable solution
to the issue.

We're pretty late into release cycle. It would be nice to give the whole
thing time in tip/master and -next before the merge window.

Can I repost part 4?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-07 15:59       ` Kirill A. Shutemov
@ 2017-04-12 10:41         ` Michael Ellerman
  -1 siblings, 0 replies; 101+ messages in thread
From: Michael Ellerman @ 2017-04-12 10:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov, Aneesh Kumar K.V

Hi Kirill,

I'm interested in this because we're doing pretty much the same thing on
powerpc at the moment, and I want to make sure x86 & powerpc end up with
compatible behaviour.

"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
> I don't think we should ever enable full address space for all
> applications. There's no point.
>
> /bin/true doesn't need more than 64TB of virtual memory.
> And I hope never will.
>
> By increasing virtual address space for everybody we will pay (assuming
> current page table format) at least one extra page per process for moving
> stack at very end of address space.

That assumes the current layout though, it could be different.

> Yes, you can gain something in security by having more bits for ASLR, but
> I don't think it worth the cost.

It may not be worth the cost now, for you, but that trade off will be
different for other people and at other times.

So I think it's quite likely some folks will be interested in the full
address range for ASLR.

>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
> I think the reasonable way for an application to claim it's 63-bit clean
> is to make allocations with (void *)-1 as hint address.

I do like the simplicity of that.

But I wouldn't be surprised if some (crappy) code out there already
passes an address of -1. Probably it won't break if it starts getting
high addresses, but who knows.

An alternative would be to only interpret the hint as requesting a large
address if it's >= 64TB && < TASK_SIZE_MAX.

If we're really worried about breaking userspace then a new MMAP flag
seems like the safest option?

I don't feel particularly strongly about any option, but like I said my
main concern is that x86 & powerpc end up with the same behaviour.

And whatever we end up with someone will need to do an update to the man
page for mmap.

cheers

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-12 10:41         ` Michael Ellerman
  0 siblings, 0 replies; 101+ messages in thread
From: Michael Ellerman @ 2017-04-12 10:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Anshuman Khandual
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	Dmitry Safonov, Aneesh Kumar K.V

Hi Kirill,

I'm interested in this because we're doing pretty much the same thing on
powerpc at the moment, and I want to make sure x86 & powerpc end up with
compatible behaviour.

"Kirill A. Shutemov" <kirill@shutemov.name> writes:
> On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
>> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use higher bits in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> > 
>> > To mitigate this, we are not going to allocate virtual address space
>> > above 47-bit by default.
>> 
>> I am wondering if the commitment of virtual space range to the
>> user space is kind of an API which needs to be maintained there
>> after. If that is the case then we need to have some plans when
>> increasing it from the current level.
>
> I don't think we should ever enable full address space for all
> applications. There's no point.
>
> /bin/true doesn't need more than 64TB of virtual memory.
> And I hope never will.
>
> By increasing virtual address space for everybody we will pay (assuming
> current page table format) at least one extra page per process for moving
> stack at very end of address space.

That assumes the current layout though, it could be different.

> Yes, you can gain something in security by having more bits for ASLR, but
> I don't think it worth the cost.

It may not be worth the cost now, for you, but that trade off will be
different for other people and at other times.

So I think it's quite likely some folks will be interested in the full
address range for ASLR.

>> expanding the address range next time around. I think we need
>> to have a plan for this and particularly around 'hint' mechanism
>> and whether it should be decided per mmap() request or at the
>> task level.
>
> I think the reasonable way for an application to claim it's 63-bit clean
> is to make allocations with (void *)-1 as hint address.

I do like the simplicity of that.

But I wouldn't be surprised if some (crappy) code out there already
passes an address of -1. Probably it won't break if it starts getting
high addresses, but who knows.

An alternative would be to only interpret the hint as requesting a large
address if it's >= 64TB && < TASK_SIZE_MAX.

If we're really worried about breaking userspace then a new MMAP flag
seems like the safest option?

I don't feel particularly strongly about any option, but like I said my
main concern is that x86 & powerpc end up with the same behaviour.

And whatever we end up with someone will need to do an update to the man
page for mmap.

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-12 10:41         ` Michael Ellerman
@ 2017-04-12 11:11           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 11:11 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov, Aneesh Kumar K.V

On Wed, Apr 12, 2017 at 08:41:29PM +1000, Michael Ellerman wrote:
> Hi Kirill,
> 
> I'm interested in this because we're doing pretty much the same thing on
> powerpc at the moment, and I want to make sure x86 & powerpc end up with
> compatible behaviour.
> 
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> > On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> >> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use higher bits in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> > 
> >> > To mitigate this, we are not going to allocate virtual address space
> >> > above 47-bit by default.
> >> 
> >> I am wondering if the commitment of virtual space range to the
> >> user space is kind of an API which needs to be maintained there
> >> after. If that is the case then we need to have some plans when
> >> increasing it from the current level.
> >
> > I don't think we should ever enable full address space for all
> > applications. There's no point.
> >
> > /bin/true doesn't need more than 64TB of virtual memory.
> > And I hope never will.
> >
> > By increasing virtual address space for everybody we will pay (assuming
> > current page table format) at least one extra page per process for moving
> > stack at very end of address space.
> 
> That assumes the current layout though, it could be different.

True.

> > Yes, you can gain something in security by having more bits for ASLR, but
> > I don't think it worth the cost.
> 
> It may not be worth the cost now, for you, but that trade off will be
> different for other people and at other times.
> 
> So I think it's quite likely some folks will be interested in the full
> address range for ASLR.

We always can extend interface if/when userspace demand materialize.

Let's not invent interfaces unless we're sure there's demand.

> >> expanding the address range next time around. I think we need
> >> to have a plan for this and particularly around 'hint' mechanism
> >> and whether it should be decided per mmap() request or at the
> >> task level.
> >
> > I think the reasonable way for an application to claim it's 63-bit clean
> > is to make allocations with (void *)-1 as hint address.
> 
> I do like the simplicity of that.
> 
> But I wouldn't be surprised if some (crappy) code out there already
> passes an address of -1. Probably it won't break if it starts getting
> high addresses, but who knows.

To make an application break we need two thing:

 - it sets hint address to -1 by mistake;
 - it uses upper bit to encode its info;

I would be surprise if such combination exists in real world.

But let me know if you have any particular code in mind.

> An alternative would be to only interpret the hint as requesting a large
> address if it's >= 64TB && < TASK_SIZE_MAX.

Nope. That doesn't work if you take into accounting further extension of the
address space.

Consider extension x86 to 6-level page tables. User-space has 63-bit
address space. TASK_SIZE_MAX is bumped to (1UL << 63) - PAGE_SIZE.

An application wants access to full address space. It gets recompiled
using new TASK_SIZE_MAX as hint address. And everything works fine.

But only on machine with 6-level paging enabled.

If we run the same application binary on machine with older kernel and
5-level paging, the application will get access to only 47-bit address
space, not 56-bit, as hint address is more than TASK_SIZE_MAX in this
configuration.

> If we're really worried about breaking userspace then a new MMAP flag
> seems like the safest option?
> 
> I don't feel particularly strongly about any option, but like I said my
> main concern is that x86 & powerpc end up with the same behaviour.
> 
> And whatever we end up with someone will need to do an update to the man
> page for mmap.

Sure.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-12 11:11           ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-12 11:11 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Anshuman Khandual, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Dmitry Safonov, Aneesh Kumar K.V

On Wed, Apr 12, 2017 at 08:41:29PM +1000, Michael Ellerman wrote:
> Hi Kirill,
> 
> I'm interested in this because we're doing pretty much the same thing on
> powerpc at the moment, and I want to make sure x86 & powerpc end up with
> compatible behaviour.
> 
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> > On Fri, Apr 07, 2017 at 07:05:26PM +0530, Anshuman Khandual wrote:
> >> On 04/06/2017 07:31 PM, Kirill A. Shutemov wrote:
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use higher bits in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> > 
> >> > To mitigate this, we are not going to allocate virtual address space
> >> > above 47-bit by default.
> >> 
> >> I am wondering if the commitment of virtual space range to the
> >> user space is kind of an API which needs to be maintained there
> >> after. If that is the case then we need to have some plans when
> >> increasing it from the current level.
> >
> > I don't think we should ever enable full address space for all
> > applications. There's no point.
> >
> > /bin/true doesn't need more than 64TB of virtual memory.
> > And I hope never will.
> >
> > By increasing virtual address space for everybody we will pay (assuming
> > current page table format) at least one extra page per process for moving
> > stack at very end of address space.
> 
> That assumes the current layout though, it could be different.

True.

> > Yes, you can gain something in security by having more bits for ASLR, but
> > I don't think it worth the cost.
> 
> It may not be worth the cost now, for you, but that trade off will be
> different for other people and at other times.
> 
> So I think it's quite likely some folks will be interested in the full
> address range for ASLR.

We always can extend interface if/when userspace demand materialize.

Let's not invent interfaces unless we're sure there's demand.

> >> expanding the address range next time around. I think we need
> >> to have a plan for this and particularly around 'hint' mechanism
> >> and whether it should be decided per mmap() request or at the
> >> task level.
> >
> > I think the reasonable way for an application to claim it's 63-bit clean
> > is to make allocations with (void *)-1 as hint address.
> 
> I do like the simplicity of that.
> 
> But I wouldn't be surprised if some (crappy) code out there already
> passes an address of -1. Probably it won't break if it starts getting
> high addresses, but who knows.

To make an application break we need two thing:

 - it sets hint address to -1 by mistake;
 - it uses upper bit to encode its info;

I would be surprise if such combination exists in real world.

But let me know if you have any particular code in mind.

> An alternative would be to only interpret the hint as requesting a large
> address if it's >= 64TB && < TASK_SIZE_MAX.

Nope. That doesn't work if you take into accounting further extension of the
address space.

Consider extension x86 to 6-level page tables. User-space has 63-bit
address space. TASK_SIZE_MAX is bumped to (1UL << 63) - PAGE_SIZE.

An application wants access to full address space. It gets recompiled
using new TASK_SIZE_MAX as hint address. And everything works fine.

But only on machine with 6-level paging enabled.

If we run the same application binary on machine with older kernel and
5-level paging, the application will get access to only 47-bit address
space, not 56-bit, as hint address is more than TASK_SIZE_MAX in this
configuration.

> If we're really worried about breaking userspace then a new MMAP flag
> seems like the safest option?
> 
> I don't feel particularly strongly about any option, but like I said my
> main concern is that x86 & powerpc end up with the same behaviour.
> 
> And whatever we end up with someone will need to do an update to the man
> page for mmap.

Sure.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4
  2017-04-07 11:32             ` Dmitry Safonov
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's updated version the fourth and the last bunch of of patches that brings
initial 5-level paging enabling.

Please review and consider applying.

The situation with assembly hasn't changed much. I still not see a way to get
it work.

In this version I've included patch to fix comment in return_from_SYSCALL_64,
fixed bug in coverting startup_64 to C and updated the patch which allows to
opt-in full address space.

Kirill A. Shutemov (9):
  x86/asm: Fix comment in return_from_SYSCALL_64
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/entry/entry_64.S                   |   3 +-
 arch/x86/include/asm/elf.h                  |   4 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |  11 ++-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 134 +++++++--------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  30 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   6 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 25 files changed, 470 insertions(+), 188 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4
@ 2017-04-13 11:30               ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here's updated version the fourth and the last bunch of of patches that brings
initial 5-level paging enabling.

Please review and consider applying.

The situation with assembly hasn't changed much. I still not see a way to get
it work.

In this version I've included patch to fix comment in return_from_SYSCALL_64,
fixed bug in coverting startup_64 to C and updated the patch which allows to
opt-in full address space.

Kirill A. Shutemov (9):
  x86/asm: Fix comment in return_from_SYSCALL_64
  x86/boot/64: Rewrite startup_64 in C
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Add support of additional page table level during early
    boot
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add support for 5-level paging for KASLR
  x86: Enable 5-level paging support
  x86/mm: Allow to have userspace mappings above 47-bits

 arch/x86/Kconfig                            |   5 +
 arch/x86/boot/compressed/head_64.S          |  23 ++++-
 arch/x86/entry/entry_64.S                   |   3 +-
 arch/x86/include/asm/elf.h                  |   4 +-
 arch/x86/include/asm/mpx.h                  |   9 ++
 arch/x86/include/asm/pgtable.h              |   2 +-
 arch/x86/include/asm/pgtable_64.h           |   6 +-
 arch/x86/include/asm/processor.h            |  11 ++-
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/espfix_64.c                 |   2 +-
 arch/x86/kernel/head64.c                    | 137 +++++++++++++++++++++++++---
 arch/x86/kernel/head_64.S                   | 134 +++++++--------------------
 arch/x86/kernel/machine_kexec_64.c          |   2 +-
 arch/x86/kernel/sys_x86_64.c                |  30 +++++-
 arch/x86/mm/dump_pagetables.c               |   2 +-
 arch/x86/mm/hugetlbpage.c                   |  27 +++++-
 arch/x86/mm/init_64.c                       | 104 +++++++++++++++++++--
 arch/x86/mm/kasan_init_64.c                 |  12 +--
 arch/x86/mm/kaslr.c                         |  81 ++++++++++++----
 arch/x86/mm/mmap.c                          |   6 +-
 arch/x86/mm/mpx.c                           |  33 ++++++-
 arch/x86/realmode/init.c                    |   2 +-
 arch/x86/xen/Kconfig                        |   1 +
 arch/x86/xen/mmu.c                          |  18 ++--
 arch/x86/xen/xen-pvh.S                      |   2 +-
 25 files changed, 470 insertions(+), 188 deletions(-)

-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

On x86-64 __VIRTUAL_MASK_SHIFT depends on paging mode now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/entry/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 607d72c4a485..edec30584eb8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -266,7 +266,8 @@ return_from_SYSCALL_64:
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 *
-	 * Change top 16 bits to be the sign-extension of 47th bit
+	 * Change top bits to match most significant bit (47th or 56th bit
+	 * depending on paging mode) in the address.
 	 */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

On x86-64 __VIRTUAL_MASK_SHIFT depends on paging mode now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/entry/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 607d72c4a485..edec30584eb8 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -266,7 +266,8 @@ return_from_SYSCALL_64:
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 *
-	 * Change top 16 bits to be the sign-extension of 47th bit
+	 * Change top bits to match most significant bit (47th or 56th bit
+	 * depending on paging mode) in the address.
 	 */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

The patch write most of startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/head64.c  | 81 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 95 ++---------------------------------------------
 2 files changed, 83 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,11 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	pushq	%rsi
+	call	__startup_64
+	popq	%rsi
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

The patch write most of startup_64 logic in C.

This is preparation for 5-level paging enabling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/head64.c  | 81 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/head_64.S | 95 ++---------------------------------------------
 2 files changed, 83 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 43b7002f44fb..dbb5b29bf019 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -35,9 +35,88 @@
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
-static unsigned int __initdata next_early_pgt = 2;
+static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
+static void __init *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+	return ptr - (void *)_text + (void *)physaddr;
+}
+
+void __init __startup_64(unsigned long physaddr)
+{
+	unsigned long load_delta, *p;
+	pgdval_t *pgd;
+	pudval_t *pud;
+	pmdval_t *pmd, pmd_entry;
+	int i;
+
+	/* Is the address too large? */
+	if (physaddr >> MAX_PHYSMEM_BITS)
+		for (;;);
+
+	/*
+	 * Compute the delta between the address I am compiled to run at
+	 * and the address I am actually running at.
+	 */
+	load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+
+	/* Is the address not 2M aligned? */
+	if (load_delta & ~PMD_PAGE_MASK)
+		for (;;);
+
+	/* Fixup the physical addresses in the page table */
+
+	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
+
+	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
+	pud[510] += load_delta;
+	pud[511] += load_delta;
+
+	pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+	pmd[506] += load_delta;
+
+	/*
+	 * Set up the identity mapping for the switchover.  These
+	 * entries should *NOT* have the global bit set!  This also
+	 * creates a bunch of nonsense entries but that is fine --
+	 * it avoids problems around wraparound.
+	 */
+
+	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+
+	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
+	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
+
+	pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
+	pmd_entry +=  physaddr;
+
+	for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++)
+		pmd[i + (physaddr >> PMD_SHIFT)] = pmd_entry + i * PMD_SIZE;
+
+	/*
+	 * Fixup the kernel text+data virtual addresses. Note that
+	 * we might write invalid pmds, when the kernel is relocated
+	 * cleanup_highmap() fixes this up along with the mappings
+	 * beyond _end.
+	 */
+
+	pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd[i] & _PAGE_PRESENT)
+			pmd[i] += load_delta;
+	}
+
+	/* Fixup phys_base */
+	p = fixup_pointer(&phys_base, physaddr);
+	*p += load_delta;
+}
+
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index ac9d327d2e42..1432d530fa35 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -72,100 +72,11 @@ startup_64:
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	/*
-	 * Compute the delta between the address I am compiled to run at and the
-	 * address I am actually running at.
-	 */
-	leaq	_text(%rip), %rbp
-	subq	$_text - __START_KERNEL_map, %rbp
-
-	/* Is the address not 2M aligned? */
-	testl	$~PMD_PAGE_MASK, %ebp
-	jnz	bad_address
-
-	/*
-	 * Is the address too large?
-	 */
-	leaq	_text(%rip), %rax
-	shrq	$MAX_PHYSMEM_BITS, %rax
-	jnz	bad_address
-
-	/*
-	 * Fixup the physical addresses in the page table
-	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
-
-	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
-	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
-
-	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)
-
-	/*
-	 * Set up the identity mapping for the switchover.  These
-	 * entries should *NOT* have the global bit set!  This also
-	 * creates a bunch of nonsense entries but that is fine --
-	 * it avoids problems around wraparound.
-	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
-
-	movq	%rdi, %rax
-	shrq	$PGDIR_SHIFT, %rax
-
-	leaq	(PAGE_SIZE + _KERNPG_TABLE)(%rbx), %rdx
-	movq	%rdx, 0(%rbx,%rax,8)
-	movq	%rdx, 8(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE, %rdx
-	movq	%rdi, %rax
-	shrq	$PUD_SHIFT, %rax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-	incl	%eax
-	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
-
-	addq	$PAGE_SIZE * 2, %rbx
-	movq	%rdi, %rax
-	shrq	$PMD_SHIFT, %rdi
-	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
-	leaq	(_end - 1)(%rip), %rcx
-	shrq	$PMD_SHIFT, %rcx
-	subq	%rdi, %rcx
-	incl	%ecx
+	pushq	%rsi
+	call	__startup_64
+	popq	%rsi
 
-1:
-	andq	$(PTRS_PER_PMD - 1), %rdi
-	movq	%rax, (%rbx,%rdi,8)
-	incq	%rdi
-	addq	$PMD_SIZE, %rax
-	decl	%ecx
-	jnz	1b
-
-	test %rbp, %rbp
-	jz .Lskip_fixup
-
-	/*
-	 * Fixup the kernel text+data virtual addresses. Note that
-	 * we might write invalid pmds, when the kernel is relocated
-	 * cleanup_highmap() fixes this up along with the mappings
-	 * beyond _end.
-	 */
-	leaq	level2_kernel_pgt(%rip), %rdi
-	leaq	PAGE_SIZE(%rdi), %r8
-	/* See if it is a valid page table entry */
-1:	testb	$_PAGE_PRESENT, 0(%rdi)
-	jz	2f
-	addq	%rbp, 0(%rdi)
-	/* Go to the next page */
-2:	addq	$8, %rdi
-	cmp	%r8, %rdi
-	jne	1b
-
-	/* Fixup phys_base */
-	addq	%rbp, phys_base(%rip)
-
-.Lskip_fixup:
 	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1432d530fa35..0ae0bad4d4d5 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -77,7 +77,7 @@ startup_64:
 	call	__startup_64
 	popq	%rsi
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -97,7 +97,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -328,7 +328,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -338,14 +338,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index bce6990b1d81..0470826d2bdc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables.

Let's give these variable more generic names: init_top_pgt and
early_top_pgt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h     |  2 +-
 arch/x86/include/asm/pgtable_64.h  |  4 ++--
 arch/x86/kernel/espfix_64.c        |  2 +-
 arch/x86/kernel/head64.c           | 18 +++++++++---------
 arch/x86/kernel/head_64.S          | 14 +++++++-------
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/mm/dump_pagetables.c      |  2 +-
 arch/x86/mm/kasan_init_64.c        | 12 ++++++------
 arch/x86/realmode/init.c           |  2 +-
 arch/x86/xen/mmu.c                 | 18 +++++++++---------
 arch/x86/xen/xen-pvh.S             |  2 +-
 11 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 942482ac36a8..77037b6f1caa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -922,7 +922,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 12ea31274eb6..affcb2a9c563 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -20,9 +20,9 @@ extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dbb5b29bf019..c46e0f62024e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -33,7 +33,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -67,7 +67,7 @@ void __init __startup_64(unsigned long physaddr)
 
 	/* Fixup the physical addresses in the page table */
 
-	pgd = fixup_pointer(&early_level4_pgt, physaddr);
+	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
@@ -120,9 +120,9 @@ void __init __startup_64(unsigned long physaddr)
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -134,11 +134,11 @@ int __init early_make_pgtable(unsigned long address)
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -235,7 +235,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -250,8 +250,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1432d530fa35..0ae0bad4d4d5 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -77,7 +77,7 @@ startup_64:
 	call	__startup_64
 	popq	%rsi
 
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -97,7 +97,7 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
 	/* Enable PAE mode and PGE */
@@ -328,7 +328,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -338,14 +338,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_level4_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + L4_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 085c3b300d32..42f502b45e62 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -342,7 +342,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index bce6990b1d81..0470826d2bdc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -431,7 +431,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0c7d8129bed6..88215ac16b24 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -12,7 +12,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
 static int __init map_range(struct range *range)
@@ -109,8 +109,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -121,8 +121,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -148,7 +148,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index f226038a39ca..7c2081f78a19 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1531,8 +1531,8 @@ static void xen_write_cr3(unsigned long cr3)
  * At the start of the day - when Xen launches a guest, it has already
  * built pagetables for the guest. We diligently look over them
  * in xen_setup_kernel_pagetable and graft as appropriate them in the
- * init_level4_pgt and its friends. Then when we are happy we load
- * the new init_level4_pgt - and continue on.
+ * init_top_pgt and its friends. Then when we are happy we load
+ * the new init_top_pgt - and continue on.
  *
  * The generic code starts (start_kernel) and 'init_mem_mapping' sets
  * up the rest of the pagetables. When it has completed it loads the cr3.
@@ -1975,13 +1975,13 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	pt_end = pt_base + xen_start_info->nr_pt_frames;
 
 	/* Zap identity mapping */
-	init_level4_pgt[0] = __pgd(0);
+	init_top_pgt[0] = __pgd(0);
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Pre-constructed entries are in pfn, so convert to mfn */
 		/* L4[272] -> level3_ident_pgt
 		 * L4[511] -> level3_kernel_pgt */
-		convert_pfn_mfn(init_level4_pgt);
+		convert_pfn_mfn(init_top_pgt);
 
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
@@ -2012,11 +2012,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Copy the initial P->M table mappings if necessary. */
 	i = pgd_index(xen_start_info->mfn_list);
 	if (i && i < pgd_index(__START_KERNEL_map))
-		init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+		init_top_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
 
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
-		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+		set_page_prot(init_top_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
@@ -2027,7 +2027,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-				  PFN_DOWN(__pa_symbol(init_level4_pgt)));
+				  PFN_DOWN(__pa_symbol(init_top_pgt)));
 
 		/* Unpin Xen-provided one */
 		pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
@@ -2038,10 +2038,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		 * pgd.
 		 */
 		xen_mc_batch();
-		__xen_write_cr3(true, __pa(init_level4_pgt));
+		__xen_write_cr3(true, __pa(init_top_pgt));
 		xen_mc_issue(PARAVIRT_LAZY_CPU);
 	} else
-		native_write_cr3(__pa(init_level4_pgt));
+		native_write_cr3(__pa(init_top_pgt));
 
 	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
 	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index 5e246716d58f..e1a5fbeae08d 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
 	wrmsr
 
 	/* Enable pre-constructed page tables. */
-	mov $_pa(init_level4_pgt), %eax
+	mov $_pa(init_top_pgt), %eax
 	mov %eax, %cr3
 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
 	mov %eax, %cr0
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 0ae0bad4d4d5..7b527fa47536 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -100,11 +104,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -330,7 +337,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -343,9 +354,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -359,6 +370,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++++---
 arch/x86/include/asm/pgtable_64.h           |  2 ++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/head64.c                    | 44 +++++++++++++++++++++++++----
 arch/x86/kernel/head_64.S                   | 29 +++++++++++++++----
 5 files changed, 85 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index affcb2a9c563..2160c1fee920 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,6 +14,8 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c46e0f62024e..92935855eaaa 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -47,6 +47,7 @@ void __init __startup_64(unsigned long physaddr)
 {
 	unsigned long load_delta, *p;
 	pgdval_t *pgd;
+	p4dval_t *p4d;
 	pudval_t *pud;
 	pmdval_t *pmd, pmd_entry;
 	int i;
@@ -70,6 +71,11 @@ void __init __startup_64(unsigned long physaddr)
 	pgd = fixup_pointer(&early_top_pgt, physaddr);
 	pgd[pgd_index(__START_KERNEL_map)] += load_delta;
 
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(&level4_kernel_pgt, physaddr);
+		p4d[511] += load_delta;
+	}
+
 	pud = fixup_pointer(&level3_kernel_pgt, physaddr);
 	pud[510] += load_delta;
 	pud[511] += load_delta;
@@ -87,8 +93,18 @@ void __init __startup_64(unsigned long physaddr)
 	pud = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 	pmd = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
 
-	pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
-	pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		p4d = fixup_pointer(early_dynamic_pgts[next_early_pgt++], physaddr);
+
+		pgd[0] = (pgdval_t)p4d + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)p4d + _KERNPG_TABLE;
+
+		p4d[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		p4d[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	} else {
+		pgd[0] = (pgdval_t)pud + _KERNPG_TABLE;
+		pgd[1] = (pgdval_t)pud + _KERNPG_TABLE;
+	}
 
 	pud[0] = (pudval_t)pmd + _KERNPG_TABLE;
 	pud[1] = (pudval_t)pmd + _KERNPG_TABLE;
@@ -130,6 +146,7 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
@@ -146,8 +163,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -156,7 +190,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 0ae0bad4d4d5..7b527fa47536 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -100,11 +104,14 @@ ENTRY(secondary_startup_64)
 	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -330,7 +337,11 @@ GLOBAL(name)
 	__INITDATA
 NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -343,9 +354,9 @@ NEXT_PAGE(init_top_pgt)
 #else
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
+	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
-	.org    init_top_pgt + L4_START_KERNEL*8, 0
+	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
 
@@ -359,6 +370,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores slightly modified version of original
sync_global_pgds() which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores slightly modified version of original
sync_global_pgds() which we had before folded p4d was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a242139df8fe..0b62b13e8655 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,40 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start; address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +169,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate additional page table level if CONFIG_X86_5LEVEL is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support 5-level paging
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Populate additional page table level if CONFIG_X86_5LEVEL is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0b62b13e8655..53cd9fb5027b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -620,6 +620,57 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RAM) &&
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_RESERVED_KERN))
+				set_p4d(p4d, __p4d(0));
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -641,26 +692,26 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr), (pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging randomization happens on P4D level instead of PUD.

Maximum amount of physical memory also bumped to 52-bits for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging randomization happens on P4D level instead of PUD.

Maximum amount of physical memory also bumped to 52-bits for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/kaslr.c | 81 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index aed206475aa7..af599167fe3c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,7 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base, 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +142,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +154,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +189,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 8/9] x86: Enable 5-level paging support
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most of things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 8/9] x86: Enable 5-level paging support
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most of things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig     | 5 +++++
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e153e93273f..7a76dcac357e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -318,6 +318,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1390,6 +1391,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
 	bool "Xen guest support"
 	depends on PARAVIRT
+	depends on !X86_5LEVEL
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
 	select XEN_HAVE_VPMU
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits
  2017-04-13 11:30               ` Kirill A. Shutemov
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, linux-api

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

A high hint address would only affect the allocation in question, but not
any future mmap()s.

Specifying high hint address on older kernel or on machine without 5-level
paging support is safe. The hint will be ignored and kernel will fall back
to allocation from 47-bit address space.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..7a30513a4046 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b767c84c..8c8da27e8549 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1030,3 +1039,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits
@ 2017-04-13 11:30                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-13 11:30 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, linux-api

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

A high hint address would only affect the allocation in question, but not
any future mmap()s.

Specifying high hint address on older kernel or on machine without 5-level
paging support is safe. The hint will be ignored and kernel will fall back
to allocation from 47-bit address space.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA( extension cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/include/asm/elf.h       |  4 ++--
 arch/x86/include/asm/mpx.h       |  9 +++++++++
 arch/x86/include/asm/processor.h | 11 ++++++++---
 arch/x86/kernel/sys_x86_64.c     | 30 ++++++++++++++++++++++++++----
 arch/x86/mm/hugetlbpage.c        | 27 +++++++++++++++++++++++----
 arch/x86/mm/mmap.c               |  6 +++---
 arch/x86/mm/mpx.c                | 33 ++++++++++++++++++++++++++++++++-
 7 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..7a30513a4046 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(TASK_SIZE_LOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
@@ -304,7 +304,7 @@ static inline int mmap_is_ia32(void)
 }
 
 extern unsigned long tasksize_32bit(void);
-extern unsigned long tasksize_64bit(void);
+extern unsigned long tasksize_64bit(int full_addr_space);
 extern unsigned long get_mmap_base(int is_legacy);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
 				    unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+		unsigned long len, unsigned long flags)
+{
+	return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada998a402..aaed58b03ddb 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -795,6 +795,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	PAGE_OFFSET
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
 #define STACK_TOP		TASK_SIZE
 #define STACK_TOP_MAX		STACK_TOP
 
@@ -834,7 +835,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -842,12 +845,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
 					0xc0000000 : 0xFFFFe000)
 
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
 #define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		TASK_SIZE_LOW
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -870,7 +875,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * space during mmap's.
  */
 #define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
-#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE_LOW)
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 207b8f2582c7..74d1587b181d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,6 +21,7 @@
 #include <asm/compat.h>
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
+#include <asm/mpx.h>
 
 /*
  * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
@@ -100,8 +101,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return error;
 }
 
-static void find_start_end(unsigned long flags, unsigned long *begin,
-			   unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+		unsigned long *begin, unsigned long *end)
 {
 	if (!in_compat_syscall() && (flags & MAP_32BIT)) {
 		/* This is usually used needed to map code in small
@@ -120,7 +121,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 	}
 
 	*begin	= get_mmap_base(1);
-	*end	= in_compat_syscall() ? tasksize_32bit() : tasksize_64bit();
+	if (in_compat_syscall())
+		*end = tasksize_32bit();
+	else
+		*end = tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
 }
 
 unsigned long
@@ -132,10 +136,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (flags & MAP_FIXED)
 		return addr;
 
-	find_start_end(flags, &begin, &end);
+	find_start_end(addr, flags, &begin, &end);
 
 	if (len > end)
 		return -ENOMEM;
@@ -171,6 +179,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -195,6 +207,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 *
+	 * !in_compat_syscall() check to avoid high addresses for x32.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 302f43fd9c28..730f00250acb 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/elf.h>
+#include <asm/mpx.h>
 
 #if 0	/* This is just for testing */
 struct page *
@@ -85,25 +86,38 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = get_mmap_base(1);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
 	info.high_limit = in_compat_syscall() ?
-		tasksize_32bit() : tasksize_64bit();
+		tasksize_32bit() : tasksize_64bit(addr > DEFAULT_MAP_WINDOW);
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
 }
 
 static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
-		unsigned long addr0, unsigned long len,
+		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
-	unsigned long addr;
 
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = get_mmap_base(0);
+
+	/*
+	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+	 * in the full address space.
+	 */
+	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
+		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	addr = vm_unmapped_area(&info);
@@ -118,7 +132,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = TASK_SIZE_LOW;
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -135,6 +149,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
+
+	addr = mpx_unmapped_area_check(addr, len, flags);
+	if (IS_ERR_VALUE(addr))
+		return addr;
+
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 19ad095b41df..199050249d60 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -42,9 +42,9 @@ unsigned long tasksize_32bit(void)
 	return IA32_PAGE_OFFSET;
 }
 
-unsigned long tasksize_64bit(void)
+unsigned long tasksize_64bit(int full_addr_space)
 {
-	return TASK_SIZE_MAX;
+	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
 }
 
 static unsigned long stack_maxrandom_size(unsigned long task_size)
@@ -140,7 +140,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
 	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
-			arch_rnd(mmap64_rnd_bits), tasksize_64bit());
+			arch_rnd(mmap64_rnd_bits), tasksize_64bit(0));
 
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
 	/*
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b767c84c..8c8da27e8549 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -355,10 +355,19 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/* MPX doesn't support addresses above 47-bits yet. */
+	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
+		pr_warn_once("%s (%d): MPX cannot handle addresses "
+				"above 47-bits. Disabling.",
+				current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -1030,3 +1039,25 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (ret)
 		force_sig(SIGSEGV, current);
 }
+
+/* MPX cannot handle addresses above 47-bits yet. */
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+		unsigned long flags)
+{
+	if (!kernel_managing_mpx_tables(current->mm))
+		return addr;
+	if (addr + len <= DEFAULT_MAP_WINDOW)
+		return addr;
+	if (flags & MAP_FIXED)
+		return -ENOMEM;
+
+	/*
+	 * Requested len is larger than whole area we're allowed to map in.
+	 * Resetting hinting address wouldn't do much good -- fail early.
+	 */
+	if (len > DEFAULT_MAP_WINDOW)
+		return -ENOMEM;
+
+	/* Look for unmap area within DEFAULT_MAP_WINDOW */
+	return 0;
+}
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-12 10:18               ` Kirill A. Shutemov
@ 2017-04-17 10:32                 ` Ingo Molnar
  -1 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-17 10:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > I'll look closer (building proccess it's rather complicated), but my
> > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > with the rest of the kernel, rather included as blob, no?
> > > 
> > > Andy, may be you have an idea?
> > 
> > There isn't any way I know of to directly link them together. The ELF 
> > format wasn't designed for that. You would need to merge blobs and then use
> > manual jump vectors, like the 16bit startup code does. It would be likely
> > complicated and ugly.
> 
> Ingo, can we proceed without coverting this assembly to C?
> 
> I'm committed to convert it to C later if we'll find reasonable solution
> to the issue.

So one way to do it would be to build it standalone as a .o, then add it not to 
the regular kernel objects link target (as you found out it's not possible to link 
32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.

But there would be other complications with this approach, such as we'd have to 
add a size field and there might be symbol linking problems ...

Another, pretty hacky way would be to generate a .S from the .c, then post-process 
the .S and essentially generate today's 32-bit .S from it.

Probably not worth the trouble.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-17 10:32                 ` Ingo Molnar
  0 siblings, 0 replies; 101+ messages in thread
From: Ingo Molnar @ 2017-04-17 10:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > I'll look closer (building proccess it's rather complicated), but my
> > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > with the rest of the kernel, rather included as blob, no?
> > > 
> > > Andy, may be you have an idea?
> > 
> > There isn't any way I know of to directly link them together. The ELF 
> > format wasn't designed for that. You would need to merge blobs and then use
> > manual jump vectors, like the 16bit startup code does. It would be likely
> > complicated and ugly.
> 
> Ingo, can we proceed without coverting this assembly to C?
> 
> I'm committed to convert it to C later if we'll find reasonable solution
> to the issue.

So one way to do it would be to build it standalone as a .o, then add it not to 
the regular kernel objects link target (as you found out it's not possible to link 
32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.

But there would be other complications with this approach, such as we'd have to 
add a size field and there might be symbol linking problems ...

Another, pretty hacky way would be to generate a .S from the .c, then post-process 
the .S and essentially generate today's 32-bit .S from it.

Probably not worth the trouble.

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-17 10:32                 ` Ingo Molnar
@ 2017-04-18  8:59                   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > I'll look closer (building proccess it's rather complicated), but my
> > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > with the rest of the kernel, rather included as blob, no?
> > > > 
> > > > Andy, may be you have an idea?
> > > 
> > > There isn't any way I know of to directly link them together. The ELF 
> > > format wasn't designed for that. You would need to merge blobs and then use
> > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > complicated and ugly.
> > 
> > Ingo, can we proceed without coverting this assembly to C?
> > 
> > I'm committed to convert it to C later if we'll find reasonable solution
> > to the issue.
> 
> So one way to do it would be to build it standalone as a .o, then add it not to 
> the regular kernel objects link target (as you found out it's not possible to link 
> 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> 
> But there would be other complications with this approach, such as we'd have to 
> add a size field and there might be symbol linking problems ...
> 
> Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> the .S and essentially generate today's 32-bit .S from it.
> 
> Probably not worth the trouble.

So, do I need to do anything else to get part 4 applied?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-18  8:59                   ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > I'll look closer (building proccess it's rather complicated), but my
> > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > with the rest of the kernel, rather included as blob, no?
> > > > 
> > > > Andy, may be you have an idea?
> > > 
> > > There isn't any way I know of to directly link them together. The ELF 
> > > format wasn't designed for that. You would need to merge blobs and then use
> > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > complicated and ugly.
> > 
> > Ingo, can we proceed without coverting this assembly to C?
> > 
> > I'm committed to convert it to C later if we'll find reasonable solution
> > to the issue.
> 
> So one way to do it would be to build it standalone as a .o, then add it not to 
> the regular kernel objects link target (as you found out it's not possible to link 
> 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> 
> But there would be other complications with this approach, such as we'd have to 
> add a size field and there might be symbol linking problems ...
> 
> Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> the .S and essentially generate today's 32-bit .S from it.
> 
> Probably not worth the trouble.

So, do I need to do anything else to get part 4 applied?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-18  8:59                   ` Kirill A. Shutemov
@ 2017-04-18 10:15                     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > I'll look closer (building proccess it's rather complicated), but my
> > > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > > with the rest of the kernel, rather included as blob, no?
> > > > > 
> > > > > Andy, may be you have an idea?
> > > > 
> > > > There isn't any way I know of to directly link them together. The ELF 
> > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > > complicated and ugly.
> > > 
> > > Ingo, can we proceed without coverting this assembly to C?
> > > 
> > > I'm committed to convert it to C later if we'll find reasonable solution
> > > to the issue.
> > 
> > So one way to do it would be to build it standalone as a .o, then add it not to 
> > the regular kernel objects link target (as you found out it's not possible to link 
> > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > 
> > But there would be other complications with this approach, such as we'd have to 
> > add a size field and there might be symbol linking problems ...
> > 
> > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > the .S and essentially generate today's 32-bit .S from it.
> > 
> > Probably not worth the trouble.
> 
> So, do I need to do anything else to get part 4 applied?

Doh!

I've just realized we don't really need to enable 5-level paging in
decompression code. Leaving 4-level paging there works perfectly fine.

I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
patchset.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-18 10:15                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > I'll look closer (building proccess it's rather complicated), but my
> > > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > > with the rest of the kernel, rather included as blob, no?
> > > > > 
> > > > > Andy, may be you have an idea?
> > > > 
> > > > There isn't any way I know of to directly link them together. The ELF 
> > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > > complicated and ugly.
> > > 
> > > Ingo, can we proceed without coverting this assembly to C?
> > > 
> > > I'm committed to convert it to C later if we'll find reasonable solution
> > > to the issue.
> > 
> > So one way to do it would be to build it standalone as a .o, then add it not to 
> > the regular kernel objects link target (as you found out it's not possible to link 
> > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > 
> > But there would be other complications with this approach, such as we'd have to 
> > add a size field and there might be symbol linking problems ...
> > 
> > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > the .S and essentially generate today's 32-bit .S from it.
> > 
> > Probably not worth the trouble.
> 
> So, do I need to do anything else to get part 4 applied?

Doh!

I've just realized we don't really need to enable 5-level paging in
decompression code. Leaving 4-level paging there works perfectly fine.

I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
patchset.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
  2017-04-18 10:15                     ` Kirill A. Shutemov
@ 2017-04-18 11:10                       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 01:15:34PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > > 
> > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > > I'll look closer (building proccess it's rather complicated), but my
> > > > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > > > with the rest of the kernel, rather included as blob, no?
> > > > > > 
> > > > > > Andy, may be you have an idea?
> > > > > 
> > > > > There isn't any way I know of to directly link them together. The ELF 
> > > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > > > complicated and ugly.
> > > > 
> > > > Ingo, can we proceed without coverting this assembly to C?
> > > > 
> > > > I'm committed to convert it to C later if we'll find reasonable solution
> > > > to the issue.
> > > 
> > > So one way to do it would be to build it standalone as a .o, then add it not to 
> > > the regular kernel objects link target (as you found out it's not possible to link 
> > > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > > 
> > > But there would be other complications with this approach, such as we'd have to 
> > > add a size field and there might be symbol linking problems ...
> > > 
> > > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > > the .S and essentially generate today's 32-bit .S from it.
> > > 
> > > Probably not worth the trouble.
> > 
> > So, do I need to do anything else to get part 4 applied?
> 
> Doh!
> 
> I've just realized we don't really need to enable 5-level paging in
> decompression code. Leaving 4-level paging there works perfectly fine.
> 
> I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
> patchset.

No. This breaks KASLR. Decompression code has to use 5-level paging to
keep KASLR working.

So, v4 of part 4 is up-to-date.

Sorry for noise.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot
@ 2017-04-18 11:10                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill A. Shutemov @ 2017-04-18 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Kirill A. Shutemov, Andy Lutomirski, Linus Torvalds,
	Andrew Morton, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Hansen, linux-arch, linux-mm, linux-kernel

On Tue, Apr 18, 2017 at 01:15:34PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2017 at 11:59:26AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 17, 2017 at 12:32:25PM +0200, Ingo Molnar wrote:
> > > 
> > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > > On Tue, Apr 11, 2017 at 07:09:07AM -0700, Andi Kleen wrote:
> > > > > > I'll look closer (building proccess it's rather complicated), but my
> > > > > > understanding is that VDSO is stand-alone binary and doesn't really links
> > > > > > with the rest of the kernel, rather included as blob, no?
> > > > > > 
> > > > > > Andy, may be you have an idea?
> > > > > 
> > > > > There isn't any way I know of to directly link them together. The ELF 
> > > > > format wasn't designed for that. You would need to merge blobs and then use
> > > > > manual jump vectors, like the 16bit startup code does. It would be likely
> > > > > complicated and ugly.
> > > > 
> > > > Ingo, can we proceed without coverting this assembly to C?
> > > > 
> > > > I'm committed to convert it to C later if we'll find reasonable solution
> > > > to the issue.
> > > 
> > > So one way to do it would be to build it standalone as a .o, then add it not to 
> > > the regular kernel objects link target (as you found out it's not possible to link 
> > > 32-bit and 64-bit objects), but to link it in a manual fashion, as part of 
> > > vmlinux.bin.all-y in arch/x86/boot/compressed/Makefile.
> > > 
> > > But there would be other complications with this approach, such as we'd have to 
> > > add a size field and there might be symbol linking problems ...
> > > 
> > > Another, pretty hacky way would be to generate a .S from the .c, then post-process 
> > > the .S and essentially generate today's 32-bit .S from it.
> > > 
> > > Probably not worth the trouble.
> > 
> > So, do I need to do anything else to get part 4 applied?
> 
> Doh!
> 
> I've just realized we don't really need to enable 5-level paging in
> decompression code. Leaving 4-level paging there works perfectly fine.
> 
> I'll drop changes to arch/x86/boot/compressed/head_64.S and resubmit the
> patchset.

No. This breaks KASLR. Decompression code has to use 5-level paging to
keep KASLR working.

So, v4 of part 4 is up-to-date.

Sorry for noise.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2017-04-18 11:10 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-06 14:00 [PATCH 0/8] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
2017-04-06 14:00 ` Kirill A. Shutemov
2017-04-06 14:00 ` [PATCH 1/8] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
2017-04-06 14:00   ` Kirill A. Shutemov
2017-04-11  7:58   ` [tip:x86/mm] x86/boot/64: Rewrite startup_64() " tip-bot for Kirill A. Shutemov
2017-04-11  8:54     ` Ingo Molnar
2017-04-11 12:29       ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 2/8] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-11  7:59   ` [tip:x86/mm] x86/boot/64: Rename init_level4_pgt() and early_level4_pgt[] tip-bot for Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 3/8] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-11  7:02   ` Ingo Molnar
2017-04-11  7:02     ` Ingo Molnar
2017-04-11 10:51     ` Kirill A. Shutemov
2017-04-11 10:51       ` Kirill A. Shutemov
2017-04-11 11:28       ` Ingo Molnar
2017-04-11 11:28         ` Ingo Molnar
2017-04-11 11:46         ` Kirill A. Shutemov
2017-04-11 11:46           ` Kirill A. Shutemov
2017-04-11 14:09           ` Andi Kleen
2017-04-11 14:09             ` Andi Kleen
2017-04-12 10:18             ` Kirill A. Shutemov
2017-04-12 10:18               ` Kirill A. Shutemov
2017-04-17 10:32               ` Ingo Molnar
2017-04-17 10:32                 ` Ingo Molnar
2017-04-18  8:59                 ` Kirill A. Shutemov
2017-04-18  8:59                   ` Kirill A. Shutemov
2017-04-18 10:15                   ` Kirill A. Shutemov
2017-04-18 10:15                     ` Kirill A. Shutemov
2017-04-18 11:10                     ` Kirill A. Shutemov
2017-04-18 11:10                       ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 4/8] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 5/8] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 6/8] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-06 14:01 ` [PATCH 7/8] x86: Enable 5-level paging support Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-06 14:52   ` Juergen Gross
2017-04-06 14:52     ` Juergen Gross
2017-04-06 15:24     ` Kirill A. Shutemov
2017-04-06 15:24       ` Kirill A. Shutemov
2017-04-06 15:56       ` Juergen Gross
2017-04-06 15:56         ` Juergen Gross
2017-04-06 14:01 ` [PATCH 8/8] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
2017-04-06 14:01   ` Kirill A. Shutemov
2017-04-06 18:43   ` Dmitry Safonov
2017-04-06 18:43     ` Dmitry Safonov
2017-04-06 18:43     ` Dmitry Safonov
2017-04-06 19:15     ` Dmitry Safonov
2017-04-06 19:15       ` Dmitry Safonov
2017-04-06 19:15       ` Dmitry Safonov
2017-04-06 23:21       ` Kirill A. Shutemov
2017-04-06 23:21         ` Kirill A. Shutemov
2017-04-06 23:24         ` [PATCHv2 " Kirill A. Shutemov
2017-04-06 23:24           ` Kirill A. Shutemov
2017-04-07 11:32           ` Dmitry Safonov
2017-04-07 11:32             ` Dmitry Safonov
2017-04-07 11:32             ` Dmitry Safonov
2017-04-07 15:44             ` [PATCHv3 " Kirill A. Shutemov
2017-04-07 15:44               ` Kirill A. Shutemov
2017-04-07 16:37               ` Dmitry Safonov
2017-04-07 16:37                 ` Dmitry Safonov
2017-04-07 16:37                 ` Dmitry Safonov
2017-04-13 11:30             ` [PATCHv4 0/9] x86: 5-level paging enabling for v4.12, Part 4 Kirill A. Shutemov
2017-04-13 11:30               ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 1/9] x86/asm: Fix comment in return_from_SYSCALL_64 Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 2/9] x86/boot/64: Rewrite startup_64 in C Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 3/9] x86/boot/64: Rename init_level4_pgt and early_level4_pgt Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 4/9] x86/boot/64: Add support of additional page table level during early boot Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 5/9] x86/mm: Add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 6/9] x86/mm: Make kernel_physical_mapping_init() support " Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 7/9] x86/mm: Add support for 5-level paging for KASLR Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 8/9] x86: Enable 5-level paging support Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-13 11:30               ` [PATCHv4 9/9] x86/mm: Allow to have userspace mappings above 47-bits Kirill A. Shutemov
2017-04-13 11:30                 ` Kirill A. Shutemov
2017-04-07 10:06         ` [PATCH 8/8] " Dmitry Safonov
2017-04-07 10:06           ` Dmitry Safonov
2017-04-07 10:06           ` Dmitry Safonov
2017-04-07 13:35   ` Anshuman Khandual
2017-04-07 13:35     ` Anshuman Khandual
2017-04-07 15:59     ` Kirill A. Shutemov
2017-04-07 15:59       ` Kirill A. Shutemov
2017-04-07 16:09       ` hpa
2017-04-07 16:09         ` hpa
2017-04-07 16:20         ` Kirill A. Shutemov
2017-04-07 16:20           ` Kirill A. Shutemov
2017-04-12 10:41       ` Michael Ellerman
2017-04-12 10:41         ` Michael Ellerman
2017-04-12 11:11         ` Kirill A. Shutemov
2017-04-12 11:11           ` Kirill A. Shutemov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.