* [PATCHv2 00/29] 5-level paging
@ 2016-12-27  1:53 Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Here is v2 of the 5-level paging patchset.

Please consider applying the first 7 patches.

The main x86 portion requires more work (mostly Xen), but any feedback is
welcome.

I've also included a patch proposal to address the compatibility issue
that some software has with wide addresses.

Ingo wants to see an opt-out approach, but I believe opt-in is required
to avoid breaking userspace. That's not purely theoretical: I've seen one
crash due to the issue.

See the description of the patchset split below.

== Overview ==

x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
of physical address space. We are already bumping into this limit: some
vendors offer servers with 64 TiB of memory today.

To overcome the limitation, upcoming hardware will introduce support for
5-level paging[1]. It is a straightforward extension of the current page
table structure, adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.
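
For the record, the arithmetic behind these numbers: each level of the
page table translates 9 bits of virtual address (512 eight-byte entries
per 4k page), on top of the 12-bit page offset:

	4 levels: 4 * 9 + 12 = 48 bits -> 2^48 = 256 TiB of virtual address space
	5 levels: 5 * 9 + 12 = 57 bits -> 2^57 = 128 PiB of virtual address space

On the physical side the limit grows from 2^46 = 64 TiB to 2^52 = 4 PiB.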

== Patches ==

The patchset is built on top of v4.10-rc1.

Current QEMU upstream git supports 5-level paging. Use "-cpu qemu64,+la57"
to enable it.
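
For example, something along these lines (any -cpu model that accepts
the la57 flag should do):

	qemu-system-x86_64 -cpu qemu64,+la57 -kernel arch/x86/boot/bzImage ...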

Patch 1:
	Detect la57 feature for /proc/cpuinfo. It's trivial and can be
	applied now.

Patches 2-7:
	Bring 5-level paging to generic code and convert all
	architectures to it using <asm-generic/5level-fixup.h>.

	I think this should be ready for the -mm tree. Please consider
	applying.

Patches 8-15:
	Convert x86 to a properly folded p4d layer using
	<asm-generic/pgtable-nop4d.h>.

Patches 16-28:
	Enable real 5-level paging.

	CONFIG_X86_5LEVEL=y enables the new paging mode.

Patch 29:
	Extend the rlimit interface to set/get the maximum virtual
	address that the application can map.

	This aims to address the compatibility issue. Only x86 is
	supported for now.

	I also patched dash to add the rlimit support. It was handy for
	testing. Let me know if anybody wants to play with it.

	Comments are welcome.
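
	For illustration, a sketch of how a process could opt in to the
	wide address space, assuming RLIMIT_VADDR gets the usual
	getrlimit()/setrlimit() semantics (the constant and the defaults
	are as defined in the patch):

		#include <sys/resource.h>

		/* Raise the soft limit to the hard limit to allow wide mappings. */
		static int opt_in_to_wide_vaddr(void)
		{
			struct rlimit rlim;

			if (getrlimit(RLIMIT_VADDR, &rlim))
				return -1;
			rlim.rlim_cur = rlim.rlim_max;
			return setrlimit(RLIMIT_VADDR, &rlim);
		}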

Git:
	git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git la57/v2

== TODO ==

There is still work to do:

  - CONFIG_XEN is broken.

    Paravirt Xen MMU support hasn't yet been adjusted to work with 5-level
    paging. It's a legacy feature and I'm not sure we really need to
    support it with the new paging mode, but it blocks Xen drivers too.

    I haven't got around to setting up a testing environment for Xen, so
    it's left broken for now.

    I would appreciate help with the code.

  - Boot-time switch between 4- and 5-level paging.

    We assume that distributions will be keen to avoid returning to the
    i386 days where we shipped one kernel binary for each page table
    layout.

    As the page table format is the same for 4- and 5-level paging, it
    should be possible to have a single kernel binary and switch between
    them at boot time without too much hassle.

    For now I have only implemented a compile-time switch.

    This will be implemented in a separate patchset.
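
    Roughly, such a boot-time switch would have to pick the paging mode
    very early, before paging is enabled, along these lines (a sketch
    only; the helper names and the constant are made up here, and this is
    not what this series implements):

	if (boot_cpu_has_la57()) {	/* CPUID.(EAX=07H,ECX=0):ECX[16] */
		build_5level_page_tables();
		cr4 |= X86_CR4_LA57;	/* CR4 bit 12; can only change while CR0.PG=0 */
	} else {
		build_4level_page_tables();
	}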

== Changelog ==
  v2:
    - Rebased to v4.10-rc1;
    - RLIMIT_VADDR proposal;
    - Fix the virtual memory map and update documentation;
    - Fix a few build errors;
    - Rework cpuid helpers in boot code;
    - Fix espfix code to work with 5-level paging;

[1] https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf

Kirill A. Shutemov (29):
  x86/cpufeature: Add 5-level paging detection
  asm-generic: introduce 5level-fixup.h
  asm-generic: introduce __ARCH_USE_5LEVEL_HACK
  arch, mm: convert all architectures to use 5level-fixup.h
  asm-generic: introduce <asm-generic/pgtable-nop4d.h>
  mm: convert generic code to 5-level paging
  mm: introduce __p4d_alloc()
  x86: basic changes into headers for 5-level paging
  x86: trivial portion of 5-level paging conversion
  x86/gup: add 5-level paging support
  x86/ident_map: add 5-level paging support
  x86/mm: add support of p4d_t in vmalloc_fault()
  x86/power: support p4d_t in hibernate code
  x86/kexec: support p4d_t
  x86: convert the rest of the code to support p4d_t
  x86: detect 5-level paging support
  x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
  x86/mm: define virtual memory map for 5-level paging
  x86/paravirt: make paravirt code support 5-level paging
  x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
  x86/dump_pagetables: support 5-level paging
  x86/mm: extend kasan to support 5-level paging
  x86/espfix: support 5-level paging
  x86/mm: add support of additional page table level during early boot
  x86/mm: add sync_global_pgds() for configuration with 5-level paging
  x86/mm: make kernel_physical_mapping_init() support 5-level paging
  x86/mm: add support for 5-level paging for KASLR
  x86: enable 5-level paging support
  mm, x86: introduce RLIMIT_VADDR

 Documentation/x86/x86_64/mm.txt                  |  33 ++-
 arch/arc/include/asm/hugepage.h                  |   1 +
 arch/arc/include/asm/pgtable.h                   |   1 +
 arch/arm/include/asm/pgtable.h                   |   1 +
 arch/arm64/include/asm/pgtable-types.h           |   4 +
 arch/avr32/include/asm/pgtable-2level.h          |   1 +
 arch/cris/include/asm/pgtable.h                  |   1 +
 arch/frv/include/asm/pgtable.h                   |   1 +
 arch/h8300/include/asm/pgtable.h                 |   1 +
 arch/hexagon/include/asm/pgtable.h               |   1 +
 arch/ia64/include/asm/pgtable.h                  |   2 +
 arch/metag/include/asm/pgtable.h                 |   1 +
 arch/mips/include/asm/pgtable-32.h               |   1 +
 arch/mips/include/asm/pgtable-64.h               |   1 +
 arch/mn10300/include/asm/page.h                  |   1 +
 arch/nios2/include/asm/pgtable.h                 |   1 +
 arch/openrisc/include/asm/pgtable.h              |   1 +
 arch/powerpc/include/asm/book3s/32/pgtable.h     |   1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h     |   2 +
 arch/powerpc/include/asm/nohash/32/pgtable.h     |   1 +
 arch/powerpc/include/asm/nohash/64/pgtable-4k.h  |   3 +
 arch/powerpc/include/asm/nohash/64/pgtable-64k.h |   1 +
 arch/s390/include/asm/pgtable.h                  |   1 +
 arch/score/include/asm/pgtable.h                 |   1 +
 arch/sh/include/asm/pgtable-2level.h             |   1 +
 arch/sh/include/asm/pgtable-3level.h             |   1 +
 arch/sparc/include/asm/pgtable_64.h              |   1 +
 arch/tile/include/asm/pgtable_32.h               |   1 +
 arch/tile/include/asm/pgtable_64.h               |   1 +
 arch/um/include/asm/pgtable-2level.h             |   1 +
 arch/um/include/asm/pgtable-3level.h             |   1 +
 arch/unicore32/include/asm/pgtable.h             |   1 +
 arch/x86/Kconfig                                 |   7 +
 arch/x86/boot/compressed/head_64.S               |  23 ++-
 arch/x86/boot/cpucheck.c                         |   9 +
 arch/x86/boot/cpuflags.c                         |  12 +-
 arch/x86/entry/entry_64.S                        |   7 +-
 arch/x86/include/asm/cpufeatures.h               |   3 +-
 arch/x86/include/asm/disabled-features.h         |   8 +-
 arch/x86/include/asm/elf.h                       |   2 +-
 arch/x86/include/asm/kasan.h                     |   9 +-
 arch/x86/include/asm/kexec.h                     |   1 +
 arch/x86/include/asm/page_64_types.h             |  10 +
 arch/x86/include/asm/paravirt.h                  |  64 +++++-
 arch/x86/include/asm/paravirt_types.h            |  17 +-
 arch/x86/include/asm/pgalloc.h                   |  36 +++-
 arch/x86/include/asm/pgtable-2level_types.h      |   1 +
 arch/x86/include/asm/pgtable-3level_types.h      |   1 +
 arch/x86/include/asm/pgtable.h                   |  91 ++++++++-
 arch/x86/include/asm/pgtable_64.h                |  29 ++-
 arch/x86/include/asm/pgtable_64_types.h          |  27 +++
 arch/x86/include/asm/pgtable_types.h             |  42 +++-
 arch/x86/include/asm/processor.h                 |  17 +-
 arch/x86/include/asm/required-features.h         |   8 +-
 arch/x86/include/asm/sparsemem.h                 |   9 +-
 arch/x86/include/uapi/asm/processor-flags.h      |   2 +
 arch/x86/kernel/espfix_64.c                      |  12 +-
 arch/x86/kernel/head64.c                         |  40 +++-
 arch/x86/kernel/head_64.S                        |  58 ++++--
 arch/x86/kernel/machine_kexec_32.c               |   4 +-
 arch/x86/kernel/machine_kexec_64.c               |  14 +-
 arch/x86/kernel/paravirt.c                       |  13 +-
 arch/x86/kernel/sys_x86_64.c                     |   6 +-
 arch/x86/kernel/tboot.c                          |   6 +-
 arch/x86/kernel/vm86_32.c                        |   6 +-
 arch/x86/mm/dump_pagetables.c                    |  51 ++++-
 arch/x86/mm/fault.c                              |  57 +++++-
 arch/x86/mm/gup.c                                |  33 ++-
 arch/x86/mm/hugetlbpage.c                        |   8 +-
 arch/x86/mm/ident_map.c                          |  42 +++-
 arch/x86/mm/init_32.c                            |  22 +-
 arch/x86/mm/init_64.c                            | 248 ++++++++++++++++++++---
 arch/x86/mm/ioremap.c                            |   3 +-
 arch/x86/mm/kasan_init_64.c                      |  42 +++-
 arch/x86/mm/kaslr.c                              |  82 ++++++--
 arch/x86/mm/mmap.c                               |   4 +-
 arch/x86/mm/pageattr.c                           |  56 +++--
 arch/x86/mm/pgtable.c                            |  38 +++-
 arch/x86/mm/pgtable_32.c                         |   8 +-
 arch/x86/platform/efi/efi_64.c                   |  21 +-
 arch/x86/power/hibernate_32.c                    |   7 +-
 arch/x86/power/hibernate_64.c                    |  35 ++--
 arch/x86/realmode/init.c                         |   2 +-
 arch/x86/xen/Kconfig                             |   1 +
 arch/xtensa/include/asm/pgtable.h                |   1 +
 drivers/misc/sgi-gru/grufault.c                  |   9 +-
 fs/binfmt_aout.c                                 |   2 -
 fs/binfmt_elf.c                                  |  10 +-
 fs/hugetlbfs/inode.c                             |   6 +-
 fs/proc/base.c                                   |   1 +
 fs/userfaultfd.c                                 |   6 +-
 include/asm-generic/4level-fixup.h               |   3 +-
 include/asm-generic/5level-fixup.h               |  41 ++++
 include/asm-generic/pgtable-nop4d-hack.h         |  62 ++++++
 include/asm-generic/pgtable-nop4d.h              |  56 +++++
 include/asm-generic/pgtable-nopud.h              |  48 +++--
 include/asm-generic/pgtable.h                    |  48 ++++-
 include/asm-generic/resource.h                   |   4 +
 include/asm-generic/tlb.h                        |  14 +-
 include/linux/hugetlb.h                          |   5 +-
 include/linux/kasan.h                            |   1 +
 include/linux/mm.h                               |  34 +++-
 include/linux/sched.h                            |   5 +
 include/uapi/asm-generic/resource.h              |   3 +-
 kernel/events/uprobes.c                          |   5 +-
 kernel/sys.c                                     |   6 +-
 lib/ioremap.c                                    |  39 +++-
 mm/gup.c                                         |  46 ++++-
 mm/huge_memory.c                                 |   7 +-
 mm/hugetlb.c                                     |  29 ++-
 mm/kasan/kasan_init.c                            |  35 +++-
 mm/memory.c                                      | 230 +++++++++++++++++----
 mm/mlock.c                                       |   1 +
 mm/mmap.c                                        |  20 +-
 mm/mprotect.c                                    |  26 ++-
 mm/mremap.c                                      |  16 +-
 mm/nommu.c                                       |   2 +-
 mm/pagewalk.c                                    |  32 ++-
 mm/pgtable-generic.c                             |   6 +
 mm/rmap.c                                        |  13 +-
 mm/shmem.c                                       |   8 +-
 mm/sparse-vmemmap.c                              |  22 +-
 mm/swapfile.c                                    |  26 ++-
 mm/userfaultfd.c                                 |  23 ++-
 mm/vmalloc.c                                     |  81 ++++++--
 125 files changed, 2047 insertions(+), 410 deletions(-)
 create mode 100644 include/asm-generic/5level-fixup.h
 create mode 100644 include/asm-generic/pgtable-nop4d-hack.h
 create mode 100644 include/asm-generic/pgtable-nop4d.h

-- 
2.11.0


* [PATCHv2 01/29] x86/cpufeature: Add 5-level paging detection
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Look for 'la57' in /proc/cpuinfo to see if your machine supports 5-level
paging.
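
If you want to probe for the feature directly rather than grep
/proc/cpuinfo, a minimal userspace check (la57 is
CPUID.(EAX=07H,ECX=0):ECX[16], i.e. word 16, bit 16 below; this sketch
assumes the CPU has leaf 7 at all):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 7, subleaf 0: structured extended feature flags */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		printf("la57: %s\n", (ecx >> 16) & 1 ? "yes" : "no");
		return 0;
	}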

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index eafee3161d1c..f2f7dbfe3cb8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -288,7 +288,8 @@
 #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
 #define X86_FEATURE_PKU		(16*32+ 3) /* Protection Keys for Userspace */
 #define X86_FEATURE_OSPKE	(16*32+ 4) /* OS Protection Keys Enable */
-#define X86_FEATURE_RDPID	(16*32+ 22) /* RDPID instruction */
+#define X86_FEATURE_LA57	(16*32+16) /* 5-level page tables */
+#define X86_FEATURE_RDPID	(16*32+22) /* RDPID instruction */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (ebx), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support */
-- 
2.11.0


* [PATCHv2 02/29] asm-generic: introduce 5level-fixup.h
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

We are going to switch the core MM code to a 5-level paging abstraction.

This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header makes it possible to quickly make
all architectures compatible with 5-level paging in core MM.

In the long run we would like to switch architectures to a properly
folded p4d level using <asm-generic/pgtable-nop4d.h>, but that requires
more changes to arch-specific code.
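
For core MM, the net effect is that a 5-level-style page table walk
compiles down to the old 4-level one. Schematically (illustrative only,
with the fixup header in place):

	pgd = pgd_offset(mm, addr);
	p4d = p4d_alloc(mm, pgd, addr);	/* fixup: #define'd to (pgd), no allocation */
	pud = pud_alloc(mm, p4d, addr);	/* fixup: checks pgd_none(), calls __pud_alloc() */
	pmd = pmd_alloc(mm, pud, addr);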

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/asm-generic/4level-fixup.h |  3 ++-
 include/asm-generic/5level-fixup.h | 41 ++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h                 |  3 +++
 3 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/5level-fixup.h

diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index 5bdab6bffd23..928fd66b1271 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -15,7 +15,6 @@
 	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
  		NULL: pmd_offset(pud, address))
 
-#define pud_alloc(mm, pgd, address)	(pgd)
 #define pud_offset(pgd, start)		(pgd)
 #define pud_none(pud)			0
 #define pud_bad(pud)			0
@@ -35,4 +34,6 @@
 #undef  pud_addr_end
 #define pud_addr_end(addr, end)		(end)
 
+#include <asm-generic/5level-fixup.h>
+
 #endif
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
new file mode 100644
index 000000000000..b5ca82dc4175
--- /dev/null
+++ b/include/asm-generic/5level-fixup.h
@@ -0,0 +1,41 @@
+#ifndef _5LEVEL_FIXUP_H
+#define _5LEVEL_FIXUP_H
+
+#define __ARCH_HAS_5LEVEL_HACK
+#define __PAGETABLE_P4D_FOLDED
+
+#define P4D_SHIFT			PGDIR_SHIFT
+#define P4D_SIZE			PGDIR_SIZE
+#define P4D_MASK			PGDIR_MASK
+#define PTRS_PER_P4D			1
+
+#define p4d_t				pgd_t
+
+#define pud_alloc(mm, p4d, address) \
+	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
+		NULL : pud_offset(p4d, address))
+
+#define p4d_alloc(mm, pgd, address)	(pgd)
+#define p4d_offset(pgd, start)		(pgd)
+#define p4d_none(p4d)			0
+#define p4d_bad(p4d)			0
+#define p4d_present(p4d)		1
+#define p4d_ERROR(p4d)			do { } while (0)
+#define p4d_clear(p4d)			pgd_clear(p4d)
+#define p4d_val(p4d)			pgd_val(p4d)
+#define p4d_populate(mm, p4d, pud)	pgd_populate(mm, p4d, pud)
+#define p4d_page(p4d)			pgd_page(p4d)
+#define p4d_page_vaddr(p4d)		pgd_page_vaddr(p4d)
+
+#define __p4d(x)			__pgd(x)
+#define set_p4d(p4dp, p4d)		set_pgd(p4dp, p4d)
+
+#undef p4d_free_tlb
+#define p4d_free_tlb(tlb, x, addr)	do { } while (0)
+#define p4d_free(mm, x)			do { } while (0)
+#define __p4d_free_tlb(tlb, x, addr)	do { } while (0)
+
+#undef  p4d_addr_end
+#define p4d_addr_end(addr, end)		(end)
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fe6b4036664a..58fab8917bbe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1574,11 +1574,14 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
  * Remove it when 4level-fixup.h has been removed.
  */
 #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
+
+#ifndef __ARCH_HAS_5LEVEL_HACK
 static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 {
 	return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
 		NULL: pud_offset(pgd, address);
 }
+#endif /* !__ARCH_HAS_5LEVEL_HACK */
 
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 {
-- 
2.11.0


* [PATCHv2 03/29] asm-generic: introduce __ARCH_USE_5LEVEL_HACK
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from
pgtable-nopud.h.

If an architecture uses <asm-generic/pgtable-nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging, as it would conflict with pgtable-nop4d.h.

With this patch, an architecture can define __ARCH_USE_5LEVEL_HACK
before including <asm-generic/pgtable-nop*d.h> to get 5level-fixup.h
instead.
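
So an architecture that currently folds levels through the nop*d headers
opts in like this (this is exactly what the next patch does across the
tree):

	#define __ARCH_USE_5LEVEL_HACK
	#include <asm-generic/pgtable-nopud.h>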

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/asm-generic/pgtable-nop4d-hack.h | 62 ++++++++++++++++++++++++++++++++
 include/asm-generic/pgtable-nopud.h      |  5 +++
 2 files changed, 67 insertions(+)
 create mode 100644 include/asm-generic/pgtable-nop4d-hack.h

diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
new file mode 100644
index 000000000000..752fb7511750
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -0,0 +1,62 @@
+#ifndef _PGTABLE_NOP4D_HACK_H
+#define _PGTABLE_NOP4D_HACK_H
+
+#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
+
+#define __PAGETABLE_PUD_FOLDED
+
+/*
+ * Having the pud type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pud is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pud_t;
+
+#define PUD_SHIFT	PGDIR_SHIFT
+#define PTRS_PER_PUD	1
+#define PUD_SIZE	(1UL << PUD_SHIFT)
+#define PUD_MASK	(~(PUD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pud is never bad, and a pud always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)		{ return 0; }
+static inline int pgd_bad(pgd_t pgd)		{ return 0; }
+static inline int pgd_present(pgd_t pgd)	{ return 1; }
+static inline void pgd_clear(pgd_t *pgd)	{ }
+#define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
+
+#define pgd_populate(mm, pgd, pud)		do { } while (0)
+/*
+ * (puds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)	set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+
+static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+{
+	return (pud_t *)pgd;
+}
+
+#define pud_val(x)				(pgd_val((x).pgd))
+#define __pud(x)				((pud_t) { __pgd(x) })
+
+#define pgd_page(pgd)				(pud_page((pud_t){ pgd }))
+#define pgd_page_vaddr(pgd)			(pud_page_vaddr((pud_t){ pgd }))
+
+/*
+ * allocating and freeing a pud is trivial: the 1-entry pud is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pud_alloc_one(mm, address)		NULL
+#define pud_free(mm, x)				do { } while (0)
+#define __pud_free_tlb(tlb, x, a)		do { } while (0)
+
+#undef  pud_addr_end
+#define pud_addr_end(addr, end)			(end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 810431d8351b..5e49430a30a4 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -3,6 +3,10 @@
 
 #ifndef __ASSEMBLY__
 
+#ifdef __ARCH_USE_5LEVEL_HACK
+#include <asm-generic/pgtable-nop4d-hack.h>
+#else
+
 #define __PAGETABLE_PUD_FOLDED
 
 /*
@@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
 #define pud_addr_end(addr, end)			(end)
 
 #endif /* __ASSEMBLY__ */
+#endif /* !__ARCH_USE_5LEVEL_HACK */
 #endif /* _PGTABLE_NOPUD_H */
-- 
2.11.0


* [PATCHv2 04/29] arch, mm: convert all architectures to use 5level-fixup.h
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

If the architecture uses 4level-fixup.h, we don't need to do anything,
as it now includes 5level-fixup.h.

If the architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before including the header. This makes the asm-generic code use
5level-fixup.h.

If the architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/arc/include/asm/hugepage.h                  | 1 +
 arch/arc/include/asm/pgtable.h                   | 1 +
 arch/arm/include/asm/pgtable.h                   | 1 +
 arch/arm64/include/asm/pgtable-types.h           | 4 ++++
 arch/avr32/include/asm/pgtable-2level.h          | 1 +
 arch/cris/include/asm/pgtable.h                  | 1 +
 arch/frv/include/asm/pgtable.h                   | 1 +
 arch/h8300/include/asm/pgtable.h                 | 1 +
 arch/hexagon/include/asm/pgtable.h               | 1 +
 arch/ia64/include/asm/pgtable.h                  | 2 ++
 arch/metag/include/asm/pgtable.h                 | 1 +
 arch/mips/include/asm/pgtable-32.h               | 1 +
 arch/mips/include/asm/pgtable-64.h               | 1 +
 arch/mn10300/include/asm/page.h                  | 1 +
 arch/nios2/include/asm/pgtable.h                 | 1 +
 arch/openrisc/include/asm/pgtable.h              | 1 +
 arch/powerpc/include/asm/book3s/32/pgtable.h     | 1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h     | 2 ++
 arch/powerpc/include/asm/nohash/32/pgtable.h     | 1 +
 arch/powerpc/include/asm/nohash/64/pgtable-4k.h  | 3 +++
 arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
 arch/s390/include/asm/pgtable.h                  | 1 +
 arch/score/include/asm/pgtable.h                 | 1 +
 arch/sh/include/asm/pgtable-2level.h             | 1 +
 arch/sh/include/asm/pgtable-3level.h             | 1 +
 arch/sparc/include/asm/pgtable_64.h              | 1 +
 arch/tile/include/asm/pgtable_32.h               | 1 +
 arch/tile/include/asm/pgtable_64.h               | 1 +
 arch/um/include/asm/pgtable-2level.h             | 1 +
 arch/um/include/asm/pgtable-3level.h             | 1 +
 arch/unicore32/include/asm/pgtable.h             | 1 +
 arch/x86/include/asm/pgtable_types.h             | 4 ++++
 arch/xtensa/include/asm/pgtable.h                | 1 +
 33 files changed, 43 insertions(+)

diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 317ff773e1ca..b18fcb606908 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -11,6 +11,7 @@
 #define _ASM_ARC_HUGEPAGE_H
 
 #include <linux/types.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 static inline pte_t pmd_pte(pmd_t pmd)
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index e94ca72b974e..ee22d40afef4 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -37,6 +37,7 @@
 
 #include <asm/page.h>
 #include <asm/mmu.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 #include <linux/const.h>
 
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a8d656d9aec7..1c462381c225 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -20,6 +20,7 @@
 
 #else
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 #include <asm/memory.h>
 #include <asm/pgtable-hwdef.h>
diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h
index 69b2fd41503c..345a072b5856 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -55,9 +55,13 @@ typedef struct { pteval_t pgprot; } pgprot_t;
 #define __pgprot(x)	((pgprot_t) { (x) } )
 
 #if CONFIG_PGTABLE_LEVELS == 2
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 #elif CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
+#elif CONFIG_PGTABLE_LEVELS == 4
+#include <asm-generic/5level-fixup.h>
 #endif
 
 #endif	/* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/avr32/include/asm/pgtable-2level.h b/arch/avr32/include/asm/pgtable-2level.h
index 425dd567b5b9..d5b1c63993ec 100644
--- a/arch/avr32/include/asm/pgtable-2level.h
+++ b/arch/avr32/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
 #ifndef __ASM_AVR32_PGTABLE_2LEVEL_H
 #define __ASM_AVR32_PGTABLE_2LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 /*
diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h
index ceefc314d64d..5dcdb7d014e5 100644
--- a/arch/cris/include/asm/pgtable.h
+++ b/arch/cris/include/asm/pgtable.h
@@ -6,6 +6,7 @@
 #define _CRIS_PGTABLE_H
 
 #include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #ifndef __ASSEMBLY__
diff --git a/arch/frv/include/asm/pgtable.h b/arch/frv/include/asm/pgtable.h
index a0513d463a1f..ab6e7e961b54 100644
--- a/arch/frv/include/asm/pgtable.h
+++ b/arch/frv/include/asm/pgtable.h
@@ -16,6 +16,7 @@
 #ifndef _ASM_PGTABLE_H
 #define _ASM_PGTABLE_H
 
+#include <asm-generic/5level-fixup.h>
 #include <asm/mem-layout.h>
 #include <asm/setup.h>
 #include <asm/processor.h>
diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h
index 8341db67821d..7d265d28ba5e 100644
--- a/arch/h8300/include/asm/pgtable.h
+++ b/arch/h8300/include/asm/pgtable.h
@@ -1,5 +1,6 @@
 #ifndef _H8300_PGTABLE_H
 #define _H8300_PGTABLE_H
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 #include <asm-generic/pgtable.h>
 #define pgtable_cache_init()   do { } while (0)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index 49eab8136ec3..24a9177fb897 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -26,6 +26,7 @@
  */
 #include <linux/swap.h>
 #include <asm/page.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 /* A handy thing to have if one has the RAM. Declared in head.S */
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 9f3ed9ee8f13..67568fb8bceb 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -587,8 +587,10 @@ extern struct page *zero_page_memmap_ptr;
 
 
 #if CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 #endif
+#include <asm-generic/5level-fixup.h>
 #include <asm-generic/pgtable.h>
 
 #endif /* _ASM_IA64_PGTABLE_H */
diff --git a/arch/metag/include/asm/pgtable.h b/arch/metag/include/asm/pgtable.h
index ffa3a3a2ecad..0c151e5af079 100644
--- a/arch/metag/include/asm/pgtable.h
+++ b/arch/metag/include/asm/pgtable.h
@@ -6,6 +6,7 @@
 #define _METAG_PGTABLE_H
 
 #include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 /* Invalid regions on Meta: 0x00000000-0x001FFFFF and 0xFFFF0000-0xFFFFFFFF */
diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h
index d21f3da7bdb6..6f94bed571c4 100644
--- a/arch/mips/include/asm/pgtable-32.h
+++ b/arch/mips/include/asm/pgtable-32.h
@@ -16,6 +16,7 @@
 #include <asm/cachectl.h>
 #include <asm/fixmap.h>
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 extern int temp_tlb_entry;
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 514cbc0a6a67..130a2a6c1531 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -17,6 +17,7 @@
 #include <asm/cachectl.h>
 #include <asm/fixmap.h>
 
+#define __ARCH_USE_5LEVEL_HACK
 #if defined(CONFIG_PAGE_SIZE_64KB) && !defined(CONFIG_MIPS_VA_BITS_48)
 #include <asm-generic/pgtable-nopmd.h>
 #else
diff --git a/arch/mn10300/include/asm/page.h b/arch/mn10300/include/asm/page.h
index 3810a6f740fd..dfe730a5ede0 100644
--- a/arch/mn10300/include/asm/page.h
+++ b/arch/mn10300/include/asm/page.h
@@ -57,6 +57,7 @@ typedef struct page *pgtable_t;
 #define __pgd(x)	((pgd_t) { (x) })
 #define __pgprot(x)	((pgprot_t) { (x) })
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 298393c3cb42..db4f7d179220 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -22,6 +22,7 @@
 #include <asm/tlbflush.h>
 
 #include <asm/pgtable-bits.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #define FIRST_USER_ADDRESS	0UL
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 3567aa7be555..ff97374ca069 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -25,6 +25,7 @@
 #ifndef __ASM_OPENRISC_PGTABLE_H
 #define __ASM_OPENRISC_PGTABLE_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 012223638815..26ed228d4dc6 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 #define _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #include <asm/book3s/32/hash.h>
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5905f0ff57d1..86daf07db423 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 #define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 
+#include <asm-generic/5level-fixup.h>
+
 /*
  * Common bits between hash and Radix page table
  */
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index ba9921bf202e..5134ade2e850 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_POWERPC_NOHASH_32_PGTABLE_H
 #define _ASM_POWERPC_NOHASH_32_PGTABLE_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
index d0db98793dd8..9f4de0a1035e 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-4k.h
@@ -1,5 +1,8 @@
 #ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
 #define _ASM_POWERPC_NOHASH_64_PGTABLE_4K_H
+
+#include <asm-generic/5level-fixup.h>
+
 /*
  * Entries per page directory level.  The PTE level must use a 64b record
  * for each page table entry.  The PMD and PGD level use a 32b record for
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
index 55b28ef3409a..1facb584dd29 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable-64k.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H
 #define _ASM_POWERPC_NOHASH_64_PGTABLE_64K_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0362cd5fa187..9718066eabcb 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -24,6 +24,7 @@
  * the S390 page table tree.
  */
 #ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/page-flags.h>
diff --git a/arch/score/include/asm/pgtable.h b/arch/score/include/asm/pgtable.h
index 0553e5cd5985..46ff8fd678a7 100644
--- a/arch/score/include/asm/pgtable.h
+++ b/arch/score/include/asm/pgtable.h
@@ -2,6 +2,7 @@
 #define _ASM_SCORE_PGTABLE_H
 
 #include <linux/const.h>
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 #include <asm/fixmap.h>
diff --git a/arch/sh/include/asm/pgtable-2level.h b/arch/sh/include/asm/pgtable-2level.h
index 19bd89db17e7..f75cf4387257 100644
--- a/arch/sh/include/asm/pgtable-2level.h
+++ b/arch/sh/include/asm/pgtable-2level.h
@@ -1,6 +1,7 @@
 #ifndef __ASM_SH_PGTABLE_2LEVEL_H
 #define __ASM_SH_PGTABLE_2LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 /*
diff --git a/arch/sh/include/asm/pgtable-3level.h b/arch/sh/include/asm/pgtable-3level.h
index 249a985d9648..9b1e776eca31 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -1,6 +1,7 @@
 #ifndef __ASM_SH_PGTABLE_3LEVEL_H
 #define __ASM_SH_PGTABLE_3LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 /*
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 314b66851348..f3472eff47bf 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -12,6 +12,7 @@
  * the SpitFire page tables.
  */
 
+#include <asm-generic/5level-fixup.h>
 #include <linux/compiler.h>
 #include <linux/const.h>
 #include <asm/types.h>
diff --git a/arch/tile/include/asm/pgtable_32.h b/arch/tile/include/asm/pgtable_32.h
index d26a42279036..5f8c615cb5e9 100644
--- a/arch/tile/include/asm/pgtable_32.h
+++ b/arch/tile/include/asm/pgtable_32.h
@@ -74,6 +74,7 @@ extern unsigned long VMALLOC_RESERVE /* = CONFIG_VMALLOC_RESERVE */;
 #define MAXMEM		(_VMALLOC_START - PAGE_OFFSET)
 
 /* We have no pmd or pud since we are strictly a two-level page table */
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 static inline int pud_huge_page(pud_t pud)	{ return 0; }
diff --git a/arch/tile/include/asm/pgtable_64.h b/arch/tile/include/asm/pgtable_64.h
index e96cec52f6d8..96fe58b45118 100644
--- a/arch/tile/include/asm/pgtable_64.h
+++ b/arch/tile/include/asm/pgtable_64.h
@@ -59,6 +59,7 @@
 #ifndef __ASSEMBLY__
 
 /* We have no pud since we are a three-level page table. */
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 /*
diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h
index cfbe59752469..179c0ea87a0c 100644
--- a/arch/um/include/asm/pgtable-2level.h
+++ b/arch/um/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
 #ifndef __UM_PGTABLE_2LEVEL_H
 #define __UM_PGTABLE_2LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 /* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index bae8523a162f..c4d876dfb9ac 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -7,6 +7,7 @@
 #ifndef __UM_PGTABLE_3LEVEL_H
 #define __UM_PGTABLE_3LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 /* PGDIR_SHIFT determines what a third-level page table entry can map */
diff --git a/arch/unicore32/include/asm/pgtable.h b/arch/unicore32/include/asm/pgtable.h
index 818d0f5598e3..a4f2bef37e70 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -12,6 +12,7 @@
 #ifndef __UNICORE_PGTABLE_H__
 #define __UNICORE_PGTABLE_H__
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 #include <asm/cpu-single.h>
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8b4de22d6429..62484333673d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,6 +273,8 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
+#include <asm-generic/5level-fixup.h>
+
 typedef struct { pudval_t pud; } pud_t;
 
 static inline pud_t native_make_pud(pmdval_t val)
@@ -285,6 +287,7 @@ static inline pudval_t native_pud_val(pud_t pud)
 	return pud.pud;
 }
 #else
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 static inline pudval_t native_pud_val(pud_t pud)
@@ -306,6 +309,7 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
 	return pmd.pmd;
 }
 #else
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 static inline pmdval_t native_pmd_val(pmd_t pmd)
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index 8aa0e0d9cbb2..30dd5b2e4ad5 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -11,6 +11,7 @@
 #ifndef _XTENSA_PGTABLE_H
 #define _XTENSA_PGTABLE_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 #include <asm/page.h>
 #include <asm/kmem_layout.h>
-- 
2.11.0


* [PATCHv2 05/29] asm-generic: introduce <asm-generic/pgtable-nop4d.h>
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

As with pgtable-nopud.h for 4-level paging, this new header is the base
for converting an architecture to a properly folded p4d_t level.
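
With this header the p4d level exists in the type system but costs
nothing: PTRS_PER_P4D is 1, p4d_offset() is just a cast and
p4d_addr_end() collapses to the end of the range, so generic loops over
the p4d level run exactly once. For illustration:

	p4d_t *p4d = p4d_offset(pgd, addr);	/* just (p4d_t *)pgd */
	next = p4d_addr_end(addr, end);		/* folded level: always 'end' */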

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/asm-generic/pgtable-nop4d.h | 56 +++++++++++++++++++++++++++++++++++++
 include/asm-generic/pgtable-nopud.h | 43 ++++++++++++++--------------
 include/asm-generic/tlb.h           | 14 ++++++++--
 3 files changed, 89 insertions(+), 24 deletions(-)
 create mode 100644 include/asm-generic/pgtable-nop4d.h

diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
new file mode 100644
index 000000000000..de364ecb8df6
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -0,0 +1,56 @@
+#ifndef _PGTABLE_NOP4D_H
+#define _PGTABLE_NOP4D_H
+
+#ifndef __ASSEMBLY__
+
+#define __PAGETABLE_P4D_FOLDED
+
+typedef struct { pgd_t pgd; } p4d_t;
+
+#define P4D_SHIFT	PGDIR_SHIFT
+#define PTRS_PER_P4D	1
+#define P4D_SIZE	(1UL << P4D_SHIFT)
+#define P4D_MASK	(~(P4D_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the p4d is never bad, and a p4d always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)		{ return 0; }
+static inline int pgd_bad(pgd_t pgd)		{ return 0; }
+static inline int pgd_present(pgd_t pgd)	{ return 1; }
+static inline void pgd_clear(pgd_t *pgd)	{ }
+#define p4d_ERROR(p4d)				(pgd_ERROR((p4d).pgd))
+
+#define pgd_populate(mm, pgd, p4d)		do { } while (0)
+/*
+ * (p4ds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)	set_p4d((p4d_t *)(pgdptr), (p4d_t) { pgdval })
+
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+{
+	return (p4d_t *)pgd;
+}
+
+#define p4d_val(x)				(pgd_val((x).pgd))
+#define __p4d(x)				((p4d_t) { __pgd(x) })
+
+#define pgd_page(pgd)				(p4d_page((p4d_t){ pgd }))
+#define pgd_page_vaddr(pgd)			(p4d_page_vaddr((p4d_t){ pgd }))
+
+/*
+ * allocating and freeing a p4d is trivial: the 1-entry p4d is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define p4d_alloc_one(mm, address)		NULL
+#define p4d_free(mm, x)				do { } while (0)
+#define __p4d_free_tlb(tlb, x, a)		do { } while (0)
+
+#undef  p4d_addr_end
+#define p4d_addr_end(addr, end)			(end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 5e49430a30a4..c2b9b96d6268 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -6,53 +6,54 @@
 #ifdef __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nop4d-hack.h>
 #else
+#include <asm-generic/pgtable-nop4d.h>
 
 #define __PAGETABLE_PUD_FOLDED
 
 /*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
+ * Having the pud type consist of a p4d gets the size right, and allows
+ * us to conceptually access the p4d entry that this pud is folded into
  * without casting.
  */
-typedef struct { pgd_t pgd; } pud_t;
+typedef struct { p4d_t p4d; } pud_t;
 
-#define PUD_SHIFT	PGDIR_SHIFT
+#define PUD_SHIFT	P4D_SHIFT
 #define PTRS_PER_PUD	1
 #define PUD_SIZE  	(1UL << PUD_SHIFT)
 #define PUD_MASK  	(~(PUD_SIZE-1))
 
 /*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * The "p4d_xxx()" functions here are trivial for a folded two-level
  * setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
+ * into the p4d entry)
  */
-static inline int pgd_none(pgd_t pgd)		{ return 0; }
-static inline int pgd_bad(pgd_t pgd)		{ return 0; }
-static inline int pgd_present(pgd_t pgd)	{ return 1; }
-static inline void pgd_clear(pgd_t *pgd)	{ }
-#define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
+static inline int p4d_none(p4d_t p4d)		{ return 0; }
+static inline int p4d_bad(p4d_t p4d)		{ return 0; }
+static inline int p4d_present(p4d_t p4d)	{ return 1; }
+static inline void p4d_clear(p4d_t *p4d)	{ }
+#define pud_ERROR(pud)				(p4d_ERROR((pud).p4d))
 
-#define pgd_populate(mm, pgd, pud)		do { } while (0)
+#define p4d_populate(mm, p4d, pud)		do { } while (0)
 /*
- * (puds are folded into pgds so this doesn't get actually called,
+ * (puds are folded into p4ds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
  */
-#define set_pgd(pgdptr, pgdval)			set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+#define set_p4d(p4dptr, p4dval)	set_pud((pud_t *)(p4dptr), (pud_t) { p4dval })
 
-static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 {
-	return (pud_t *)pgd;
+	return (pud_t *)p4d;
 }
 
-#define pud_val(x)				(pgd_val((x).pgd))
-#define __pud(x)				((pud_t) { __pgd(x) } )
+#define pud_val(x)				(p4d_val((x).p4d))
+#define __pud(x)				((pud_t) { __p4d(x) })
 
-#define pgd_page(pgd)				(pud_page((pud_t){ pgd }))
-#define pgd_page_vaddr(pgd)			(pud_page_vaddr((pud_t){ pgd }))
+#define p4d_page(p4d)				(pud_page((pud_t){ p4d }))
+#define p4d_page_vaddr(p4d)			(pud_page_vaddr((pud_t){ p4d }))
 
 /*
  * allocating and freeing a pud is trivial: the 1-entry pud is
- * inside the pgd, so has no extra memory associated with it.
+ * inside the p4d, so has no extra memory associated with it.
  */
 #define pud_alloc_one(mm, address)		NULL
 #define pud_free(mm, x)				do { } while (0)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 7eed8cf3130a..a6b51b1a7b7f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -256,6 +256,12 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 
+#define pmd_free_tlb(tlb, pmdp, address)			\
+	do {							\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__pmd_free_tlb(tlb, pmdp, address);		\
+	} while (0)
+
 #ifndef __ARCH_HAS_4LEVEL_HACK
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
@@ -264,11 +270,13 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 	} while (0)
 #endif
 
-#define pmd_free_tlb(tlb, pmdp, address)			\
+#ifndef __ARCH_HAS_5LEVEL_HACK
+#define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		__pmd_free_tlb(tlb, pmdp, address);		\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
+#endif
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
-- 
2.11.0


* [PATCHv2 06/29] mm: convert generic code to 5-level paging
@ 2016-12-27  1:53 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical: add handling of one more page table level in
places where we deal with pud_t.
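
The recurring pattern is to insert a p4d step between the pgd and pud
levels. A schematic range walker (illustrative; walk_pud_range() stands
in for whatever the caller does per pud range):

	p4d_t *p4d = p4d_offset(pgd, addr);
	unsigned long next;

	do {
		next = p4d_addr_end(addr, end);
		if (p4d_none_or_clear_bad(p4d))
			continue;
		walk_pud_range(p4d, addr, next);
	} while (p4d++, addr = next, addr != end);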

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/misc/sgi-gru/grufault.c |   9 +-
 fs/userfaultfd.c                |   6 +-
 include/asm-generic/pgtable.h   |  48 +++++++++-
 include/linux/hugetlb.h         |   5 +-
 include/linux/kasan.h           |   1 +
 include/linux/mm.h              |  31 ++++--
 lib/ioremap.c                   |  39 +++++++-
 mm/gup.c                        |  46 +++++++--
 mm/huge_memory.c                |   7 +-
 mm/hugetlb.c                    |  29 +++---
 mm/kasan/kasan_init.c           |  35 ++++++-
 mm/memory.c                     | 207 +++++++++++++++++++++++++++++++++-------
 mm/mlock.c                      |   1 +
 mm/mprotect.c                   |  26 ++++-
 mm/mremap.c                     |  13 ++-
 mm/pagewalk.c                   |  32 ++++++-
 mm/pgtable-generic.c            |   6 ++
 mm/rmap.c                       |  13 ++-
 mm/sparse-vmemmap.c             |  22 ++++-
 mm/swapfile.c                   |  26 ++++-
 mm/userfaultfd.c                |  23 +++--
 mm/vmalloc.c                    |  81 ++++++++++++----
 22 files changed, 586 insertions(+), 120 deletions(-)

diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 6fb773dbcd0c..93be82fc338a 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -219,15 +219,20 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr,
 	int write, unsigned long *paddr, int *pageshift)
 {
 	pgd_t *pgdp;
-	pmd_t *pmdp;
+	p4d_t *p4dp;
 	pud_t *pudp;
+	pmd_t *pmdp;
 	pte_t pte;
 
 	pgdp = pgd_offset(vma->vm_mm, vaddr);
 	if (unlikely(pgd_none(*pgdp)))
 		goto err;
 
-	pudp = pud_offset(pgdp, vaddr);
+	p4dp = p4d_offset(pgdp, vaddr);
+	if (unlikely(p4d_none(*p4dp)))
+		goto err;
+
+	pudp = pud_offset(p4dp, vaddr);
 	if (unlikely(pud_none(*pudp)))
 		goto err;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d96e2f30084b..20e04bd11654 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -195,6 +195,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 {
 	struct mm_struct *mm = ctx->mm;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd, _pmd;
 	pte_t *pte;
@@ -205,7 +206,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
 		goto out;
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		goto out;
+	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		goto out;
 	pmd = pmd_offset(pud, address);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 18af2bcefe6a..dc6a54318cf5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,9 +10,9 @@
 #include <linux/bug.h>
 #include <linux/errno.h>
 
-#if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \
-	CONFIG_PGTABLE_LEVELS
-#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{PUD,PMD}_FOLDED
+#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
+	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
+#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
 #endif
 
 /*
@@ -339,6 +339,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 
+#ifndef p4d_addr_end
+#define p4d_addr_end(addr, end)						\
+({	unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;	\
+	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
+})
+#endif
+
 #ifndef pud_addr_end
 #define pud_addr_end(addr, end)						\
 ({	unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;	\
@@ -359,6 +366,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
  * Do the tests inline, but report and clear the bad entry in mm/memory.c.
  */
 void pgd_clear_bad(pgd_t *);
+void p4d_clear_bad(p4d_t *);
 void pud_clear_bad(pud_t *);
 void pmd_clear_bad(pmd_t *);
 
@@ -373,6 +381,17 @@ static inline int pgd_none_or_clear_bad(pgd_t *pgd)
 	return 0;
 }
 
+static inline int p4d_none_or_clear_bad(p4d_t *p4d)
+{
+	if (p4d_none(*p4d))
+		return 1;
+	if (unlikely(p4d_bad(*p4d))) {
+		p4d_clear_bad(p4d);
+		return 1;
+	}
+	return 0;
+}
+
 static inline int pud_none_or_clear_bad(pud_t *pud)
 {
 	if (pud_none(*pud))
@@ -750,11 +769,30 @@ static inline int pmd_protnone(pmd_t pmd)
 #endif /* CONFIG_MMU */
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+
+#ifndef __PAGETABLE_P4D_FOLDED
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot);
+int p4d_clear_huge(p4d_t *p4d);
+#else
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+	return 0;
+}
+#endif /* !__PAGETABLE_P4D_FOLDED */
+
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
 int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
 int pud_clear_huge(pud_t *pud);
 int pmd_clear_huge(pmd_t *pmd);
 #else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
 static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
 	return 0;
@@ -763,6 +801,10 @@ static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 {
 	return 0;
 }
+static inline int p4d_clear_huge(p4d_t *p4d)
+{
+	return 0;
+}
 static inline int pud_clear_huge(pud_t *pud)
 {
 	return 0;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 48c76d612d40..1b45ac594b54 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -116,7 +116,7 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 				pud_t *pud, int flags);
 int pmd_huge(pmd_t pmd);
-int pud_huge(pud_t pmd);
+int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
@@ -189,6 +189,9 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 #ifndef pgd_huge
 #define pgd_huge(x)	0
 #endif
+#ifndef p4d_huge
+#define p4d_huge(x)	0
+#endif
 
 #ifndef pgd_write
 static inline int pgd_write(pgd_t pgd)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 820c0ad54a01..8e52fe609220 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -19,6 +19,7 @@ extern unsigned char kasan_zero_page[PAGE_SIZE];
 extern pte_t kasan_zero_pte[PTRS_PER_PTE];
 extern pmd_t kasan_zero_pmd[PTRS_PER_PMD];
 extern pud_t kasan_zero_pud[PTRS_PER_PUD];
+extern p4d_t kasan_zero_p4d[PTRS_PER_P4D];
 
 void kasan_populate_zero_shadow(const void *shadow_start,
 				const void *shadow_end);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 58fab8917bbe..9c0ef57e5f56 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1515,14 +1515,24 @@ static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 	return ptep;
 }
 
+#ifdef __PAGETABLE_P4D_FOLDED
+static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+						unsigned long address)
+{
+	return 0;
+}
+#else
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+#endif
+
 #ifdef __PAGETABLE_PUD_FOLDED
-static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd,
+static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
 						unsigned long address)
 {
 	return 0;
 }
 #else
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);
 #endif
 
 #if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
@@ -1576,10 +1586,18 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
 
 #ifndef __ARCH_HAS_5LEVEL_HACK
-static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+		unsigned long address)
+{
+	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
+		NULL : p4d_offset(pgd, address);
+}
+
+static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
+		unsigned long address)
 {
-	return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
-		NULL: pud_offset(pgd, address);
+	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
+		NULL : pud_offset(p4d, address);
 }
 #endif /* !__ARCH_HAS_5LEVEL_HACK */
 
@@ -2318,7 +2336,8 @@ void sparse_mem_maps_populate_node(struct page **map_map,
 
 struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
-pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
+p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
+pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 86c8911b0e3a..5629eeaba5ae 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -14,6 +14,7 @@
 #include <asm/pgtable.h>
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static int __read_mostly ioremap_p4d_capable;
 static int __read_mostly ioremap_pud_capable;
 static int __read_mostly ioremap_pmd_capable;
 static int __read_mostly ioremap_huge_disabled;
@@ -35,6 +36,11 @@ void __init ioremap_huge_init(void)
 	}
 }
 
+static inline int ioremap_p4d_enabled(void)
+{
+	return ioremap_p4d_capable;
+}
+
 static inline int ioremap_pud_enabled(void)
 {
 	return ioremap_pud_capable;
@@ -46,6 +52,7 @@ static inline int ioremap_pmd_enabled(void)
 }
 
 #else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int ioremap_p4d_enabled(void) { return 0; }
 static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
@@ -94,14 +101,14 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	return 0;
 }
 
-static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
+static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {
 	pud_t *pud;
 	unsigned long next;
 
 	phys_addr -= addr;
-	pud = pud_alloc(&init_mm, pgd, addr);
+	pud = pud_alloc(&init_mm, p4d, addr);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -120,6 +127,32 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
 	return 0;
 }
 
+static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
+		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	phys_addr -= addr;
+	p4d = p4d_alloc(&init_mm, pgd, addr);
+	if (!p4d)
+		return -ENOMEM;
+	do {
+		next = p4d_addr_end(addr, end);
+
+		if (ioremap_p4d_enabled() &&
+		    ((next - addr) == P4D_SIZE) &&
+		    IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
+			if (p4d_set_huge(p4d, phys_addr + addr, prot))
+				continue;
+		}
+
+		if (ioremap_pud_range(p4d, addr, next, phys_addr + addr, prot))
+			return -ENOMEM;
+	} while (p4d++, addr = next, addr != end);
+	return 0;
+}
+
 int ioremap_page_range(unsigned long addr,
 		       unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {
@@ -135,7 +168,7 @@ int ioremap_page_range(unsigned long addr,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = ioremap_pud_range(pgd, addr, next, phys_addr+addr, prot);
+		err = ioremap_p4d_range(pgd, addr, next, phys_addr+addr, prot);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
diff --git a/mm/gup.c b/mm/gup.c
index 55315555489d..4660b7ee2088 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,6 +226,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 			      unsigned int *page_mask)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	spinlock_t *ptl;
@@ -243,8 +244,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	pgd = pgd_offset(mm, address);
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
 		return no_page_table(vma, flags);
-
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d))
+		return no_page_table(vma, flags);
+	BUILD_BUG_ON(p4d_huge(*p4d));
+	if (unlikely(p4d_bad(*p4d)))
+		return no_page_table(vma, flags);
+	pud = pud_offset(p4d, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
 	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
@@ -317,6 +323,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		struct page **page)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -330,7 +337,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	else
 		pgd = pgd_offset_gate(mm, address);
 	BUG_ON(pgd_none(*pgd));
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	BUG_ON(p4d_none(*p4d));
+	pud = pud_offset(p4d, address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
@@ -1392,13 +1401,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 	return 1;
 }
 
-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			 int write, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&pgd, addr);
+	pudp = pud_offset(&p4d, addr);
 	do {
 		pud_t pud = READ_ONCE(*pudp);
 
@@ -1420,6 +1429,31 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	return 1;
 }
 
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	p4d_t *p4dp;
+
+	p4dp = p4d_offset(&pgd, addr);
+	do {
+		p4d_t p4d = READ_ONCE(*p4dp);
+
+		next = p4d_addr_end(addr, end);
+		if (p4d_none(p4d))
+			return 0;
+		BUILD_BUG_ON(p4d_huge(p4d));
+		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
+			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
+					 P4D_SHIFT, next, write, pages, nr))
+				return 0;
+		} else if (!gup_p4d_range(p4d, addr, next, write, pages, nr))
+			return 0;
+	} while (p4dp++, addr = next, addr != end);
+
+	return 1;
+}
+
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
@@ -1470,7 +1504,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
 					 PGDIR_SHIFT, next, write, pages, &nr))
 				break;
-		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		} else if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
 			break;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf14421..37778ab8de40 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1778,6 +1778,7 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze, struct page *page)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
@@ -1785,7 +1786,11 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 	if (!pgd_present(*pgd))
 		return;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return;
+
+	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		return;
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3edb759c5c7d..2359da2264e7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4366,7 +4366,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	pgd_t *pgd = pgd_offset(mm, *addr);
-	pud_t *pud = pud_offset(pgd, *addr);
+	p4d_t *p4d = p4d_offset(pgd, *addr);
+	pud_t *pud = pud_offset(p4d, *addr);
 
 	BUG_ON(page_count(virt_to_page(ptep)) == 0);
 	if (page_count(virt_to_page(ptep)) == 1)
@@ -4397,11 +4398,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 			unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	p4d = p4d_offset(pgd, addr);
+	pud = pud_alloc(mm, p4d, addr);
 	if (pud) {
 		if (sz == PUD_SIZE) {
 			pte = (pte_t *)pud;
@@ -4421,18 +4424,22 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
-	pmd_t *pmd = NULL;
+	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	if (pgd_present(*pgd)) {
-		pud = pud_offset(pgd, addr);
-		if (pud_present(*pud)) {
-			if (pud_huge(*pud))
-				return (pte_t *)pud;
-			pmd = pmd_offset(pud, addr);
-		}
-	}
+	if (!pgd_present(*pgd))
+		return NULL;
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return NULL;
+	pud = pud_offset(p4d, addr);
+	if (!pud_present(*pud))
+		return NULL;
+	if (pud_huge(*pud))
+		return (pte_t *)pud;
+	pmd = pmd_offset(pud, addr);
 	return (pte_t *) pmd;
 }
 
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index 3f9a41cf0ac6..0d4ee78796fc 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -29,6 +29,9 @@
  */
 unsigned char kasan_zero_page[PAGE_SIZE] __page_aligned_bss;
 
+#if CONFIG_PGTABLE_LEVELS > 4
+p4d_t kasan_zero_p4d[PTRS_PER_P4D] __page_aligned_bss;
+#endif
 #if CONFIG_PGTABLE_LEVELS > 3
 pud_t kasan_zero_pud[PTRS_PER_PUD] __page_aligned_bss;
 #endif
@@ -81,10 +84,10 @@ static void __init zero_pmd_populate(pud_t *pud, unsigned long addr,
 	} while (pmd++, addr = next, addr != end);
 }
 
-static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
+static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr,
 				unsigned long end)
 {
-	pud_t *pud = pud_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
 	unsigned long next;
 
 	do {
@@ -106,6 +109,23 @@ static void __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
 	} while (pud++, addr = next, addr != end);
 }
 
+static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
+				unsigned long end)
+{
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	unsigned long next;
+
+	do {
+		next = p4d_addr_end(addr, end);
+
+		if (p4d_none(*p4d)) {
+			p4d_populate(&init_mm, p4d,
+				early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+		}
+		zero_pud_populate(p4d, addr, next);
+	} while (p4d++, addr = next, addr != end);
+}
+
 /**
  * kasan_populate_zero_shadow - populate shadow memory region with
  *                               kasan_zero_page
@@ -124,6 +144,7 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
 		next = pgd_addr_end(addr, end);
 
 		if (IS_ALIGNED(addr, PGDIR_SIZE) && end - addr >= PGDIR_SIZE) {
+			p4d_t *p4d;
 			pud_t *pud;
 			pmd_t *pmd;
 
@@ -135,8 +156,12 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
 			 * puds,pmds, so pgd_populate(), pud_populate()
 			 * is noops.
 			 */
-			pgd_populate(&init_mm, pgd, kasan_zero_pud);
-			pud = pud_offset(pgd, addr);
+#ifndef __ARCH_HAS_5LEVEL_HACK
+			pgd_populate(&init_mm, pgd, kasan_zero_p4d);
+#endif
+			p4d = p4d_offset(pgd, addr);
+			p4d_populate(&init_mm, p4d, kasan_zero_pud);
+			pud = pud_offset(p4d, addr);
 			pud_populate(&init_mm, pud, kasan_zero_pmd);
 			pmd = pmd_offset(pud, addr);
 			pmd_populate_kernel(&init_mm, pmd, kasan_zero_pte);
@@ -147,6 +172,6 @@ void __init kasan_populate_zero_shadow(const void *shadow_start,
 			pgd_populate(&init_mm, pgd,
 				early_alloc(PAGE_SIZE, NUMA_NO_NODE));
 		}
-		zero_pud_populate(pgd, addr, next);
+		zero_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
diff --git a/mm/memory.c b/mm/memory.c
index 7d23b5050248..267cee723aa1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -442,7 +442,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	mm_dec_nr_pmds(tlb->mm);
 }
 
-static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
 				unsigned long floor, unsigned long ceiling)
 {
@@ -451,7 +451,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 	unsigned long start;
 
 	start = addr;
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
@@ -459,6 +459,39 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
 	} while (pud++, addr = next, addr != end);
 
+	start &= P4D_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= P4D_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+
+	pud = pud_offset(p4d, start);
+	p4d_clear(p4d);
+	pud_free_tlb(tlb, pud, start);
+}
+
+static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
+				unsigned long addr, unsigned long end,
+				unsigned long floor, unsigned long ceiling)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	unsigned long start;
+
+	start = addr;
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+	} while (p4d++, addr = next, addr != end);
+
 	start &= PGDIR_MASK;
 	if (start < floor)
 		return;
@@ -470,9 +503,9 @@ static inline void free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 	if (end - 1 > ceiling - 1)
 		return;
 
-	pud = pud_offset(pgd, start);
+	p4d = p4d_offset(pgd, start);
 	pgd_clear(pgd);
-	pud_free_tlb(tlb, pud, start);
+	p4d_free_tlb(tlb, p4d, start);
 }
 
 /*
@@ -536,7 +569,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
 	} while (pgd++, addr = next, addr != end);
 }
 
@@ -655,7 +688,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 			  pte_t pte, struct page *page)
 {
 	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
-	pud_t *pud = pud_offset(pgd, addr);
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
 	pmd_t *pmd = pmd_offset(pud, addr);
 	struct address_space *mapping;
 	pgoff_t index;
@@ -1020,16 +1054,16 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
 
-	dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+	dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
 	if (!dst_pud)
 		return -ENOMEM;
-	src_pud = pud_offset(src_pgd, addr);
+	src_pud = pud_offset(src_p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(src_pud))
@@ -1041,6 +1075,28 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 	return 0;
 }
 
+static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		unsigned long addr, unsigned long end)
+{
+	p4d_t *src_p4d, *dst_p4d;
+	unsigned long next;
+
+	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+	if (!dst_p4d)
+		return -ENOMEM;
+	src_p4d = p4d_offset(src_pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(src_p4d))
+			continue;
+		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
+						vma, addr, next))
+			return -ENOMEM;
+	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
+	return 0;
+}
+
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		struct vm_area_struct *vma)
 {
@@ -1096,7 +1152,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
 					    vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
@@ -1259,14 +1315,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 }
 
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
-				struct vm_area_struct *vma, pgd_t *pgd,
+				struct vm_area_struct *vma, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
@@ -1277,6 +1333,25 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
+static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pgd_t *pgd,
+				unsigned long addr, unsigned long end,
+				struct zap_details *details)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
+	} while (p4d++, addr = next, addr != end);
+
+	return addr;
+}
+
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
@@ -1292,7 +1367,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
+		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -1448,16 +1523,24 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
 pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 			spinlock_t **ptl)
 {
-	pgd_t * pgd = pgd_offset(mm, addr);
-	pud_t * pud = pud_alloc(mm, pgd, addr);
-	if (pud) {
-		pmd_t * pmd = pmd_alloc(mm, pud, addr);
-		if (pmd) {
-			VM_BUG_ON(pmd_trans_huge(*pmd));
-			return pte_alloc_map_lock(mm, pmd, addr, ptl);
-		}
-	}
-	return NULL;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return NULL;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return NULL;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+
+	VM_BUG_ON(pmd_trans_huge(*pmd));
+	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
 /*
@@ -1723,7 +1806,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	return 0;
 }
 
-static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
@@ -1731,7 +1814,7 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
 	unsigned long next;
 
 	pfn -= addr >> PAGE_SHIFT;
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, p4d, addr);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -1743,6 +1826,26 @@ static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
 	return 0;
 }
 
+static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+			unsigned long addr, unsigned long end,
+			unsigned long pfn, pgprot_t prot)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	pfn -= addr >> PAGE_SHIFT;
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -ENOMEM;
+	do {
+		next = p4d_addr_end(addr, end);
+		if (remap_pud_range(mm, p4d, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot))
+			return -ENOMEM;
+	} while (p4d++, addr = next, addr != end);
+	return 0;
+}
+
 /**
  * remap_pfn_range - remap kernel memory to userspace
  * @vma: user vma to map to
@@ -1799,7 +1902,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	flush_cache_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = remap_pud_range(mm, pgd, addr, next,
+		err = remap_p4d_range(mm, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
 			break;
@@ -1915,7 +2018,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 	return err;
 }
 
-static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 				     unsigned long addr, unsigned long end,
 				     pte_fn_t fn, void *data)
 {
@@ -1923,7 +2026,7 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
 	unsigned long next;
 	int err;
 
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, p4d, addr);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -1935,6 +2038,26 @@ static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
 	return err;
 }
 
+static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+				     unsigned long addr, unsigned long end,
+				     pte_fn_t fn, void *data)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int err;
+
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -ENOMEM;
+	do {
+		next = p4d_addr_end(addr, end);
+		err = apply_to_pud_range(mm, p4d, addr, next, fn, data);
+		if (err)
+			break;
+	} while (p4d++, addr = next, addr != end);
+	return err;
+}
+
 /*
  * Scan a region of virtual memory, filling in page tables as necessary
  * and calling a provided function on each leaf page table.
@@ -1953,7 +2076,7 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
@@ -3622,10 +3745,14 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	};
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 
 	pgd = pgd_offset(mm, address);
-	pud = pud_alloc(mm, pgd, address);
+	p4d = p4d_alloc(mm, pgd, address);
+	if (!p4d)
+		return VM_FAULT_OOM;
+	pud = pud_alloc(mm, p4d, address);
 	if (!pud)
 		return VM_FAULT_OOM;
 	vmf.pmd = pmd_alloc(mm, pud, address);
@@ -3729,7 +3856,7 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
  * Allocate page upper directory.
  * We've already handled the fast-path in-line.
  */
-int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
 {
 	pud_t *new = pud_alloc_one(mm, address);
 	if (!new)
@@ -3738,10 +3865,17 @@ int __pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&mm->page_table_lock);
-	if (pgd_present(*pgd))		/* Another has populated it */
+#ifndef __ARCH_HAS_5LEVEL_HACK
+	if (p4d_present(*p4d))		/* Another has populated it */
+		pud_free(mm, new);
+	else
+		p4d_populate(mm, p4d, new);
+#else
+	if (pgd_present(*p4d))		/* Another has populated it */
 		pud_free(mm, new);
 	else
-		pgd_populate(mm, pgd, new);
+		pgd_populate(mm, p4d, new);
+#endif /* __ARCH_HAS_5LEVEL_HACK */
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -3783,6 +3917,7 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
 		pte_t **ptepp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *ptep;
@@ -3791,7 +3926,11 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
 		goto out;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+		goto out;
+
+	pud = pud_offset(p4d, address);
 	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
 		goto out;
 
diff --git a/mm/mlock.c b/mm/mlock.c
index cdbed8aaa426..81fe0c18e29d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -379,6 +379,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 	pte = get_locked_pte(vma->vm_mm, start,	&ptl);
 	/* Make sure we do not cross the page table boundary */
 	end = pgd_addr_end(start, end);
+	end = p4d_addr_end(start, end);
 	end = pud_addr_end(start, end);
 	end = pmd_addr_end(start, end);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f9c07f54dd62..af677fe954bc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -209,14 +209,14 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 }
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
-		pgd_t *pgd, unsigned long addr, unsigned long end,
+		p4d_t *p4d, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
 {
 	pud_t *pud;
 	unsigned long next;
 	unsigned long pages = 0;
 
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
@@ -228,6 +228,26 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	return pages;
 }
 
+static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
+		pgd_t *pgd, unsigned long addr, unsigned long end,
+		pgprot_t newprot, int dirty_accountable, int prot_numa)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	unsigned long pages = 0;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		pages += change_pud_range(vma, p4d, addr, next, newprot,
+				 dirty_accountable, prot_numa);
+	} while (p4d++, addr = next, addr != end);
+
+	return pages;
+}
+
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable, int prot_numa)
@@ -246,7 +266,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		pages += change_pud_range(vma, pgd, addr, next, newprot,
+		pages += change_p4d_range(vma, pgd, addr, next, newprot,
 				 dirty_accountable, prot_numa);
 	} while (pgd++, addr = next, addr != end);
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 30d7d2482eea..2b3bfcd51c75 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -31,6 +31,7 @@
 static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
@@ -38,7 +39,11 @@ static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
 	if (pgd_none_or_clear_bad(pgd))
 		return NULL;
 
-	pud = pud_offset(pgd, addr);
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none_or_clear_bad(p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
 	if (pud_none_or_clear_bad(pud))
 		return NULL;
 
@@ -53,11 +58,15 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 			    unsigned long addr)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return NULL;
+	pud = pud_alloc(mm, p4d, addr);
 	if (!pud)
 		return NULL;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 207244489a68..0020f340abfd 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -69,14 +69,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pud_t *pud;
 	unsigned long next;
 	int err = 0;
 
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud)) {
@@ -95,6 +95,32 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return err;
 }
 
+static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			  struct mm_walk *walk)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int err = 0;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d)) {
+			if (walk->pte_hole)
+				err = walk->pte_hole(addr, next, walk);
+			if (err)
+				break;
+			continue;
+		}
+		if (walk->pmd_entry || walk->pte_entry)
+			err = walk_pud_range(p4d, addr, next, walk);
+		if (err)
+			break;
+	} while (p4d++, addr = next, addr != end);
+
+	return err;
+}
+
 static int walk_pgd_range(unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
@@ -113,7 +139,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 			continue;
 		}
 		if (walk->pmd_entry || walk->pte_entry)
-			err = walk_pud_range(pgd, addr, next, walk);
+			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 71c5f9109f2a..738e278b48c1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -22,6 +22,12 @@ void pgd_clear_bad(pgd_t *pgd)
 	pgd_clear(pgd);
 }
 
+void p4d_clear_bad(p4d_t *p4d)
+{
+	p4d_ERROR(*p4d);
+	p4d_clear(p4d);
+}
+
 void pud_clear_bad(pud_t *pud)
 {
 	pud_ERROR(*pud);
diff --git a/mm/rmap.c b/mm/rmap.c
index 91619fd70939..257cb67892ee 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -684,6 +684,7 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
 	pmd_t pmde;
@@ -692,7 +693,11 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	if (!pgd_present(*pgd))
 		goto out;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		goto out;
+
+	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		goto out;
 
@@ -797,6 +802,7 @@ bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
 				  pte_t **ptep, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -816,7 +822,10 @@ bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
 		return false;
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return false;
+	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		return false;
 	pmd = pmd_offset(pud, address);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 574c67b663fe..a56c3989f773 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -196,9 +196,9 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 	return pmd;
 }
 
-pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
+pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
-	pud_t *pud = pud_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
 		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
 		if (!p)
@@ -208,6 +208,18 @@ pud_t * __meminit vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node)
 	return pud;
 }
 
+p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
+{
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d)) {
+		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!p)
+			return NULL;
+		p4d_populate(&init_mm, p4d, p);
+	}
+	return p4d;
+}
+
 pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
@@ -225,6 +237,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
 {
 	unsigned long addr = start;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -233,7 +246,10 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
 			return -ENOMEM;
-		pud = vmemmap_pud_populate(pgd, addr, node);
+		p4d = vmemmap_p4d_populate(pgd, addr, node);
+		if (!p4d)
+			return -ENOMEM;
+		pud = vmemmap_pud_populate(p4d, addr, node);
 		if (!pud)
 			return -ENOMEM;
 		pmd = vmemmap_pmd_populate(pud, addr, node);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1c6e0321205d..0d3bd27dba69 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1245,7 +1245,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	return 0;
 }
 
-static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline int unuse_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
 				swp_entry_t entry, struct page *page)
 {
@@ -1253,7 +1253,7 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 	unsigned long next;
 	int ret;
 
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
@@ -1265,6 +1265,26 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 	return 0;
 }
 
+static inline int unuse_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
+				unsigned long addr, unsigned long end,
+				swp_entry_t entry, struct page *page)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int ret;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		ret = unuse_pud_range(vma, p4d, addr, next, entry, page);
+		if (ret)
+			return ret;
+	} while (p4d++, addr = next, addr != end);
+	return 0;
+}
+
 static int unuse_vma(struct vm_area_struct *vma,
 				swp_entry_t entry, struct page *page)
 {
@@ -1288,7 +1308,7 @@ static int unuse_vma(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
+		ret = unuse_p4d_range(vma, pgd, addr, next, entry, page);
 		if (ret)
 			return ret;
 	} while (pgd++, addr = next, addr != end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af817e5060fb..721681deba9f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -124,19 +124,22 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
-	pmd_t *pmd = NULL;
 
 	pgd = pgd_offset(mm, address);
-	pud = pud_alloc(mm, pgd, address);
-	if (pud)
-		/*
-		 * Note that we didn't run this because the pmd was
-		 * missing, the *pmd may be already established and in
-		 * turn it may also be a trans_huge_pmd.
-		 */
-		pmd = pmd_alloc(mm, pud, address);
-	return pmd;
+	p4d = p4d_alloc(mm, pgd, address);
+	if (!p4d)
+		return NULL;
+	pud = pud_alloc(mm, p4d, address);
+	if (!pud)
+		return NULL;
+	/*
+	 * Note that we didn't run this because the pmd was
+	 * missing, the *pmd may be already established and in
+	 * turn it may also be a trans_huge_pmd.
+	 */
+	return pmd_alloc(mm, pud, address);
 }
 
 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3ca82d44edd3..35ff0d0e6400 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -86,12 +86,12 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
 	} while (pmd++, addr = next, addr != end);
 }
 
-static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_offset(pgd, addr);
+	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_clear_huge(pud))
@@ -102,6 +102,22 @@ static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
 	} while (pud++, addr = next, addr != end);
 }
 
+static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (p4d_clear_huge(p4d))
+			continue;
+		if (p4d_none_or_clear_bad(p4d))
+			continue;
+		vunmap_pud_range(p4d, addr, next);
+	} while (p4d++, addr = next, addr != end);
+}
+
 static void vunmap_page_range(unsigned long addr, unsigned long end)
 {
 	pgd_t *pgd;
@@ -113,7 +129,7 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		vunmap_pud_range(pgd, addr, next);
+		vunmap_p4d_range(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
 
@@ -160,13 +176,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 	return 0;
 }
 
-static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, pgd, addr);
+	pud = pud_alloc(&init_mm, p4d, addr);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -177,6 +193,23 @@ static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
 	return 0;
 }
 
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_alloc(&init_mm, pgd, addr);
+	if (!p4d)
+		return -ENOMEM;
+	do {
+		next = p4d_addr_end(addr, end);
+		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+			return -ENOMEM;
+	} while (p4d++, addr = next, addr != end);
+	return 0;
+}
+
 /*
  * Set up page tables in kva (addr, end). The ptes shall have prot "prot", and
  * will have pfns corresponding to the "pages" array.
@@ -196,7 +229,7 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_pud_range(pgd, addr, next, prot, pages, &nr);
+		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
@@ -237,6 +270,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	unsigned long addr = (unsigned long) vmalloc_addr;
 	struct page *page = NULL;
 	pgd_t *pgd = pgd_offset_k(addr);
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep, pte;
 
 	/*
 	 * XXX we might need to change this if we add VIRTUAL_BUG_ON for
@@ -244,21 +281,23 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	 */
 	VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
 
-	if (!pgd_none(*pgd)) {
-		pud_t *pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			pmd_t *pmd = pmd_offset(pud, addr);
-			if (!pmd_none(*pmd)) {
-				pte_t *ptep, pte;
-
-				ptep = pte_offset_map(pmd, addr);
-				pte = *ptep;
-				if (pte_present(pte))
-					page = pte_page(pte);
-				pte_unmap(ptep);
-			}
-		}
-	}
+	if (pgd_none(*pgd))
+		return NULL;
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	ptep = pte_offset_map(pmd, addr);
+	pte = *ptep;
+	if (pte_present(pte))
+		page = pte_page(pte);
+	pte_unmap(ptep);
 	return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.11.0

* [PATCHv2 07/29] mm: introduce __p4d_alloc()
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 06/29] mm: convert generic code to 5-level paging Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 08/29] x86: basic changes into headers for 5-level paging Kirill A. Shutemov
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

For full 5-level paging we need a helper to allocate a p4d page table.
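
For context, the inline fast path added in patch 06 pairs with this
helper: __p4d_alloc() is called only when the pgd entry is empty, and
only the slow path below takes the page-table lock. A minimal sketch,
copied in shape from the <linux/mm.h> conversion earlier in this
series:

	static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
			unsigned long address)
	{
		/* Fast path: no lock needed if the entry is already populated. */
		return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
			NULL : p4d_offset(pgd, address);
	}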

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 267cee723aa1..de3582840cbc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3851,6 +3851,29 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifndef __PAGETABLE_P4D_FOLDED
+/*
+ * Allocate p4d page table.
+ * We've already handled the fast-path in-line.
+ */
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+	p4d_t *new = p4d_alloc_one(mm, address);
+	if (!new)
+		return -ENOMEM;
+
+	smp_wmb(); /* See comment in __pte_alloc */
+
+	spin_lock(&mm->page_table_lock);
+	if (pgd_present(*pgd))		/* Another has populated it */
+		p4d_free(mm, new);
+	else
+		pgd_populate(mm, pgd, new);
+	spin_unlock(&mm->page_table_lock);
+	return 0;
+}
+#endif /* __PAGETABLE_P4D_FOLDED */
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
-- 
2.11.0

* [PATCHv2 08/29] x86: basic changes into headers for 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 07/29] mm: introduce __p4d_alloc() Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 09/29] x86: trivial portion of 5-level paging conversion Kirill A. Shutemov
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch extends the x86 headers to enable 5-level paging support.

It is still based on <asm-generic/5level-fixup.h>; we will get to the
point where <asm-generic/pgtable-nop4d.h> can be used later.
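
For readers unfamiliar with the fixup header, here is a hedged,
abridged sketch of what <asm-generic/5level-fixup.h> provides: it
aliases the p4d level onto the pgd so the new generic walkers compile
unchanged on 4-level configurations (abridged; see the real header for
the full set of definitions):

	#define __ARCH_HAS_5LEVEL_HACK
	#define __PAGETABLE_P4D_FOLDED

	#define P4D_SHIFT	PGDIR_SHIFT
	#define P4D_SIZE	PGDIR_SIZE
	#define P4D_MASK	PGDIR_MASK
	#define PTRS_PER_P4D	1

	#define p4d_t		pgd_t

	/* A folded p4d is just the pgd entry itself... */
	#define p4d_offset(pgd, start)		(pgd)
	#define p4d_alloc(mm, pgd, address)	(pgd)
	/* ...and it can never be missing or bad. */
	#define p4d_none(p4d)			0
	#define p4d_bad(p4d)			0
	#define p4d_present(p4d)		1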

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable-2level_types.h |  1 +
 arch/x86/include/asm/pgtable-3level_types.h |  1 +
 arch/x86/include/asm/pgtable.h              | 16 +++++++++++++++
 arch/x86/include/asm/pgtable_64_types.h     |  1 +
 arch/x86/include/asm/pgtable_types.h        | 30 ++++++++++++++++++++++++++++-
 5 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 392576433e77..373ab1de909f 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -7,6 +7,7 @@
 typedef unsigned long	pteval_t;
 typedef unsigned long	pmdval_t;
 typedef unsigned long	pudval_t;
+typedef unsigned long	p4dval_t;
 typedef unsigned long	pgdval_t;
 typedef unsigned long	pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index bcc89625ebe5..b8a4341faafa 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -7,6 +7,7 @@
 typedef u64	pteval_t;
 typedef u64	pmdval_t;
 typedef u64	pudval_t;
+typedef u64	p4dval_t;
 typedef u64	pgdval_t;
 typedef u64	pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 437feb436efa..54b6632723d5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -168,6 +168,17 @@ static inline unsigned long pud_pfn(pud_t pud)
 	return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
+static inline unsigned long p4d_pfn(p4d_t p4d)
+{
+	return (p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT;
+}
+
+static inline int p4d_large(p4d_t p4d)
+{
+	/* No 512 GiB pages yet */
+	return 0;
+}
+
 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
 
 static inline int pmd_large(pmd_t pte)
@@ -660,6 +671,11 @@ static inline int pud_large(pud_t pud)
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
+static inline unsigned long p4d_index(unsigned long address)
+{
+	return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline int pgd_present(pgd_t pgd)
 {
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3a264200c62f..0b2797e5083c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -13,6 +13,7 @@
 typedef unsigned long	pteval_t;
 typedef unsigned long	pmdval_t;
 typedef unsigned long	pudval_t;
+typedef unsigned long	p4dval_t;
 typedef unsigned long	pgdval_t;
 typedef unsigned long	pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 62484333673d..df08535f774a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -272,9 +272,20 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
 	return native_pgd_val(pgd) & PTE_FLAGS_MASK;
 }
 
-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
+
+#error FIXME
+
+#else
 #include <asm-generic/5level-fixup.h>
 
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+	return native_pgd_val(p4d);
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 3
 typedef struct { pudval_t pud; } pud_t;
 
 static inline pud_t native_make_pud(pmdval_t val)
@@ -318,6 +329,22 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
 }
 #endif
 
+static inline p4dval_t p4d_pfn_mask(p4d_t p4d)
+{
+	/* No 512 GiB huge pages yet */
+	return PTE_PFN_MASK;
+}
+
+static inline p4dval_t p4d_flags_mask(p4d_t p4d)
+{
+	return ~p4d_pfn_mask(p4d);
+}
+
+static inline p4dval_t p4d_flags(p4d_t p4d)
+{
+	return native_p4d_val(p4d) & p4d_flags_mask(p4d);
+}
+
 static inline pudval_t pud_pfn_mask(pud_t pud)
 {
 	if (native_pud_val(pud) & _PAGE_PSE)
@@ -461,6 +488,7 @@ enum pg_level {
 	PG_LEVEL_4K,
 	PG_LEVEL_2M,
 	PG_LEVEL_1G,
+	PG_LEVEL_512G,
 	PG_LEVEL_NUM
 };
 
-- 
2.11.0

* [PATCHv2 09/29] x86: trivial portion of 5-level paging conversion
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 08/29] x86: basic changes into headers for 5-level paging Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 10/29] x86/gup: add 5-level paging support Kirill A. Shutemov
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch covers only the simple cases.
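
All of the hunks below follow one mechanical pattern: a p4d step is
inserted between the pgd and pud lookups. Schematically (a sketch of
the pattern, not a specific hunk from this patch):

	/* Before: 4-level walk. */
	pud = pud_offset(pgd, addr);

	/*
	 * After: one extra step. While the p4d level is folded,
	 * p4d_offset() simply returns the pgd entry, so the generated
	 * code is unchanged.
	 */
	p4d = p4d_offset(pgd, addr);
	pud = pud_offset(p4d, addr);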

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/tboot.c        |  6 +++++-
 arch/x86/kernel/vm86_32.c      |  6 +++++-
 arch/x86/mm/fault.c            | 39 +++++++++++++++++++++++++++++++++------
 arch/x86/mm/init_32.c          | 22 ++++++++++++++++------
 arch/x86/mm/ioremap.c          |  3 ++-
 arch/x86/mm/pgtable.c          |  4 +++-
 arch/x86/mm/pgtable_32.c       |  8 +++++++-
 arch/x86/platform/efi/efi_64.c | 13 +++++++++----
 arch/x86/power/hibernate_32.c  |  7 +++++--
 9 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index b868fa1b812b..5db0f33cbf2c 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -118,12 +118,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 			  pgprot_t prot)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
 	pgd = pgd_offset(&tboot_mm, vaddr);
-	pud = pud_alloc(&tboot_mm, pgd, vaddr);
+	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+	if (!p4d)
+		return -1;
+	pud = pud_alloc(&tboot_mm, p4d, vaddr);
 	if (!pud)
 		return -1;
 	pmd = pmd_alloc(&tboot_mm, pud, vaddr);
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index ec5d7545e6dc..50a09419f8f1 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -161,6 +161,7 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval)
 static void mark_screen_rdonly(struct mm_struct *mm)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -171,7 +172,10 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	pgd = pgd_offset(mm, 0xA0000);
 	if (pgd_none_or_clear_bad(pgd))
 		goto out;
-	pud = pud_offset(pgd, 0xA0000);
+	p4d = p4d_offset(pgd, 0xA0000);
+	if (p4d_none_or_clear_bad(p4d))
+		goto out;
+	pud = pud_offset(p4d, 0xA0000);
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e3254ca0eec4..59104b78e8a7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -252,6 +252,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 {
 	unsigned index = pgd_index(address);
 	pgd_t *pgd_k;
+	p4d_t *p4d, *p4d_k;
 	pud_t *pud, *pud_k;
 	pmd_t *pmd, *pmd_k;
 
@@ -264,10 +265,15 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	/*
 	 * set_pgd(pgd, *pgd_k); here would be useless on PAE
 	 * and redundant with the set_pmd() on non-PAE. As would
-	 * set_pud.
+	 * set_p4d/set_pud.
 	 */
-	pud = pud_offset(pgd, address);
-	pud_k = pud_offset(pgd_k, address);
+	p4d = p4d_offset(pgd, address);
+	p4d_k = p4d_offset(pgd_k, address);
+	if (!p4d_present(*p4d_k))
+		return NULL;
+
+	pud = pud_offset(p4d, address);
+	pud_k = pud_offset(p4d_k, address);
 	if (!pud_present(*pud_k))
 		return NULL;
 
@@ -383,6 +389,8 @@ static void dump_pagetable(unsigned long address)
 {
 	pgd_t *base = __va(read_cr3());
 	pgd_t *pgd = &base[pgd_index(address)];
+	p4d_t *p4d;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
@@ -391,7 +399,9 @@ static void dump_pagetable(unsigned long address)
 	if (!low_pfn(pgd_val(*pgd) >> PAGE_SHIFT) || !pgd_present(*pgd))
 		goto out;
 #endif
-	pmd = pmd_offset(pud_offset(pgd, address), address);
+	p4d = p4d_offset(pgd, address);
+	pud = pud_offset(p4d, address);
+	pmd = pmd_offset(pud, address);
 	printk(KERN_CONT "*pde = %0*Lx ", sizeof(*pmd) * 2, (u64)pmd_val(*pmd));
 
 	/*
@@ -525,6 +535,7 @@ static void dump_pagetable(unsigned long address)
 {
 	pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
 	pgd_t *pgd = base + pgd_index(address);
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -537,7 +548,15 @@ static void dump_pagetable(unsigned long address)
 	if (!pgd_present(*pgd))
 		goto out;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (bad_address(p4d))
+		goto bad;
+
+	printk("P4D %lx ", p4d_val(*p4d));
+	if (!p4d_present(*p4d) || p4d_large(*p4d))
+		goto out;
+
+	pud = pud_offset(p4d, address);
 	if (bad_address(pud))
 		goto bad;
 
@@ -1081,6 +1100,7 @@ static noinline int
 spurious_fault(unsigned long error_code, unsigned long address)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -1103,7 +1123,14 @@ spurious_fault(unsigned long error_code, unsigned long address)
 	if (!pgd_present(*pgd))
 		return 0;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return 0;
+
+	if (p4d_large(*p4d))
+		return spurious_fault_check(error_code, (pte_t *) p4d);
+
+	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		return 0;
 
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 928d657de829..9d9b0b20ed13 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -67,6 +67,7 @@ bool __read_mostly __vmalloc_start_set = false;
  */
 static pmd_t * __init one_md_table_init(pgd_t *pgd)
 {
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd_table;
 
@@ -75,13 +76,15 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
 		pmd_table = (pmd_t *)alloc_low_page();
 		paravirt_alloc_pmd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
 		set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
-		pud = pud_offset(pgd, 0);
+		p4d = p4d_offset(pgd, 0);
+		pud = pud_offset(p4d, 0);
 		BUG_ON(pmd_table != pmd_offset(pud, 0));
 
 		return pmd_table;
 	}
 #endif
-	pud = pud_offset(pgd, 0);
+	p4d = p4d_offset(pgd, 0);
+	pud = pud_offset(p4d, 0);
 	pmd_table = pmd_offset(pud, 0);
 
 	return pmd_table;
@@ -390,8 +393,11 @@ pte_t *kmap_pte;
 
 static inline pte_t *kmap_get_fixmap_pte(unsigned long vaddr)
 {
-	return pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr),
-			vaddr), vaddr), vaddr);
+	pgd_t *pgd = pgd_offset_k(vaddr);
+	p4d_t *p4d = p4d_offset(pgd, vaddr);
+	pud_t *pud = pud_offset(p4d, vaddr);
+	pmd_t *pmd = pmd_offset(pud, vaddr);
+	return pte_offset_kernel(pmd, vaddr);
 }
 
 static void __init kmap_init(void)
@@ -410,6 +416,7 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
 {
 	unsigned long vaddr;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -418,7 +425,8 @@ static void __init permanent_kmaps_init(pgd_t *pgd_base)
 	page_table_range_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
 
 	pgd = swapper_pg_dir + pgd_index(vaddr);
-	pud = pud_offset(pgd, vaddr);
+	p4d = p4d_offset(pgd, vaddr);
+	pud = pud_offset(p4d, vaddr);
 	pmd = pmd_offset(pud, vaddr);
 	pte = pte_offset_kernel(pmd, vaddr);
 	pkmap_page_table = pte;
@@ -450,6 +458,7 @@ void __init native_pagetable_init(void)
 {
 	unsigned long pfn, va;
 	pgd_t *pgd, *base = swapper_pg_dir;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -469,7 +478,8 @@ void __init native_pagetable_init(void)
 		if (!pgd_present(*pgd))
 			break;
 
-		pud = pud_offset(pgd, va);
+		p4d = p4d_offset(pgd, va);
+		pud = pud_offset(p4d, va);
 		pmd = pmd_offset(pud, va);
 		if (!pmd_present(*pmd))
 			break;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 7aaa2635862d..a5e1cda85974 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -425,7 +425,8 @@ static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
 	/* Don't assume we're using swapper_pg_dir at this point */
 	pgd_t *base = __va(read_cr3());
 	pgd_t *pgd = &base[pgd_index(addr)];
-	pud_t *pud = pud_offset(pgd, addr);
+	p4d_t *p4d = p4d_offset(pgd, addr);
+	pud_t *pud = pud_offset(p4d, addr);
 	pmd_t *pmd = pmd_offset(pud, addr);
 
 	return pmd;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3feec5af4e67..cc6fcd4040e2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -261,13 +261,15 @@ static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
 
 static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
 {
+	p4d_t *p4d;
 	pud_t *pud;
 	int i;
 
 	if (PREALLOCATED_PMDS == 0) /* Work around gcc-3.4.x bug */
 		return;
 
-	pud = pud_offset(pgd, 0);
+	p4d = p4d_offset(pgd, 0);
+	pud = pud_offset(p4d, 0);
 
 	for (i = 0; i < PREALLOCATED_PMDS; i++, pud++) {
 		pmd_t *pmd = pmds[i];
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index 9adce776852b..3d275a791c76 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -26,6 +26,7 @@ unsigned int __VMALLOC_RESERVE = 128 << 20;
 void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -35,7 +36,12 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 		BUG();
 		return;
 	}
-	pud = pud_offset(pgd, vaddr);
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		BUG();
+		return;
+	}
+	pud = pud_offset(p4d, vaddr);
 	if (pud_none(*pud)) {
 		BUG();
 		return;
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 319148bd4b05..5aabfa3690dd 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -166,6 +166,7 @@ void efi_sync_low_kernel_mappings(void)
 {
 	unsigned num_entries;
 	pgd_t *pgd_k, *pgd_efi;
+	p4d_t *p4d_k, *p4d_efi;
 	pud_t *pud_k, *pud_efi;
 
 	if (efi_enabled(EFI_OLD_MEMMAP))
@@ -197,16 +198,20 @@ void efi_sync_low_kernel_mappings(void)
 	BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
 
 	pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
-	pud_efi = pud_offset(pgd_efi, 0);
+	p4d_efi = p4d_offset(pgd_efi, 0);
+	pud_efi = pud_offset(p4d_efi, 0);
 
 	pgd_k = pgd_offset_k(EFI_VA_END);
-	pud_k = pud_offset(pgd_k, 0);
+	p4d_k = p4d_offset(pgd_k, 0);
+	pud_k = pud_offset(p4d_k, 0);
 
 	num_entries = pud_index(EFI_VA_END);
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 
-	pud_efi = pud_offset(pgd_efi, EFI_VA_START);
-	pud_k = pud_offset(pgd_k, EFI_VA_START);
+	p4d_efi = p4d_offset(pgd_efi, EFI_VA_START);
+	pud_efi = pud_offset(p4d_efi, EFI_VA_START);
+	p4d_k = p4d_offset(pgd_k, EFI_VA_START);
+	pud_k = pud_offset(p4d_k, EFI_VA_START);
 
 	num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
 	memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
diff --git a/arch/x86/power/hibernate_32.c b/arch/x86/power/hibernate_32.c
index 9f14bd34581d..c35fdb585c68 100644
--- a/arch/x86/power/hibernate_32.c
+++ b/arch/x86/power/hibernate_32.c
@@ -32,6 +32,7 @@ pgd_t *resume_pg_dir;
  */
 static pmd_t *resume_one_md_table_init(pgd_t *pgd)
 {
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd_table;
 
@@ -41,11 +42,13 @@ static pmd_t *resume_one_md_table_init(pgd_t *pgd)
 		return NULL;
 
 	set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
-	pud = pud_offset(pgd, 0);
+	p4d = p4d_offset(pgd, 0);
+	pud = pud_offset(p4d, 0);
 
 	BUG_ON(pmd_table != pmd_offset(pud, 0));
 #else
-	pud = pud_offset(pgd, 0);
+	p4d = p4d_offset(pgd, 0);
+	pud = pud_offset(p4d, 0);
 	pmd_table = pmd_offset(pud, 0);
 #endif
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 10/29] x86/gup: add 5-level paging support
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 09/29] x86: trivial portion of 5-level paging conversion Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 11/29] x86/ident_map: " Kirill A. Shutemov
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

It's simply an extension for one more page table level.

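For orientation, the call chain after this patch (an illustrative
sketch, not part of the diff; the function names match the hunks below):

	__get_user_pages_fast()
	    -> gup_p4d_range()     /* new: walks p4d entries under one pgd */
	        -> gup_pud_range() /* now takes a p4d_t instead of a pgd_t */
	            -> gup_pmd_range()
	                -> gup_pte_range()

With a folded p4d (4-level build), p4d_offset(&pgd, addr) just returns
the pgd entry, so the extra hop compiles down to the old behaviour.
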
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/gup.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 0d4fb3ebbbac..27b92430a8cd 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -76,9 +76,9 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
 }
 
 /*
- * 'pteval' can come from a pte, pmd or pud.  We only check
+ * 'pteval' can come from a pte, pmd, pud or p4d.  We only check
  * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
- * same value on all 3 types.
+ * same value on all 4 types.
  */
 static inline int pte_allows_gup(unsigned long pteval, int write)
 {
@@ -270,13 +270,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	return 1;
 }
 
-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 			int write, struct page **pages, int *nr)
 {
 	unsigned long next;
 	pud_t *pudp;
 
-	pudp = pud_offset(&pgd, addr);
+	pudp = pud_offset(&p4d, addr);
 	do {
 		pud_t pud = *pudp;
 
@@ -295,6 +295,27 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	return 1;
 }
 
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+			int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	p4d_t *p4dp;
+
+	p4dp = p4d_offset(&pgd, addr);
+	do {
+		p4d_t p4d = *p4dp;
+
+		next = p4d_addr_end(addr, end);
+		if (p4d_none(p4d))
+			return 0;
+		BUILD_BUG_ON(p4d_large(p4d));
+		if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+			return 0;
+	} while (p4dp++, addr = next, addr != end);
+
+	return 1;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -343,7 +364,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			break;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
 			break;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_restore(flags);
@@ -415,7 +436,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			goto slow;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
 			goto slow;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_enable();
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 11/29] x86/ident_map: add 5-level paging support
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 10/29] x86/gup: add 5-level paging support Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 12/29] x86/mm: add support of p4d_t in vmalloc_fault() Kirill A. Shutemov
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Nothing special: just handle one more level.

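The shape of the change, as a sketch: the pgd loop now hands each entry
to a new ident_p4d_init(), which walks p4d entries and calls the
unchanged ident_pud_init():

	kernel_ident_mapping_init()    /* walks pgd entries */
	    -> ident_p4d_init()        /* new: walks p4d entries */
	        -> ident_pud_init()    /* unchanged from here down */
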
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/ident_map.c | 42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 4473cb4f8b90..f1a620b9f2ef 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -45,6 +45,34 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
 	return 0;
 }
 
+static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
+			  unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+
+	for (; addr < end; addr = next) {
+		p4d_t *p4d = p4d_page + p4d_index(addr);
+		pud_t *pud;
+
+		next = (addr & P4D_MASK) + P4D_SIZE;
+		if (next > end)
+			next = end;
+
+		if (p4d_present(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			ident_pud_init(info, pud, addr, next);
+			continue;
+		}
+		pud = (pud_t *)info->alloc_pgt_page(info->context);
+		if (!pud)
+			return -ENOMEM;
+		ident_pud_init(info, pud, addr, next);
+		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+	}
+
+	return 0;
+}
+
 int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
 			      unsigned long pstart, unsigned long pend)
 {
@@ -55,27 +83,27 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
 
 	for (; addr < end; addr = next) {
 		pgd_t *pgd = pgd_page + pgd_index(addr);
-		pud_t *pud;
+		p4d_t *p4d;
 
 		next = (addr & PGDIR_MASK) + PGDIR_SIZE;
 		if (next > end)
 			next = end;
 
 		if (pgd_present(*pgd)) {
-			pud = pud_offset(pgd, 0);
-			result = ident_pud_init(info, pud, addr, next);
+			p4d = p4d_offset(pgd, 0);
+			result = ident_p4d_init(info, p4d, addr, next);
 			if (result)
 				return result;
 			continue;
 		}
 
-		pud = (pud_t *)info->alloc_pgt_page(info->context);
-		if (!pud)
+		p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+		if (!p4d)
 			return -ENOMEM;
-		result = ident_pud_init(info, pud, addr, next);
+		result = ident_p4d_init(info, p4d, addr, next);
 		if (result)
 			return result;
-		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
 	}
 
 	return 0;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 12/29] x86/mm: add support of p4d_t in vmalloc_fault()
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 11/29] x86/ident_map: " Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 13/29] x86/power: support p4d_t in hibernate code Kirill A. Shutemov
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 4-level paging, copying happens at the p4d level, as pgd_none() is
always false when p4d_t is folded.

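To see why, this is roughly what <asm-generic/pgtable-nop4d.h> provides
when the p4d level is folded (a paraphrased sketch, not the literal
header):

	/* p4d folded into pgd: a p4d entry *is* the pgd entry */
	typedef struct { pgd_t pgd; } p4d_t;

	static inline int pgd_none(pgd_t pgd)	{ return 0; }

	static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
	{
		return (p4d_t *)pgd;	/* walk continues on the same entry */
	}

So on a 4-level build the p4d_none()/set_p4d() pair below operates on
the same top-level entry that pgd_none()/set_pgd() used to handle.
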
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/fault.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 59104b78e8a7..b4bcf2dac4a9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -434,6 +434,7 @@ void vmalloc_sync_all(void)
 static noinline int vmalloc_fault(unsigned long address)
 {
 	pgd_t *pgd, *pgd_ref;
+	p4d_t *p4d, *p4d_ref;
 	pud_t *pud, *pud_ref;
 	pmd_t *pmd, *pmd_ref;
 	pte_t *pte, *pte_ref;
@@ -461,13 +462,26 @@ static noinline int vmalloc_fault(unsigned long address)
 		BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
 	}
 
+	/* With 4-level paging, copying happens at the p4d level. */
+	p4d = p4d_offset(pgd, address);
+	p4d_ref = p4d_offset(pgd_ref, address);
+	if (p4d_none(*p4d_ref))
+		return -1;
+
+	if (p4d_none(*p4d)) {
+		set_p4d(p4d, *p4d_ref);
+		arch_flush_lazy_mmu_mode();
+	} else {
+		BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_ref));
+	}
+
 	/*
 	 * Below here mismatches are bugs because these lower tables
 	 * are shared:
 	 */
 
-	pud = pud_offset(pgd, address);
-	pud_ref = pud_offset(pgd_ref, address);
+	pud = pud_offset(p4d, address);
+	pud_ref = pud_offset(p4d_ref, address);
 	if (pud_none(*pud_ref))
 		return -1;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 13/29] x86/power: support p4d_t in hibernate code
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 12/29] x86/mm: add support of p4d_t in vmalloc_fault() Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 14/29] x86/kexec: support p4d_t Kirill A. Shutemov
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

set_up_temporary_text_mapping() and relocate_restore_code() require
trivial adjustments to handle the additional page table level.

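As a sketch, the temporary text mapping built by
set_up_temporary_text_mapping() now chains four table levels before the
2M leaf mapping (assuming the p4d page is allocated the same way as the
pud and pmd pages):

	pgd[pgd_index(va)] -> p4d page
	p4d[p4d_index(va)] -> pud page
	pud[pud_index(va)] -> pmd page
	pmd[pmd_index(va)] -> 2M page at jump_address_phys
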
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/power/hibernate_64.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index ded2e8272382..22de6e29704f 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -49,6 +49,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 {
 	pmd_t *pmd;
 	pud_t *pud;
+	p4d_t *p4d;
 
 	/*
 	 * The new mapping only has to cover the page containing the image
@@ -75,8 +76,10 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
 	set_pud(pud + pud_index(restore_jump_address),
 		__pud(__pa(pmd) | _KERNPG_TABLE));
+	set_p4d(p4d + p4d_index(restore_jump_address),
+		__p4d(__pa(pud) | _KERNPG_TABLE));
 	set_pgd(pgd + pgd_index(restore_jump_address),
-		__pgd(__pa(pud) | _KERNPG_TABLE));
+		__pgd(__pa(p4d) | _KERNPG_TABLE));
 
 	return 0;
 }
@@ -124,7 +127,10 @@ static int set_up_temporary_mappings(void)
 static int relocate_restore_code(void)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
 	relocated_restore_code = get_safe_page(GFP_ATOMIC);
 	if (!relocated_restore_code)
@@ -134,22 +140,25 @@ static int relocate_restore_code(void)
 
 	/* Make the page containing the relocated code executable */
 	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
-	pud = pud_offset(pgd, relocated_restore_code);
+	p4d = p4d_offset(pgd, relocated_restore_code);
+	if (p4d_large(*p4d)) {
+		set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX));
+		goto out;
+	}
+	pud = pud_offset(p4d, relocated_restore_code);
 	if (pud_large(*pud)) {
 		set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
-	} else {
-		pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
-
-		if (pmd_large(*pmd)) {
-			set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
-		} else {
-			pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
-
-			set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
-		}
+		goto out;
+	}
+	pmd = pmd_offset(pud, relocated_restore_code);
+	if (pmd_large(*pmd)) {
+		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+		goto out;
 	}
+	pte = pte_offset_kernel(pmd, relocated_restore_code);
+	set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+out:
 	__flush_tlb_all();
-
 	return 0;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 14/29] x86/kexec: support p4d_t
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 13/29] x86/power: support p4d_t in hibernate code Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:53 ` [PATCHv2 15/29] x86: convert the rest of the code to " Kirill A. Shutemov
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Handle the additional page table level in the kexec code.

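The transition page table then owns one page per level, each tracked in
struct kimage_arch so that free_transition_pgtable() can release it
(sketch):

	pgd entry
	    -> image->arch.p4d
	        -> image->arch.pud
	            -> image->arch.pmd
	                -> image->arch.pte -> page with relocate_kernel code
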
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/kexec.h       |  1 +
 arch/x86/kernel/machine_kexec_32.c |  4 +++-
 arch/x86/kernel/machine_kexec_64.c | 13 ++++++++++++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 282630e4c6ea..70ef205489f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -164,6 +164,7 @@ struct kimage_arch {
 };
 #else
 struct kimage_arch {
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index 469b23d6acc2..5f43cec296c5 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -103,6 +103,7 @@ static void machine_kexec_page_table_set_one(
 	pgd_t *pgd, pmd_t *pmd, pte_t *pte,
 	unsigned long vaddr, unsigned long paddr)
 {
+	p4d_t *p4d;
 	pud_t *pud;
 
 	pgd += pgd_index(vaddr);
@@ -110,7 +111,8 @@ static void machine_kexec_page_table_set_one(
 	if (!(pgd_val(*pgd) & _PAGE_PRESENT))
 		set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT));
 #endif
-	pud = pud_offset(pgd, vaddr);
+	p4d = p4d_offset(pgd, vaddr);
+	pud = pud_offset(p4d, vaddr);
 	pmd = pmd_offset(pud, vaddr);
 	if (!(pmd_val(*pmd) & _PAGE_PRESENT))
 		set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 307b1f4543de..c325967de4bc 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -36,6 +36,7 @@ static struct kexec_file_ops *kexec_file_loaders[] = {
 
 static void free_transition_pgtable(struct kimage *image)
 {
+	free_page((unsigned long)image->arch.p4d);
 	free_page((unsigned long)image->arch.pud);
 	free_page((unsigned long)image->arch.pmd);
 	free_page((unsigned long)image->arch.pte);
@@ -43,6 +44,7 @@ static void free_transition_pgtable(struct kimage *image)
 
 static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 {
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -59,7 +61,16 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 		image->arch.pud = pud;
 		set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
 	}
-	pud = pud_offset(pgd, vaddr);
+	p4d = p4d_offset(pgd, vaddr);
+	if (!p4d_present(*p4d)) {
+		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
+		if (!p4d)
+			goto err;
+		image->arch.p4d = p4d;
+		set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+		p4d = p4d_offset(pgd, vaddr);
+	}
+	pud = pud_offset(p4d, vaddr);
 	if (!pud_present(*pud)) {
 		pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
 		if (!pmd)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 15/29] x86: convert the rest of the code to support p4d_t
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 14/29] x86/kexec: support p4d_t Kirill A. Shutemov
@ 2016-12-27  1:53 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 16/29] x86: detect 5-level paging support Kirill A. Shutemov
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch converts x86 to use proper folding of new page table level
with <asm-generic/pgtable-nop4d.h>.

TODO: split it up further.
FIXME: XEN is broken.

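A sketch of the resulting type nesting, paraphrasing the asm-generic
headers: each folded level wraps the entry of the level above it, so on
a build with everything below the pgd folded (e.g. 2-level x86-32) the
native_*_val() changes below simply unwrap the nested structs:

	typedef struct { pgd_t pgd; } p4d_t;  /* pgtable-nop4d.h */
	typedef struct { p4d_t p4d; } pud_t;  /* pgtable-nopud.h */
	typedef struct { pud_t pud; } pmd_t;  /* pgtable-nopmd.h */

	/* hence native_pmd_val(pmd) == native_pgd_val(pmd.pud.p4d.pgd) */
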
Not-yet-Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/paravirt.h       |  33 +++++--
 arch/x86/include/asm/paravirt_types.h |  12 ++-
 arch/x86/include/asm/pgalloc.h        |  35 +++++++-
 arch/x86/include/asm/pgtable.h        |  75 +++++++++++++---
 arch/x86/include/asm/pgtable_64.h     |  12 ++-
 arch/x86/include/asm/pgtable_types.h  |  10 +--
 arch/x86/kernel/paravirt.c            |  10 ++-
 arch/x86/mm/init_64.c                 | 162 +++++++++++++++++++++++++++-------
 arch/x86/mm/kasan_init_64.c           |  12 ++-
 arch/x86/mm/pageattr.c                |  56 ++++++++----
 arch/x86/platform/efi/efi_64.c        |   8 +-
 arch/x86/xen/Kconfig                  |   1 +
 12 files changed, 342 insertions(+), 84 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca40694..432c6e730ed1 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -525,7 +525,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
 		PVOP_VCALL2(pv_mmu_ops.set_pud, pudp,
 			    val);
 }
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
 static inline pud_t __pud(pudval_t val)
 {
 	pudval_t ret;
@@ -554,6 +554,32 @@ static inline pudval_t pud_val(pud_t pud)
 	return ret;
 }
 
+static inline void pud_clear(pud_t *pudp)
+{
+	set_pud(pudp, __pud(0));
+}
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+	p4dval_t val = native_p4d_val(p4d);
+
+	if (sizeof(p4dval_t) > sizeof(long))
+		PVOP_VCALL3(pv_mmu_ops.set_p4d, p4dp,
+			    val, (u64)val >> 32);
+	else
+		PVOP_VCALL2(pv_mmu_ops.set_p4d, p4dp,
+			    val);
+}
+
+static inline void p4d_clear(p4d_t *p4dp)
+{
+	set_p4d(p4dp, __p4d(0));
+}
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+#error FIXME
+
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 	pgdval_t val = native_pgd_val(pgd);
@@ -571,10 +597,7 @@ static inline void pgd_clear(pgd_t *pgdp)
 	set_pgd(pgdp, __pgd(0));
 }
 
-static inline void pud_clear(pud_t *pudp)
-{
-	set_pud(pudp, __pud(0));
-}
+#endif  /* CONFIG_PGTABLE_LEVELS >= 5 */
 
 #endif	/* CONFIG_PGTABLE_LEVELS == 4 */
 
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index bb2de45a60f2..3982c200845f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -277,12 +277,18 @@ struct pv_mmu_ops {
 	struct paravirt_callee_save pmd_val;
 	struct paravirt_callee_save make_pmd;
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
 	struct paravirt_callee_save pud_val;
 	struct paravirt_callee_save make_pud;
 
-	void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+	void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif	/* CONFIG_PGTABLE_LEVELS >= 5 */
+
+#endif	/* CONFIG_PGTABLE_LEVELS >= 4 */
+
 #endif	/* CONFIG_PGTABLE_LEVELS >= 3 */
 
 	struct pv_lazy_ops lazy_mode;
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b6d425999f99..2f585054c63c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -121,10 +121,10 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 #endif	/* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 {
 	paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
-	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));
+	set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -150,6 +150,37 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 	___pud_free_tlb(tlb, pud);
 }
 
+#if CONFIG_PGTABLE_LEVELS > 4
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
+{
+	paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+	gfp_t gfp = GFP_KERNEL_ACCOUNT;
+
+	if (mm == &init_mm)
+		gfp &= ~__GFP_ACCOUNT;
+	return (p4d_t *)get_zeroed_page(gfp);
+}
+
+static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+	BUG_ON((unsigned long)p4d & (PAGE_SIZE-1));
+	free_page((unsigned long)p4d);
+}
+
+extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
+
+static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
+				  unsigned long address)
+{
+	___p4d_free_tlb(tlb, p4d);
+}
+
+#endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 54b6632723d5..398adab9a167 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -52,11 +52,19 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #define set_pmd(pmdp, pmd)		native_set_pmd(pmdp, pmd)
 
-#ifndef __PAGETABLE_PUD_FOLDED
+#ifndef __PAGETABLE_P4D_FOLDED
 #define set_pgd(pgdp, pgd)		native_set_pgd(pgdp, pgd)
 #define pgd_clear(pgd)			native_pgd_clear(pgd)
 #endif
 
+#ifndef set_p4d
+# define set_p4d(p4dp, p4d)		native_set_p4d(p4dp, p4d)
+#endif
+
+#ifndef __PAGETABLE_PUD_FOLDED
+#define p4d_clear(p4d)			native_p4d_clear(p4d)
+#endif
+
 #ifndef set_pud
 # define set_pud(pudp, pud)		native_set_pud(pudp, pud)
 #endif
@@ -73,6 +81,11 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page);
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)
 
+#ifndef __PAGETABLE_P4D_FOLDED
+#define p4d_val(x)	native_p4d_val(x)
+#define __p4d(x)	native_make_p4d(x)
+#endif
+
 #ifndef __PAGETABLE_PUD_FOLDED
 #define pud_val(x)	native_pud_val(x)
 #define __pud(x)	native_make_pud(x)
@@ -439,6 +452,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define pte_pgprot(x) __pgprot(pte_flags(x))
 #define pmd_pgprot(x) __pgprot(pmd_flags(x))
 #define pud_pgprot(x) __pgprot(pud_flags(x))
+#define p4d_pgprot(x) __pgprot(p4d_flags(x))
 
 #define canon_pgprot(p) __pgprot(massage_pgprot(p))
 
@@ -671,12 +685,58 @@ static inline int pud_large(pud_t pud)
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
+#if CONFIG_PGTABLE_LEVELS > 3
+static inline int p4d_none(p4d_t p4d)
+{
+	return (native_p4d_val(p4d) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+	return p4d_flags(p4d) & _PAGE_PRESENT;
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+	return (unsigned long)__va(p4d_val(p4d) & p4d_pfn_mask(p4d));
+}
+
+/*
+ * Currently stuck as a macro due to indirect forward reference to
+ * linux/mmzone.h's __section_mem_map_addr() definition:
+ */
+#define p4d_page(p4d)		\
+	pfn_to_page((p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT)
+
+/*
+ * the pud page can be thought of an array like this: pud_t[PTRS_PER_PUD]
+ *
+ * this macro returns the index of the entry in the pud page which would
+ * control the given virtual address
+ */
+static inline unsigned long pud_index(unsigned long address)
+{
+	return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+/* Find an entry in the third-level page table.. */
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+	return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+}
+#endif  /* CONFIG_PGTABLE_LEVELS > 3 */
+
 static inline unsigned long p4d_index(unsigned long address)
 {
 	return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
 }
 
-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
 static inline int pgd_present(pgd_t pgd)
 {
 	return pgd_flags(pgd) & _PAGE_PRESENT;
@@ -694,14 +754,9 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
 #define pgd_page(pgd)		pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT)
 
 /* to find an entry in a page-table-directory. */
-static inline unsigned long pud_index(unsigned long address)
-{
-	return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
-}
-
-static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 {
-	return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
+	return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
 }
 
 static inline int pgd_bad(pgd_t pgd)
@@ -719,7 +774,7 @@ static inline int pgd_none(pgd_t pgd)
 	 */
 	return !native_pgd_val(pgd);
 }
-#endif	/* CONFIG_PGTABLE_LEVELS > 3 */
+#endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 62b775926045..dff070a6d27e 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -41,7 +41,7 @@ extern void paging_init(void);
 
 struct mm_struct;
 
-void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
+void set_pte_vaddr_pud(pud_t *pud, unsigned long vaddr, pte_t new_pte);
 
 
 static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
@@ -106,6 +106,16 @@ static inline void native_pud_clear(pud_t *pud)
 	native_set_pud(pud, native_make_pud(0));
 }
 
+static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+	*p4dp = p4d;
+}
+
+static inline void native_p4d_clear(p4d_t *p4d)
+{
+	native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+}
+
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 	*pgdp = pgd;
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index df08535f774a..4930afe9df0a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -277,11 +277,11 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
 #error FIXME
 
 #else
-#include <asm-generic/5level-fixup.h>
+#include <asm-generic/pgtable-nop4d.h>
 
 static inline p4dval_t native_p4d_val(p4d_t p4d)
 {
-	return native_pgd_val(p4d);
+	return native_pgd_val(p4d.pgd);
 }
 #endif
 
@@ -298,12 +298,11 @@ static inline pudval_t native_pud_val(pud_t pud)
 	return pud.pud;
 }
 #else
-#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopud.h>
 
 static inline pudval_t native_pud_val(pud_t pud)
 {
-	return native_pgd_val(pud.pgd);
+	return native_pgd_val(pud.p4d.pgd);
 }
 #endif
 
@@ -320,12 +319,11 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
 	return pmd.pmd;
 }
 #else
-#define __ARCH_USE_5LEVEL_HACK
 #include <asm-generic/pgtable-nopmd.h>
 
 static inline pmdval_t native_pmd_val(pmd_t pmd)
 {
-	return native_pgd_val(pmd.pud.pgd);
+	return native_pgd_val(pmd.pud.p4d.pgd);
 }
 #endif
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a1bfba0f7234..f8aedc112d5e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -429,12 +429,16 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 	.pmd_val = PTE_IDENT,
 	.make_pmd = PTE_IDENT,
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
 	.pud_val = PTE_IDENT,
 	.make_pud = PTE_IDENT,
 
-	.set_pgd = native_set_pgd,
-#endif
+	.set_p4d = native_set_p4d,
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
 #endif /* CONFIG_PGTABLE_LEVELS >= 3 */
 
 	.pte_val = PTE_IDENT,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index af85b686a7b0..72527ece6130 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -97,28 +97,38 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 	unsigned long address;
 
 	for (address = start; address <= end; address += PGDIR_SIZE) {
-		const pgd_t *pgd_ref = pgd_offset_k(address);
+		pgd_t *pgd_ref = pgd_offset_k(address);
+		const p4d_t *p4d_ref;
 		struct page *page;
 
-		if (pgd_none(*pgd_ref))
+		/*
+		 * With a folded p4d, pgd_none() is always false; we need to
+		 * handle synchronization at the p4d level.
+		 */
+		BUILD_BUG_ON(pgd_none(*pgd_ref));
+		p4d_ref = p4d_offset(pgd_ref, address);
+
+		if (p4d_none(*p4d_ref))
 			continue;
 
 		spin_lock(&pgd_lock);
 		list_for_each_entry(page, &pgd_list, lru) {
 			pgd_t *pgd;
+			p4d_t *p4d;
 			spinlock_t *pgt_lock;
 
 			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			p4d = p4d_offset(pgd, address);
 			/* the pgt_lock only for Xen */
 			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
 			spin_lock(pgt_lock);
 
-			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
-				BUG_ON(pgd_page_vaddr(*pgd)
-				       != pgd_page_vaddr(*pgd_ref));
+			if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
+				BUG_ON(p4d_page_vaddr(*p4d)
+				       != p4d_page_vaddr(*p4d_ref));
 
-			if (pgd_none(*pgd))
-				set_pgd(pgd, *pgd_ref);
+			if (p4d_none(*p4d))
+				set_p4d(p4d, *p4d_ref);
 
 			spin_unlock(pgt_lock);
 		}
@@ -149,16 +159,28 @@ static __ref void *spp_getpage(void)
 	return ptr;
 }
 
-static pud_t *fill_pud(pgd_t *pgd, unsigned long vaddr)
+static p4d_t *fill_p4d(pgd_t *pgd, unsigned long vaddr)
 {
 	if (pgd_none(*pgd)) {
-		pud_t *pud = (pud_t *)spp_getpage();
-		pgd_populate(&init_mm, pgd, pud);
-		if (pud != pud_offset(pgd, 0))
+		p4d_t *p4d = (p4d_t *)spp_getpage();
+		pgd_populate(&init_mm, pgd, p4d);
+		if (p4d != p4d_offset(pgd, 0))
 			printk(KERN_ERR "PAGETABLE BUG #00! %p <-> %p\n",
-			       pud, pud_offset(pgd, 0));
+			       p4d, p4d_offset(pgd, 0));
 	}
-	return pud_offset(pgd, vaddr);
+	return p4d_offset(pgd, vaddr);
+}
+
+static pud_t *fill_pud(p4d_t *p4d, unsigned long vaddr)
+{
+	if (p4d_none(*p4d)) {
+		pud_t *pud = (pud_t *)spp_getpage();
+		p4d_populate(&init_mm, p4d, pud);
+		if (pud != pud_offset(p4d, 0))
+			printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+			       pud, pud_offset(p4d, 0));
+	}
+	return pud_offset(p4d, vaddr);
 }
 
 static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
@@ -167,7 +189,7 @@ static pmd_t *fill_pmd(pud_t *pud, unsigned long vaddr)
 		pmd_t *pmd = (pmd_t *) spp_getpage();
 		pud_populate(&init_mm, pud, pmd);
 		if (pmd != pmd_offset(pud, 0))
-			printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
+			printk(KERN_ERR "PAGETABLE BUG #02! %p <-> %p\n",
 			       pmd, pmd_offset(pud, 0));
 	}
 	return pmd_offset(pud, vaddr);
@@ -179,18 +201,16 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
 		pte_t *pte = (pte_t *) spp_getpage();
 		pmd_populate_kernel(&init_mm, pmd, pte);
 		if (pte != pte_offset_kernel(pmd, 0))
-			printk(KERN_ERR "PAGETABLE BUG #02!\n");
+			printk(KERN_ERR "PAGETABLE BUG #03!\n");
 	}
 	return pte_offset_kernel(pmd, vaddr);
 }
 
-void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
+void set_pte_vaddr_pud(pud_t *pud, unsigned long vaddr, pte_t new_pte)
 {
-	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
-	pud = pud_page + pud_index(vaddr);
 	pmd = fill_pmd(pud, vaddr);
 	pte = fill_pte(pmd, vaddr);
 
@@ -206,7 +226,8 @@ void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
 void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 {
 	pgd_t *pgd;
-	pud_t *pud_page;
+	p4d_t *p4d;
+	pud_t *pud;
 
 	pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(pteval));
 
@@ -216,17 +237,20 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
 			"PGD FIXMAP MISSING, it should be setup in head.S!\n");
 		return;
 	}
-	pud_page = (pud_t*)pgd_page_vaddr(*pgd);
-	set_pte_vaddr_pud(pud_page, vaddr, pteval);
+	p4d = fill_p4d(pgd, vaddr);
+	pud = fill_pud(p4d, vaddr);
+	set_pte_vaddr_pud(pud, vaddr, pteval);
 }
 
 pmd_t * __init populate_extra_pmd(unsigned long vaddr)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 
 	pgd = pgd_offset_k(vaddr);
-	pud = fill_pud(pgd, vaddr);
+	p4d = fill_p4d(pgd, vaddr);
+	pud = fill_pud(p4d, vaddr);
 	return fill_pmd(pud, vaddr);
 }
 
@@ -245,6 +269,7 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
 					enum page_cache_mode cache)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pgprot_t prot;
@@ -255,11 +280,17 @@ static void __init __init_extra_mapping(unsigned long phys, unsigned long size,
 	for (; size; phys += PMD_SIZE, size -= PMD_SIZE) {
 		pgd = pgd_offset_k((unsigned long)__va(phys));
 		if (pgd_none(*pgd)) {
+			p4d = (p4d_t *) spp_getpage();
+			set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE |
+						_PAGE_USER));
+		}
+		p4d = p4d_offset(pgd, (unsigned long)__va(phys));
+		if (p4d_none(*p4d)) {
 			pud = (pud_t *) spp_getpage();
-			set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE |
+			set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE |
 						_PAGE_USER));
 		}
-		pud = pud_offset(pgd, (unsigned long)__va(phys));
+		pud = pud_offset(p4d, (unsigned long)__va(phys));
 		if (pud_none(*pud)) {
 			pmd = (pmd_t *) spp_getpage();
 			set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE |
@@ -563,12 +594,15 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
+		p4d_t *p4d;
 		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		if (pgd_val(*pgd)) {
-			pud = (pud_t *)pgd_page_vaddr(*pgd);
+		BUILD_BUG_ON(pgd_none(*pgd));
+		p4d = p4d_offset(pgd, vaddr);
+		if (p4d_val(*p4d)) {
+			pud = (pud_t *)p4d_page_vaddr(*p4d);
 			paddr_last = phys_pud_init(pud, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
@@ -580,7 +614,7 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		pgd_populate(&init_mm, pgd, pud);
+		p4d_populate(&init_mm, p4d, pud);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
@@ -726,6 +760,24 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	/* free a pud table */
+	free_pagetable(p4d_page(*p4d), 0);
+	spin_lock(&init_mm.page_table_lock);
+	p4d_clear(p4d);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		 bool direct)
@@ -908,6 +960,32 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 		update_page_count(PG_LEVEL_1G, -pages);
 }
 
+static void __meminit
+remove_p4d_table(p4d_t *p4d_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long next, pages = 0;
+	pud_t *pud_base;
+	p4d_t *p4d;
+
+	p4d = p4d_start + p4d_index(addr);
+	for (; addr < end; addr = next, p4d++) {
+		next = p4d_addr_end(addr, end);
+
+		if (!p4d_present(*p4d))
+			continue;
+
+		BUILD_BUG_ON(p4d_large(*p4d));
+
+		pud_base = (pud_t *)p4d_page_vaddr(*p4d);
+		remove_pud_table(pud_base, addr, next, direct);
+		free_pud_table(pud_base, p4d);
+	}
+
+	if (direct)
+		update_page_count(PG_LEVEL_512G, -pages);
+}
+
 /* start and end are both virtual address. */
 static void __meminit
 remove_pagetable(unsigned long start, unsigned long end, bool direct)
@@ -915,7 +993,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	unsigned long next;
 	unsigned long addr;
 	pgd_t *pgd;
-	pud_t *pud;
+	p4d_t *p4d;
 
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
@@ -924,8 +1002,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 		if (!pgd_present(*pgd))
 			continue;
 
-		pud = (pud_t *)pgd_page_vaddr(*pgd);
-		remove_pud_table(pud, addr, next, direct);
+		p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+		remove_p4d_table(p4d, addr, next, direct);
 	}
 
 	flush_tlb_all();
@@ -1095,6 +1173,7 @@ int kern_addr_valid(unsigned long addr)
 {
 	unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -1106,7 +1185,11 @@ int kern_addr_valid(unsigned long addr)
 	if (pgd_none(*pgd))
 		return 0;
 
-	pud = pud_offset(pgd, addr);
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return 0;
+
+	pud = pud_offset(p4d, addr);
 	if (pud_none(*pud))
 		return 0;
 
@@ -1163,6 +1246,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	unsigned long addr;
 	unsigned long next;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
@@ -1173,7 +1257,11 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 		if (!pgd)
 			return -ENOMEM;
 
-		pud = vmemmap_pud_populate(pgd, addr, node);
+		p4d = vmemmap_p4d_populate(pgd, addr, node);
+		if (!p4d)
+			return -ENOMEM;
+
+		pud = vmemmap_pud_populate(p4d, addr, node);
 		if (!pud)
 			return -ENOMEM;
 
@@ -1241,6 +1329,7 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 	unsigned long end = (unsigned long)(start_page + size);
 	unsigned long next;
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	unsigned int nr_pages;
@@ -1256,7 +1345,14 @@ void register_page_bootmem_memmap(unsigned long section_nr,
 		}
 		get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
 
-		pud = pud_offset(pgd, addr);
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none(*p4d)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, p4d_page(*p4d), MIX_SECTION_INFO);
+
+		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
 			next = (addr + PAGE_SIZE) & PAGE_MASK;
 			continue;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0493c17b8a51..2964de48e177 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -31,8 +31,16 @@ static int __init map_range(struct range *range)
 static void __init clear_pgds(unsigned long start,
 			unsigned long end)
 {
-	for (; start < end; start += PGDIR_SIZE)
-		pgd_clear(pgd_offset_k(start));
+	pgd_t *pgd;
+
+	for (; start < end; start += PGDIR_SIZE) {
+		pgd = pgd_offset_k(start);
+#ifdef __PAGETABLE_P4D_FOLDED
+		p4d_clear(p4d_offset(pgd, start));
+#else
+		pgd_clear(pgd);
+#endif
+	}
 }
 
 static void __init kasan_map_early_shadow(pgd_t *pgd)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 5a287e523eab..8ec4baa84526 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -333,6 +333,7 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address,
 pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 			     unsigned int *level)
 {
+	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
@@ -341,7 +342,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 	if (pgd_none(*pgd))
 		return NULL;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	*level = PG_LEVEL_512G;
+	if (p4d_large(*p4d) || !p4d_present(*p4d))
+		return (pte_t *)p4d;
+
+	pud = pud_offset(p4d, address);
 	if (pud_none(*pud))
 		return NULL;
 
@@ -393,13 +402,18 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
 pmd_t *lookup_pmd_address(unsigned long address)
 {
 	pgd_t *pgd;
+	p4d_t *p4d;
 	pud_t *pud;
 
 	pgd = pgd_offset_k(address);
 	if (pgd_none(*pgd))
 		return NULL;
 
-	pud = pud_offset(pgd, address);
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d) || p4d_large(*p4d) || !p4d_present(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, address);
 	if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
 		return NULL;
 
@@ -464,11 +478,13 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 
 		list_for_each_entry(page, &pgd_list, lru) {
 			pgd_t *pgd;
+			p4d_t *p4d;
 			pud_t *pud;
 			pmd_t *pmd;
 
 			pgd = (pgd_t *)page_address(page) + pgd_index(address);
-			pud = pud_offset(pgd, address);
+			p4d = p4d_offset(pgd, address);
+			pud = pud_offset(p4d, address);
 			pmd = pmd_offset(pud, address);
 			set_pte_atomic((pte_t *)pmd, pte);
 		}
@@ -823,9 +839,9 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end)
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
+static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
-	pud_t *pud = pud_offset(pgd, start);
+	pud_t *pud = pud_offset(p4d, start);
 
 	/*
 	 * Not on a GB page boundary?
@@ -991,8 +1007,8 @@ static long populate_pmd(struct cpa_data *cpa,
 	return num_pages;
 }
 
-static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
-			 pgprot_t pgprot)
+static long populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
+			 pgprot_t pgprot)
 {
 	pud_t *pud;
 	unsigned long end;
@@ -1013,7 +1029,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 		cur_pages = (pre_end - start) >> PAGE_SHIFT;
 		cur_pages = min_t(int, (int)cpa->numpages, cur_pages);
 
-		pud = pud_offset(pgd, start);
+		pud = pud_offset(p4d, start);
 
 		/*
 		 * Need a PMD page?
@@ -1034,7 +1050,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 	if (cpa->numpages == cur_pages)
 		return cur_pages;
 
-	pud = pud_offset(pgd, start);
+	pud = pud_offset(p4d, start);
 	pud_pgprot = pgprot_4k_2_large(pgprot);
 
 	/*
@@ -1054,7 +1070,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
 	if (start < end) {
 		long tmp;
 
-		pud = pud_offset(pgd, start);
+		pud = pud_offset(p4d, start);
 		if (pud_none(*pud))
 			if (alloc_pmd_page(pud))
 				return -1;
@@ -1077,33 +1093,43 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr)
 {
 	pgprot_t pgprot = __pgprot(_KERNPG_TABLE);
 	pud_t *pud = NULL;	/* shut up gcc */
+	p4d_t *p4d;
 	pgd_t *pgd_entry;
 	long ret;
 
 	pgd_entry = cpa->pgd + pgd_index(addr);
 
+	if (pgd_none(*pgd_entry)) {
+		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
+		if (!p4d)
+			return -1;
+
+		set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE));
+	}
+
 	/*
 	 * Allocate a PUD page and hand it down for mapping.
 	 */
-	if (pgd_none(*pgd_entry)) {
+	p4d = p4d_offset(pgd_entry, addr);
+	if (p4d_none(*p4d)) {
 		pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
 		if (!pud)
 			return -1;
 
-		set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
+		set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
 	}
 
 	pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr);
 	pgprot_val(pgprot) |=  pgprot_val(cpa->mask_set);
 
-	ret = populate_pud(cpa, addr, pgd_entry, pgprot);
+	ret = populate_pud(cpa, addr, p4d, pgprot);
 	if (ret < 0) {
 		/*
 		 * Leave the PUD page in place in case some other CPU or thread
 		 * already found it, but remove any useless entries we just
 		 * added to it.
 		 */
-		unmap_pud_range(pgd_entry, addr,
+		unmap_pud_range(p4d, addr,
 				addr + (cpa->numpages << PAGE_SHIFT));
 		return ret;
 	}
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 5aabfa3690dd..67bccd946071 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -135,7 +135,7 @@ static pgd_t *efi_pgd;
 int __init efi_alloc_page_tables(void)
 {
 	pgd_t *pgd;
-	pud_t *pud;
+	p4d_t *p4d;
 	gfp_t gfp_mask;
 
 	if (efi_enabled(EFI_OLD_MEMMAP))
@@ -148,13 +148,13 @@ int __init efi_alloc_page_tables(void)
 
 	pgd = efi_pgd + pgd_index(EFI_VA_END);
 
-	pud = pud_alloc_one(NULL, 0);
-	if (!pud) {
+	p4d = p4d_alloc_one(NULL, 0);
+	if (!p4d) {
 		free_page((unsigned long)efi_pgd);
 		return -ENOMEM;
 	}
 
-	pgd_populate(NULL, pgd, pud);
+	pgd_populate(NULL, pgd, p4d);
 
 	return 0;
 }
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index c7b15f3e2cf3..2aecee939095 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -4,6 +4,7 @@
 
 config XEN
 	bool "Xen guest support"
+	depends on BROKEN
 	depends on PARAVIRT
 	select PARAVIRT_CLOCK
 	select XEN_HAVE_PVMMU
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 16/29] x86: detect 5-level paging support
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2016-12-27  1:53 ` [PATCHv2 15/29] x86: convert the rest of the code to " Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 17/29] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert Kirill A. Shutemov
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

5-level paging support is required from the hardware when the kernel is
compiled with CONFIG_X86_5LEVEL=y. We may implement a runtime switch later.

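For reference, LA57 is enumerated in CPUID.(EAX=07H,ECX=0):ECX[16],
which is what the cpuflags change below stores into word 16. A hedged
userspace sketch of the same check:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int a, b, c, d;

		__cpuid_count(0x00000007, 0, a, b, c, d);
		printf("la57: %s\n", (c & (1U << 16)) ? "yes" : "no");
		return 0;
	}
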
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/cpucheck.c                 |  9 +++++++++
 arch/x86/boot/cpuflags.c                 | 12 ++++++++++--
 arch/x86/include/asm/disabled-features.h |  8 +++++++-
 arch/x86/include/asm/required-features.h |  8 +++++++-
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c
index 4ad7d70e8739..8f0c4c9fc904 100644
--- a/arch/x86/boot/cpucheck.c
+++ b/arch/x86/boot/cpucheck.c
@@ -44,6 +44,15 @@ static const u32 req_flags[NCAPINTS] =
 	0, /* REQUIRED_MASK5 not implemented in this file */
 	REQUIRED_MASK6,
 	0, /* REQUIRED_MASK7 not implemented in this file */
+	0, /* REQUIRED_MASK8 not implemented in this file */
+	0, /* REQUIRED_MASK9 not implemented in this file */
+	0, /* REQUIRED_MASK10 not implemented in this file */
+	0, /* REQUIRED_MASK11 not implemented in this file */
+	0, /* REQUIRED_MASK12 not implemented in this file */
+	0, /* REQUIRED_MASK13 not implemented in this file */
+	0, /* REQUIRED_MASK14 not implemented in this file */
+	0, /* REQUIRED_MASK15 not implemented in this file */
+	REQUIRED_MASK16,
 };
 
 #define A32(a, b, c, d) (((d) << 24)+((c) << 16)+((b) << 8)+(a))
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..9e77c23c2422 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -70,16 +70,19 @@ int has_eflag(unsigned long mask)
 # define EBX_REG "=b"
 #endif
 
-static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
+static inline void cpuid_count(u32 id, u32 count,
+		u32 *a, u32 *b, u32 *c, u32 *d)
 {
 	asm volatile(".ifnc %%ebx,%3 ; movl  %%ebx,%3 ; .endif	\n\t"
 		     "cpuid					\n\t"
 		     ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif	\n\t"
 		    : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
-		    : "a" (id)
+		    : "a" (id), "c" (count)
 	);
 }
 
+#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
+
 void get_cpuflags(void)
 {
 	u32 max_intel_level, max_amd_level;
@@ -108,6 +111,11 @@ void get_cpuflags(void)
 				cpu.model += ((tfms >> 16) & 0xf) << 4;
 		}
 
+		if (max_intel_level >= 0x00000007) {
+			cpuid_count(0x00000007, 0, &ignored, &ignored,
+					&cpu.flags[16], &ignored);
+		}
+
 		cpuid(0x80000000, &max_amd_level, &ignored, &ignored,
 		      &ignored);
 
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad4d024..fc0960236fc3 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -36,6 +36,12 @@
 # define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_X86_5LEVEL
+#define DISABLE_LA57	0
+#else
+#define DISABLE_LA57	(1<<(X86_FEATURE_LA57 & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -55,7 +61,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c0abe9..d91ba04dd007 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -53,6 +53,12 @@
 # define NEED_MOVBE	0
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+# define NEED_LA57	(1<<(X86_FEATURE_LA57 & 31))
+#else
+# define NEED_LA57	0
+#endif
+
 #ifdef CONFIG_X86_64
 #ifdef CONFIG_PARAVIRT
 /* Paravirtualized systems may not have PSE or PGE available */
@@ -98,7 +104,7 @@
 #define REQUIRED_MASK13	0
 #define REQUIRED_MASK14	0
 #define REQUIRED_MASK15	0
-#define REQUIRED_MASK16	0
+#define REQUIRED_MASK16	(NEED_LA57)
 #define REQUIRED_MASK17	0
 #define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 17/29] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 16/29] x86: detect 5-level paging support Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 18/29] x86/mm: define virtual memory map for 5-level paging Kirill A. Shutemov
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

We don't need it anymore: commit 17be0aec74fb ("x86/asm/entry/64:
Implement better check for canonical addresses") made the canonical
address check generic with respect to the address width.

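What the remaining shl/sar pair computes, expressed as a C sketch:
sign-extend the address from bit __VIRTUAL_MASK_SHIFT, so the same two
instructions stay correct for both 47-bit and 56-bit address widths:

	static inline unsigned long sign_extend_va(unsigned long addr, int shift)
	{
		int s = 64 - (shift + 1);	/* 16 for 47-bit, 7 for 56-bit */

		return (unsigned long)(((long)(addr << s)) >> s);
	}

	/* canonical iff sign_extend_va(addr, __VIRTUAL_MASK_SHIFT) == addr */
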
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/entry/entry_64.S | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 5b219707c2f2..a6b84c309685 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -264,12 +264,9 @@ return_from_SYSCALL_64:
 	 *
 	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
+	 *
+	 * Sign-extend from bit __VIRTUAL_MASK_SHIFT to get a canonical address
 	 */
-	.ifne __VIRTUAL_MASK_SHIFT - 47
-	.error "virtual address width changed -- SYSRET checks need update"
-	.endif
-
-	/* Change top 16 bits to be the sign-extension of 47th bit */
 	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 18/29] x86/mm: define virtual memory map for 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 17/29] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 19/29] x86/paravirt: make paravirt code support " Kirill A. Shutemov
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

The first part of the memory map (up to the %esp fixup stacks) simply
scales the existing 4-level paging map up by 9 bits -- the number of bits
addressed by the additional page table level.

The rest of the map is unchanged.

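The arithmetic behind the new constants, for reference:

	__VIRTUAL_MASK_SHIFT = 56  -> 2^56 = 64 PiB of user space, plus
	                              the same on the kernel side: 128 PiB
	__PHYSICAL_MASK_SHIFT = 52 -> 2^52 = 4 PiB of physical address space
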
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/x86_64/mm.txt         | 33 ++++++++++++++++++++++++++++++---
 arch/x86/Kconfig                        |  1 +
 arch/x86/include/asm/kasan.h            |  9 ++++++---
 arch/x86/include/asm/page_64_types.h    | 10 ++++++++++
 arch/x86/include/asm/pgtable_64_types.h |  6 ++++++
 arch/x86/include/asm/sparsemem.h        |  9 +++++++--
 6 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 5724092db811..0303a47b82f8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -4,7 +4,7 @@
 Virtual memory map with 4 level page tables:
 
 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
-hole caused by [48:63] sign extension
+hole caused by [47:63] sign extension
 ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
 ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
 ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
@@ -23,12 +23,39 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
 ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
+Virtual memory map with 5 level page tables:
+
+0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
+hole caused by [56:63] sign extension
+ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
+ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
+ff90000000000000 - ff91ffffffffffff (=49 bits) hole
+ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
+ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
+ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
+... unused hole ...
+ffd8000000000000 - fff7ffffffffffff (=53 bits) kasan shadow memory (8PB)
+... unused hole ...
+fffe000000000000 - fffe007fffffffff (=39 bits) %esp fixup stacks
+... unused hole ...
+ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
+... unused hole ...
+ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
+ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+
+The architecture defines a 64-bit virtual address. Implementations can
+support less. Currently supported are 48- and 57-bit virtual addresses.
+Bits 63 through to the most-significant implemented bit are set to either
+all ones or all zeroes. This causes a hole between user space and kernel addresses.
+
 The direct mapping covers all memory in the system up to the highest
 memory address (this means in some cases it can also include PCI memory
 holes).
 
-vmalloc space is lazily synchronized into the different PML4 pages of
-the processes using the page fault handler, with init_level4_pgt as
+vmalloc space is lazily synchronized into the different PML4/PML5 pages of
+the processes using the page fault handler, with init_top_pgt as
 reference.
 
 Current X86-64 implementations support up to 46 bits of address space (64 TB),
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493bbd47..3d51256a9e61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -285,6 +285,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 config KASAN_SHADOW_OFFSET
 	hex
 	depends on KASAN
+	default 0xdff8000000000000 if X86_5LEVEL
 	default 0xdffffc0000000000
 
 config HAVE_INTEL_TXT
diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
index 1410b567ecde..f527b02a0ee3 100644
--- a/arch/x86/include/asm/kasan.h
+++ b/arch/x86/include/asm/kasan.h
@@ -11,9 +11,12 @@
  * 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT
  */
 #define KASAN_SHADOW_START      (KASAN_SHADOW_OFFSET + \
-					(0xffff800000000000ULL >> 3))
-/* 47 bits for kernel address -> (47 - 3) bits for shadow */
-#define KASAN_SHADOW_END        (KASAN_SHADOW_START + (1ULL << (47 - 3)))
+					((-1UL << __VIRTUAL_MASK_SHIFT) >> 3))
+/*
+ * 47 bits for kernel address -> (47 - 3) bits for shadow
+ * 56 bits for kernel address -> (56 - 3) bits for shadow
+ */
+#define KASAN_SHADOW_END        (KASAN_SHADOW_START + (1ULL << (__VIRTUAL_MASK_SHIFT - 3)))
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 9215e0527647..3f5f08b010d0 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -36,7 +36,12 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen requires.
  */
+#ifdef CONFIG_X86_5LEVEL
+#define __PAGE_OFFSET_BASE      _AC(0xff10000000000000, UL)
+#else
 #define __PAGE_OFFSET_BASE      _AC(0xffff880000000000, UL)
+#endif
+
 #ifdef CONFIG_RANDOMIZE_MEMORY
 #define __PAGE_OFFSET           page_offset_base
 #else
@@ -46,8 +51,13 @@
 #define __START_KERNEL_map	_AC(0xffffffff80000000, UL)
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
+#ifdef CONFIG_X86_5LEVEL
+#define __PHYSICAL_MASK_SHIFT	52
+#define __VIRTUAL_MASK_SHIFT	56
+#else
 #define __PHYSICAL_MASK_SHIFT	46
 #define __VIRTUAL_MASK_SHIFT	47
+#endif
 
 /*
  * Kernel image size is limited to 1GiB due to the fixmap living in the
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 0b2797e5083c..00dc0c2b456e 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -56,9 +56,15 @@ typedef struct { pteval_t pte; } pte_t;
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
 #define MAXMEM		_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+#ifdef CONFIG_X86_5LEVEL
+#define VMALLOC_SIZE_TB _AC(16384, UL)
+#define __VMALLOC_BASE	_AC(0xff92000000000000, UL)
+#define __VMEMMAP_BASE	_AC(0xffd4000000000000, UL)
+#else
 #define VMALLOC_SIZE_TB	_AC(32, UL)
 #define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
 #define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
+#endif
 #ifdef CONFIG_RANDOMIZE_MEMORY
 #define VMALLOC_START	vmalloc_base
 #define VMEMMAP_START	vmemmap_base
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 4517d6b93188..1f5bee2c202f 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -26,8 +26,13 @@
 # endif
 #else /* CONFIG_X86_32 */
 # define SECTION_SIZE_BITS	27 /* matt - 128 is convenient right now */
-# define MAX_PHYSADDR_BITS	44
-# define MAX_PHYSMEM_BITS	46
+# ifdef CONFIG_X86_5LEVEL
+#  define MAX_PHYSADDR_BITS	52
+#  define MAX_PHYSMEM_BITS	52
+# else
+#  define MAX_PHYSADDR_BITS	44
+#  define MAX_PHYSMEM_BITS	46
+# endif
 #endif
 
 #endif /* CONFIG_SPARSEMEM */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 19/29] x86/paravirt: make paravirt code support 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 18/29] x86/mm: define virtual memory map for 5-level paging Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 20/29] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL Kirill A. Shutemov
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Add operations to allocate/release p4ds.

TODO: cover XEN.

Not-yet-Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
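As a stand-alone illustration of the indirection this patch extends (the
struct and types below are mocks, not the kernel's definitions): the new
p4d hooks default to a nop, exactly like the existing pte/pmd/pud hooks,
and a hypervisor backend would override them.

#include <stdio.h>

struct mm_struct;	/* opaque stand-in */

struct pv_mmu_ops_sketch {
	void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn);
	void (*release_p4d)(unsigned long pfn);
};

static void nop_alloc(struct mm_struct *mm, unsigned long pfn) { (void)mm; (void)pfn; }
static void nop_release(unsigned long pfn) { (void)pfn; }

static struct pv_mmu_ops_sketch pv_mmu_ops = {
	.alloc_p4d = nop_alloc,		/* native: nothing to tell a hypervisor */
	.release_p4d = nop_release,
};

int main(void)
{
	pv_mmu_ops.alloc_p4d(NULL, 0x1234);	/* a Xen-style backend hooks here */
	pv_mmu_ops.release_p4d(0x1234);
	puts("p4d hooks dispatched");
	return 0;
}
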
 arch/x86/include/asm/paravirt.h       | 43 +++++++++++++++++++++++++++++++----
 arch/x86/include/asm/paravirt_types.h |  7 +++++-
 arch/x86/include/asm/pgalloc.h        |  1 +
 arch/x86/kernel/paravirt.c            |  9 ++++++--
 4 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 432c6e730ed1..cd8195e6f93a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -357,6 +357,15 @@ static inline void paravirt_release_pud(unsigned long pfn)
 	PVOP_VCALL1(pv_mmu_ops.release_pud, pfn);
 }
 
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn)
+{
+	PVOP_VCALL2(pv_mmu_ops.alloc_p4d, mm, pfn);
+}
+static inline void paravirt_release_p4d(unsigned long pfn)
+{
+	PVOP_VCALL1(pv_mmu_ops.release_p4d, pfn);
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep)
 {
@@ -571,14 +580,35 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 			    val);
 }
 
-static inline void p4d_clear(p4d_t *p4dp)
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+static inline p4d_t __p4d(p4dval_t val)
 {
-	set_p4d(p4dp, __p4d(0));
+	p4dval_t ret;
+
+	if (sizeof(p4dval_t) > sizeof(long))
+		ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.make_p4d,
+				   val, (u64)val >> 32);
+	else
+		ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.make_p4d,
+				   val);
+
+	return (p4d_t) { ret };
 }
 
-#if CONFIG_PGTABLE_LEVELS >= 5
+static inline p4dval_t p4d_val(p4d_t p4d)
+{
+	p4dval_t ret;
 
-#error FIXME
+	if (sizeof(p4dval_t) > sizeof(long))
+		ret =  PVOP_CALLEE2(p4dval_t, pv_mmu_ops.p4d_val,
+				    p4d.p4d, (u64)p4d.p4d >> 32);
+	else
+		ret =  PVOP_CALLEE1(p4dval_t, pv_mmu_ops.p4d_val,
+				    p4d.p4d);
+
+	return ret;
+}
 
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
@@ -599,6 +629,11 @@ static inline void pgd_clear(pgd_t *pgdp)
 
 #endif  /* CONFIG_PGTABLE_LEVELS == 5 */
 
+static inline void p4d_clear(p4d_t *p4dp)
+{
+	set_p4d(p4dp, __p4d(0));
+}
+
 #endif	/* CONFIG_PGTABLE_LEVELS == 4 */
 
 #endif	/* CONFIG_PGTABLE_LEVELS >= 3 */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3982c200845f..2dfbb7cedbaa 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -238,9 +238,11 @@ struct pv_mmu_ops {
 	void (*alloc_pte)(struct mm_struct *mm, unsigned long pfn);
 	void (*alloc_pmd)(struct mm_struct *mm, unsigned long pfn);
 	void (*alloc_pud)(struct mm_struct *mm, unsigned long pfn);
+	void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn);
 	void (*release_pte)(unsigned long pfn);
 	void (*release_pmd)(unsigned long pfn);
 	void (*release_pud)(unsigned long pfn);
+	void (*release_p4d)(unsigned long pfn);
 
 	/* Pagetable manipulation functions */
 	void (*set_pte)(pte_t *ptep, pte_t pteval);
@@ -284,7 +286,10 @@ struct pv_mmu_ops {
 	void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
 
 #if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
+	struct paravirt_callee_save p4d_val;
+	struct paravirt_callee_save make_p4d;
+
+	void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
 #endif	/* CONFIG_PGTABLE_LEVELS >= 5 */
 
 #endif	/* CONFIG_PGTABLE_LEVELS >= 4 */
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 2f585054c63c..8408511dbdd1 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -17,6 +17,7 @@ static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn)	{
 static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
 					    unsigned long start, unsigned long count) {}
 static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn)	{}
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn)	{}
 static inline void paravirt_release_pte(unsigned long pfn) {}
 static inline void paravirt_release_pmd(unsigned long pfn) {}
 static inline void paravirt_release_pud(unsigned long pfn) {}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f8aedc112d5e..974359f83cff 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -405,9 +405,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 	.alloc_pte = paravirt_nop,
 	.alloc_pmd = paravirt_nop,
 	.alloc_pud = paravirt_nop,
+	.alloc_p4d = paravirt_nop,
 	.release_pte = paravirt_nop,
 	.release_pmd = paravirt_nop,
 	.release_pud = paravirt_nop,
+	.release_p4d = paravirt_nop,
 
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
@@ -436,8 +438,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 	.set_p4d = native_set_p4d,
 
 #if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
-#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+	.p4d_val = PTE_IDENT,
+	.make_p4d = PTE_IDENT,
+
+	.set_pgd = native_set_pgd,
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
 #endif /* CONFIG_PGTABLE_LEVELS >= 4 */
 #endif /* CONFIG_PGTABLE_LEVELS >= 3 */
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 20/29] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 19/29] x86/paravirt: make paravirt code support " Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 21/29] x86/dump_pagetables: support 5-level paging Kirill A. Shutemov
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Extend the pagetable headers to support the new paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
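For reference, a stand-alone sketch of how a virtual address decomposes
under the new defines (shift values copied from the hunks below; the
sample address is the 5-level direct-mapping base):

#include <stdio.h>

#define PGDIR_SHIFT	48	/* 5-level value from this patch */
#define P4D_SHIFT	39
#define PUD_SHIFT	30
#define PMD_SHIFT	21
#define PTRS		512	/* 9 bits of index per level */

int main(void)
{
	unsigned long va = 0xff10000000000000UL;

	printf("pgd=%lu p4d=%lu pud=%lu pmd=%lu\n",
	       (va >> PGDIR_SHIFT) & (PTRS - 1),
	       (va >> P4D_SHIFT) & (PTRS - 1),
	       (va >> PUD_SHIFT) & (PTRS - 1),
	       (va >> PMD_SHIFT) & (PTRS - 1));
	return 0;
}
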
 arch/x86/include/asm/pgtable_64.h       | 11 +++++++++++
 arch/x86/include/asm/pgtable_64_types.h | 20 +++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h    | 10 +++++++++-
 arch/x86/mm/pgtable.c                   | 34 ++++++++++++++++++++++++++++++++-
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index dff070a6d27e..2c04ac6f6fe8 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -35,6 +35,13 @@ extern void paging_init(void);
 #define pud_ERROR(e)					\
 	pr_err("%s:%d: bad pud %p(%016lx)\n",		\
 	       __FILE__, __LINE__, &(e), pud_val(e))
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#define p4d_ERROR(e)					\
+	pr_err("%s:%d: bad p4d %p(%016lx)\n",		\
+	       __FILE__, __LINE__, &(e), p4d_val(e))
+#endif
+
 #define pgd_ERROR(e)					\
 	pr_err("%s:%d: bad pgd %p(%016lx)\n",		\
 	       __FILE__, __LINE__, &(e), pgd_val(e))
@@ -113,7 +120,11 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 
 static inline void native_p4d_clear(p4d_t *p4d)
 {
+#ifdef CONFIG_X86_5LEVEL
+	native_set_p4d(p4d, native_make_p4d(0));
+#else
 	native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+#endif
 }
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 00dc0c2b456e..7ae641fdbd07 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -23,12 +23,32 @@ typedef struct { pteval_t pte; } pte_t;
 
 #define SHARED_KERNEL_PMD	0
 
+#ifdef CONFIG_X86_5LEVEL
+
+/*
+ * PGDIR_SHIFT determines what a top-level page table entry can map
+ */
+#define PGDIR_SHIFT	48
+#define PTRS_PER_PGD	512
+
+/*
+ * 4th level page in 5-level paging case
+ */
+#define P4D_SHIFT	39
+#define PTRS_PER_P4D	512
+#define P4D_SIZE	(_AC(1, UL) << P4D_SHIFT)
+#define P4D_MASK	(~(P4D_SIZE - 1))
+
+#else  /* CONFIG_X86_5LEVEL */
+
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
 #define PGDIR_SHIFT	39
 #define PTRS_PER_PGD	512
 
+#endif  /* CONFIG_X86_5LEVEL */
+
 /*
  * 3rd level page
  */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4930afe9df0a..bf9638e1ee42 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,9 +273,17 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
+typedef struct { p4dval_t p4d; } p4d_t;
 
-#error FIXME
+static inline p4d_t native_make_p4d(pudval_t val)
+{
+	return (p4d_t) { val };
+}
 
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+	return p4d.p4d;
+}
 #else
 #include <asm-generic/pgtable-nop4d.h>
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index cc6fcd4040e2..ec2dc0625480 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -81,6 +81,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
 	tlb_remove_page(tlb, virt_to_page(pud));
 }
+
+#if CONFIG_PGTABLE_LEVELS > 4
+void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
+{
+	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
+	tlb_remove_page(tlb, virt_to_page(p4d));
+}
+#endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
@@ -120,7 +128,7 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
 	   references from swapper_pg_dir. */
 	if (CONFIG_PGTABLE_LEVELS == 2 ||
 	    (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-	    CONFIG_PGTABLE_LEVELS == 4) {
+	    CONFIG_PGTABLE_LEVELS >= 4) {
 		clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
 				swapper_pg_dir + KERNEL_PGD_BOUNDARY,
 				KERNEL_PGD_PTRS);
@@ -551,6 +559,30 @@ void native_set_fixmap(enum fixed_addresses idx, phys_addr_t phys,
 }
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#ifdef CONFIG_X86_5LEVEL
+/**
+ * p4d_set_huge - setup kernel P4D mapping
+ *
+ * No 512GB pages yet -- always return 0
+ *
+ * Returns 1 on success and 0 on failure.
+ */
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+
+/**
+ * p4d_clear_huge - clear kernel P4D mapping when it is set
+ *
+ * No 512GB pages yet -- always return 0
+ */
+int p4d_clear_huge(p4d_t *p4d)
+{
+	return 0;
+}
+#endif
+
 /**
  * pud_set_huge - setup kernel PUD mapping
  *
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 21/29] x86/dump_pagetables: support 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 20/29] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 22/29] x86/mm: extend kasan to " Kirill A. Shutemov
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Simple extension to support one more page table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
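A stand-alone sketch (mock tables, no seq_file plumbing) of where the
new hop sits in the dump walk: pgd -> p4d -> pud -> pmd -> pte. Each
entry is either 0 (none) or a pointer to the next-level table.

#include <stdint.h>
#include <stdio.h>

#define PTRS 512

static void walk(uint64_t *table, int level, uint64_t va)
{
	int i;

	for (i = 0; i < PTRS; i++) {
		uint64_t entry = table[i];
		uint64_t addr = va | ((uint64_t)i << (12 + 9 * level));

		if (!entry)
			continue;	/* note_page(__pgprot(0)) in the real code */
		if (level == 0)
			printf("va %llx -> %llx\n", (unsigned long long)addr,
			       (unsigned long long)entry);
		else
			walk((uint64_t *)(uintptr_t)entry, level - 1, addr);
	}
}

int main(void)
{
	static uint64_t pgd[PTRS], p4d[PTRS], pud[PTRS], pmd[PTRS], pte[PTRS];

	pgd[0] = (uint64_t)(uintptr_t)p4d;	/* the new hop */
	p4d[0] = (uint64_t)(uintptr_t)pud;
	pud[0] = (uint64_t)(uintptr_t)pmd;
	pmd[0] = (uint64_t)(uintptr_t)pte;
	pte[7] = 0xdeadbeef;

	walk(pgd, 4, 0);	/* 4 hops below the pgd with 5-level paging */
	return 0;
}
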
 arch/x86/mm/dump_pagetables.c | 49 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ea9c49adaa1f..15670b55861b 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -100,7 +100,8 @@ static struct addr_marker address_markers[] = {
 #define PTE_LEVEL_MULT (PAGE_SIZE)
 #define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT)
 #define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT)
-#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define P4D_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define PGD_LEVEL_MULT (PTRS_PER_PUD * P4D_LEVEL_MULT)
 
 #define pt_dump_seq_printf(m, to_dmesg, fmt, args...)		\
 ({								\
@@ -326,14 +327,14 @@ static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
 
 #if PTRS_PER_PUD > 1
 
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
 							unsigned long P)
 {
 	int i;
 	pud_t *start;
 	pgprotval_t prot;
 
-	start = (pud_t *) pgd_page_vaddr(addr);
+	start = (pud_t *) p4d_page_vaddr(addr);
 
 	for (i = 0; i < PTRS_PER_PUD; i++) {
 		st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
@@ -353,9 +354,43 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
 }
 
 #else
-#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(pgd_val(a)),p)
-#define pgd_large(a) pud_large(__pud(pgd_val(a)))
-#define pgd_none(a)  pud_none(__pud(pgd_val(a)))
+#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(p4d_val(a)),p)
+#define p4d_large(a) pud_large(__pud(p4d_val(a)))
+#define p4d_none(a)  pud_none(__pud(p4d_val(a)))
+#endif
+
+#if PTRS_PER_P4D > 1
+
+static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+							unsigned long P)
+{
+	int i;
+	p4d_t *start;
+	pgprotval_t prot;
+
+	start = (p4d_t *) pgd_page_vaddr(addr);
+
+	for (i = 0; i < PTRS_PER_P4D; i++) {
+		st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
+		if (!p4d_none(*start)) {
+			if (p4d_large(*start) || !p4d_present(*start)) {
+				prot = p4d_flags(*start);
+				note_page(m, st, __pgprot(prot), 2);
+			} else {
+				walk_pud_level(m, st, *start,
+					       P + i * P4D_LEVEL_MULT);
+			}
+		} else
+			note_page(m, st, __pgprot(0), 2);
+
+		start++;
+	}
+}
+
+#else
+#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p)
+#define pgd_large(a) p4d_large(__p4d(pgd_val(a)))
+#define pgd_none(a)  p4d_none(__p4d(pgd_val(a)))
 #endif
 
 static inline bool is_hypervisor_range(int idx)
@@ -400,7 +435,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				prot = pgd_flags(*start);
 				note_page(m, &st, __pgprot(prot), 1);
 			} else {
-				walk_pud_level(m, &st, *start,
+				walk_p4d_level(m, &st, *start,
 					       i * PGD_LEVEL_MULT);
 			}
 		} else
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 22/29] x86/mm: extend kasan to support 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 21/29] x86/dump_pagetables: support 5-level paging Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 23/29] x86/espfix: " Kirill A. Shutemov
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch brings support for a non-folded additional page table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
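For reference, a stand-alone sketch (not kernel code) of the
address-to-shadow mapping these tables back. The X86_5LEVEL offset is
taken from the Kconfig hunk earlier in the series; the >>3 scale is
KASAN's usual KASAN_SHADOW_SCALE_SHIFT.

#include <stdio.h>

#define KASAN_SHADOW_OFFSET	0xdff8000000000000UL	/* X86_5LEVEL value */

static unsigned long mem_to_shadow(unsigned long addr)
{
	return (addr >> 3) + KASAN_SHADOW_OFFSET;
}

int main(void)
{
	/* Start of kernel space with 57-bit addresses -> start of shadow,
	 * matching the ffd8000000000000 line in the new mm.txt map. */
	printf("%lx\n", mem_to_shadow(0xff00000000000000UL));
	return 0;
}
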
 arch/x86/mm/kasan_init_64.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 2964de48e177..e37504e94e8f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -50,8 +50,18 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
 	unsigned long end = KASAN_SHADOW_END;
 
 	for (i = pgd_index(start); start < end; i++) {
-		pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud)
-				| _KERNPG_TABLE);
+		switch (CONFIG_PGTABLE_LEVELS) {
+		case 4:
+			pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud) |
+					_KERNPG_TABLE);
+			break;
+		case 5:
+			pgd[i] = __pgd(__pa_nodebug(kasan_zero_p4d) |
+					_KERNPG_TABLE);
+			break;
+		default:
+			BUILD_BUG();
+		}
 		start += PGDIR_SIZE;
 	}
 }
@@ -79,6 +89,7 @@ void __init kasan_early_init(void)
 	pteval_t pte_val = __pa_nodebug(kasan_zero_page) | __PAGE_KERNEL;
 	pmdval_t pmd_val = __pa_nodebug(kasan_zero_pte) | _KERNPG_TABLE;
 	pudval_t pud_val = __pa_nodebug(kasan_zero_pmd) | _KERNPG_TABLE;
+	p4dval_t p4d_val = __pa_nodebug(kasan_zero_pud) | _KERNPG_TABLE;
 
 	for (i = 0; i < PTRS_PER_PTE; i++)
 		kasan_zero_pte[i] = __pte(pte_val);
@@ -89,6 +100,9 @@ void __init kasan_early_init(void)
 	for (i = 0; i < PTRS_PER_PUD; i++)
 		kasan_zero_pud[i] = __pud(pud_val);
 
+	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
+		kasan_zero_p4d[i] = __p4d(p4d_val);
+
 	kasan_map_early_shadow(early_level4_pgt);
 	kasan_map_early_shadow(init_level4_pgt);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 23/29] x86/espfix: support 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 22/29] x86/mm: extend kasan to " Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 24/29] x86/mm: add support of additional page table level during early boot Kirill A. Shutemov
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

We don't need extra virtual address space for ESPFIX, so it stays within
one PUD page table for both 4- and 5-level paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
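A quick stand-alone check of the arithmetic (PAGE_SHIFT = 12): with
4-level paging PGDIR_SHIFT is 39, and with 5-level paging P4D_SHIFT is
also 39, so switching the macro keeps ESPFIX_PAGE_SPACE identical in
both modes.

#include <stdio.h>

#define P4D_SHIFT	39
#define PAGE_SHIFT	12

int main(void)
{
	/* Same 2048 pages of espfix address space in both paging modes. */
	printf("espfix pages: %lu\n", 1UL << (P4D_SHIFT - PAGE_SHIFT - 16));
	return 0;
}
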
 arch/x86/kernel/espfix_64.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 04f89caef9c4..8e598a1ad986 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -50,11 +50,11 @@
 #define ESPFIX_STACKS_PER_PAGE	(PAGE_SIZE/ESPFIX_STACK_SIZE)
 
 /* There is address space for how many espfix pages? */
-#define ESPFIX_PAGE_SPACE	(1UL << (PGDIR_SHIFT-PAGE_SHIFT-16))
+#define ESPFIX_PAGE_SPACE	(1UL << (P4D_SHIFT-PAGE_SHIFT-16))
 
 #define ESPFIX_MAX_CPUS		(ESPFIX_STACKS_PER_PAGE * ESPFIX_PAGE_SPACE)
 #if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
-# error "Need more than one PGD for the ESPFIX hack"
+# error "Need more virtual address space for the ESPFIX hack"
 #endif
 
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
@@ -121,11 +121,13 @@ static void init_espfix_random(void)
 
 void __init init_espfix_bsp(void)
 {
-	pgd_t *pgd_p;
+	pgd_t *pgd;
+	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
-	pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
+	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
 	/* Randomize the locations */
 	init_espfix_random();
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 24/29] x86/mm: add support of additional page table level during early boot
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 23/29] x86/espfix: " Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 25/29] x86/mm: add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with a
compile-time switch between them.

TODO: XEN support is missing

Not-yet-Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
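To make the decompressor stub below easier to follow, here is a
stand-alone C rendering of the table chain it builds, simplified to one
level-2 page (the real stub fills four, covering 4 GiB). The 0x1007
constants are the offset to the next 4K table OR'ed with 0x7
(present | writable | user).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* top (pgd), level4, level3, level2 -- one 4K page each (5-level). */
	static uint64_t pgt[4][512];
	int i;

	for (i = 0; i < 3; i++)		/* link each level to the next */
		pgt[i][0] = (uint64_t)(uintptr_t)pgt[i + 1] | 0x7;

	for (i = 0; i < 512; i++)	/* 2M pages: G | PS | RW | P */
		pgt[3][i] = ((uint64_t)i << 21) | 0x183;

	printf("top[0]=%llx\n", (unsigned long long)pgt[0][0]);
	return 0;
}
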
 arch/x86/boot/compressed/head_64.S          | 23 ++++++++++--
 arch/x86/include/asm/pgtable.h              |  2 +-
 arch/x86/include/asm/pgtable_64.h           |  6 ++-
 arch/x86/include/uapi/asm/processor-flags.h |  2 +
 arch/x86/kernel/espfix_64.c                 |  2 +-
 arch/x86/kernel/head64.c                    | 40 ++++++++++++++------
 arch/x86/kernel/head_64.S                   | 58 ++++++++++++++++++++++-------
 arch/x86/kernel/machine_kexec_64.c          |  2 +-
 arch/x86/mm/dump_pagetables.c               |  2 +-
 arch/x86/mm/kasan_init_64.c                 | 12 +++---
 arch/x86/realmode/init.c                    |  2 +-
 11 files changed, 110 insertions(+), 41 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 4d85e600db78..f49fe6b84450 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
  /*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x00000183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 398adab9a167..8992f0a9ea3a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -813,7 +813,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
 	/* Default trampoline pgd value */
-	trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 2c04ac6f6fe8..8c64cf952a07 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,15 +14,17 @@
 #include <linux/bitops.h>
 #include <linux/threads.h>
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
 	p4d_t *p4d;
 
 	/* Install the espfix pud into the kernel page directory */
-	pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 54a2372f5dbb..f32d22986f47 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -32,7 +32,7 @@
 /*
  * Manage page tables very early on.
  */
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt = 2;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
@@ -40,9 +40,9 @@ pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 /* Wipe all early page tables except for the kernel symbol map */
 static void __init reset_early_page_tables(void)
 {
-	memset(early_level4_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
+	memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
 	next_early_pgt = 0;
-	write_cr3(__pa_nodebug(early_level4_pgt));
+	write_cr3(__pa_nodebug(early_top_pgt));
 }
 
 /* Create a new PMD entry */
@@ -50,15 +50,16 @@ int __init early_make_pgtable(unsigned long address)
 {
 	unsigned long physaddr = address - __PAGE_OFFSET;
 	pgdval_t pgd, *pgd_p;
+	p4dval_t p4d, *p4d_p;
 	pudval_t pud, *pud_p;
 	pmdval_t pmd, *pmd_p;
 
 	/* Invalid address or early pgt is done ?  */
-	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_level4_pgt))
+	if (physaddr >= MAXMEM || read_cr3() != __pa_nodebug(early_top_pgt))
 		return -1;
 
 again:
-	pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
+	pgd_p = &early_top_pgt[pgd_index(address)].pgd;
 	pgd = *pgd_p;
 
 	/*
@@ -66,8 +67,25 @@ int __init early_make_pgtable(unsigned long address)
 	 * critical -- __PAGE_OFFSET would point us back into the dynamic
 	 * range and we might end up looping forever...
 	 */
-	if (pgd)
-		pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		p4d_p = pgd_p;
+	else if (pgd)
+		p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+	else {
+		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+			reset_early_page_tables();
+			goto again;
+		}
+
+		p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+		memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+		*pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+	}
+	p4d_p += p4d_index(address);
+	p4d = *p4d_p;
+
+	if (p4d)
+		pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
 	else {
 		if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
 			reset_early_page_tables();
@@ -76,7 +94,7 @@ int __init early_make_pgtable(unsigned long address)
 
 		pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
 		memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
-		*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+		*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
 	}
 	pud_p += pud_index(address);
 	pud = *pud_p;
@@ -155,7 +173,7 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 
 	clear_bss();
 
-	clear_page(init_level4_pgt);
+	clear_page(init_top_pgt);
 
 	kasan_early_init();
 
@@ -170,8 +188,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
 	 */
 	load_ucode_bsp();
 
-	/* set init_level4_pgt kernel high mapping*/
-	init_level4_pgt[511] = early_level4_pgt[511];
+	/* set init_top_pgt kernel high mapping*/
+	init_top_pgt[511] = early_top_pgt[511];
 
 	x86_64_start_reservations(real_mode_data);
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b467b14b03eb..8ac1e002c7d0 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,10 +37,14 @@
  *
  */
 
+#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
+PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#ifdef CONFIG_X86_5LEVEL
+L4_START_KERNEL = p4d_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -93,7 +97,11 @@ startup_64:
 	/*
 	 * Fixup the physical addresses in the page table
 	 */
-	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
+	addq	%rbp, early_top_pgt + (PGD_START_KERNEL*8)(%rip)
+
+#ifdef CONFIG_X86_5LEVEL
+	addq	%rbp, level4_kernel_pgt + (511*8)(%rip)
+#endif
 
 	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
 	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
@@ -107,7 +115,7 @@ startup_64:
 	 * it avoids problems around wraparound.
 	 */
 	leaq	_text(%rip), %rdi
-	leaq	early_level4_pgt(%rip), %rbx
+	leaq	early_top_pgt(%rip), %rbx
 
 	movq	%rdi, %rax
 	shrq	$PGDIR_SHIFT, %rax
@@ -116,16 +124,26 @@ startup_64:
 	movq	%rdx, 0(%rbx,%rax,8)
 	movq	%rdx, 8(%rbx,%rax,8)
 
+#ifdef CONFIG_X86_5LEVEL
+	addq	$PAGE_SIZE, %rbx
+	addq	$PAGE_SIZE, %rdx
+	movq	%rdi, %rax
+	shrq	$P4D_SHIFT, %rax
+	andl	$(PTRS_PER_P4D-1), %eax
+	movq	%rdx, 0(%rbx,%rax,8)
+#endif
+
+	addq	$PAGE_SIZE, %rbx
 	addq	$PAGE_SIZE, %rdx
 	movq	%rdi, %rax
 	shrq	$PUD_SHIFT, %rax
 	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
+	movq	%rdx, 0(%rbx,%rax,8)
 	incl	%eax
 	andl	$(PTRS_PER_PUD-1), %eax
-	movq	%rdx, PAGE_SIZE(%rbx,%rax,8)
+	movq	%rdx, 0(%rbx,%rax,8)
 
-	addq	$PAGE_SIZE * 2, %rbx
+	addq	$PAGE_SIZE, %rbx
 	movq	%rdi, %rax
 	shrq	$PMD_SHIFT, %rdi
 	addq	$(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL), %rax
@@ -166,7 +184,7 @@ startup_64:
 	addq	%rbp, phys_base(%rip)
 
 .Lskip_fixup:
-	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(early_top_pgt - __START_KERNEL_map), %rax
 	jmp 1f
 ENTRY(secondary_startup_64)
 	/*
@@ -186,14 +204,17 @@ ENTRY(secondary_startup_64)
 	/* Sanitize CPU configuration */
 	call verify_cpu
 
-	movq	$(init_level4_pgt - __START_KERNEL_map), %rax
+	movq	$(init_top_pgt - __START_KERNEL_map), %rax
 1:
 
-	/* Enable PAE mode and PGE */
+	/* Enable PAE mode, PGE and LA57 */
 	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %ecx
+#endif
 	movq	%rcx, %cr4
 
-	/* Setup early boot stage 4 level pagetables. */
+	/* Setup early boot stage 4-/5-level pagetables. */
 	addq	phys_base(%rip), %rax
 	movq	%rax, %cr3
 
@@ -419,9 +440,13 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_level4_pgt)
+NEXT_PAGE(early_top_pgt)
 	.fill	511,8,0
+#ifdef CONFIG_X86_5LEVEL
+	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -429,9 +454,10 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_level4_pgt)
+NEXT_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
+#error FIXME
 NEXT_PAGE(init_level4_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
 	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
@@ -450,6 +476,12 @@ NEXT_PAGE(level2_ident_pgt)
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+NEXT_PAGE(level4_kernel_pgt)
+	.fill	511,8,0
+	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#endif
+
 NEXT_PAGE(level3_kernel_pgt)
 	.fill	L3_START_KERNEL,8,0
 	/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c325967de4bc..0fe0dc2a85dc 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -339,7 +339,7 @@ void machine_kexec(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 	VMCOREINFO_NUMBER(phys_base);
-	VMCOREINFO_SYMBOL(init_level4_pgt);
+	VMCOREINFO_SYMBOL(init_top_pgt);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 15670b55861b..495ab353d576 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -411,7 +411,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx)
 {
 #ifdef CONFIG_X86_64
-	pgd_t *start = (pgd_t *) &init_level4_pgt;
+	pgd_t *start = (pgd_t *) &init_top_pgt;
 #else
 	pgd_t *start = swapper_pg_dir;
 #endif
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index e37504e94e8f..2d754cc4e02f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -9,7 +9,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 
-extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pgd_t early_top_pgt[PTRS_PER_PGD];
 extern struct range pfn_mapped[E820_X_MAX];
 
 static int __init map_range(struct range *range)
@@ -103,8 +103,8 @@ void __init kasan_early_init(void)
 	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
-	kasan_map_early_shadow(early_level4_pgt);
-	kasan_map_early_shadow(init_level4_pgt);
+	kasan_map_early_shadow(early_top_pgt);
+	kasan_map_early_shadow(init_top_pgt);
 }
 
 void __init kasan_init(void)
@@ -115,8 +115,8 @@ void __init kasan_init(void)
 	register_die_notifier(&kasan_die_notifier);
 #endif
 
-	memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
-	load_cr3(early_level4_pgt);
+	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -142,7 +142,7 @@ void __init kasan_init(void)
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
 			(void *)KASAN_SHADOW_END);
 
-	load_cr3(init_level4_pgt);
+	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
 	/*
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 5db706f14111..dc0836d5c5eb 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -102,7 +102,7 @@ static void __init setup_real_mode(void)
 
 	trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
 	trampoline_pgd[0] = trampoline_pgd_entry.pgd;
-	trampoline_pgd[511] = init_level4_pgt[511].pgd;
+	trampoline_pgd[511] = init_top_pgt[511].pgd;
 #endif
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 25/29] x86/mm: add sync_global_pgds() for configuration with 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 24/29] x86/mm: add support of additional page table level during early boot Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 26/29] x86/mm: make kernel_physical_mapping_init() support " Kirill A. Shutemov
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

This basically restores a slightly modified version of the original
sync_global_pgds() which we had before the folded p4d was introduced.

The only modification is protection against 'address' overflow; see the
demonstration after the tearline.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
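The overflow guard matters because the iteration step can wrap past zero
at the very top of the address space; a stand-alone demonstration:

#include <stdio.h>

int main(void)
{
	unsigned long pgdir_size = 1UL << 48;	/* PGDIR_SIZE, 5-level */
	unsigned long address = 0xffff000000000000UL;

	/* Wraps to 0: the "address >= start" check breaks the loop. */
	address += pgdir_size;
	printf("%lx\n", address);
	return 0;
}
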
 arch/x86/mm/init_64.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 72527ece6130..72f99e837ec2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,42 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+	unsigned long address;
+
+	for (address = start; address <= end && address >= start;
+			address += PGDIR_SIZE) {
+		const pgd_t *pgd_ref = pgd_offset_k(address);
+		struct page *page;
+
+		if (pgd_none(*pgd_ref))
+			continue;
+
+		spin_lock(&pgd_lock);
+		list_for_each_entry(page, &pgd_list, lru) {
+			pgd_t *pgd;
+			spinlock_t *pgt_lock;
+
+			pgd = (pgd_t *)page_address(page) + pgd_index(address);
+			/* the pgt_lock only for Xen */
+			pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+			spin_lock(pgt_lock);
+
+			if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+				BUG_ON(pgd_page_vaddr(*pgd)
+						!= pgd_page_vaddr(*pgd_ref));
+
+			if (pgd_none(*pgd))
+				set_pgd(pgd, *pgd_ref);
+
+			spin_unlock(pgt_lock);
+		}
+		spin_unlock(&pgd_lock);
+	}
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
 	unsigned long address;
@@ -135,6 +171,7 @@ void sync_global_pgds(unsigned long start, unsigned long end)
 		spin_unlock(&pgd_lock);
 	}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 26/29] x86/mm: make kernel_physical_mapping_init() support 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 25/29] x86/mm: add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 27/29] x86/mm: add support for 5-level paging for KASLR Kirill A. Shutemov
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Properly populate the additional page table level if CONFIG_X86_5LEVEL
is enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/init_64.c | 71 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 72f99e837ec2..ac0a4048efac 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -611,6 +611,58 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 	return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+	      unsigned long page_size_mask)
+{
+	unsigned long paddr_next, paddr_last = paddr_end;
+	unsigned long vaddr = (unsigned long)__va(paddr);
+	int i = p4d_index(vaddr);
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask);
+
+	for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d;
+		pud_t *pud;
+
+		vaddr = (unsigned long)__va(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		if (paddr >= paddr_end) {
+			if (!after_bootmem &&
+			    !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+					     E820_RAM) &&
+			    !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+					     E820_RESERVED_KERN)) {
+				set_p4d(p4d, __p4d(0));
+			}
+			continue;
+		}
+
+		if (!p4d_none(*p4d)) {
+			pud = pud_offset(p4d, 0);
+			paddr_last = phys_pud_init(pud, paddr,
+					paddr_end,
+					page_size_mask);
+			__flush_tlb_all();
+			continue;
+		}
+
+		pud = alloc_low_page();
+		paddr_last = phys_pud_init(pud, paddr, paddr_end,
+					   page_size_mask);
+
+		spin_lock(&init_mm.page_table_lock);
+		p4d_populate(&init_mm, p4d, pud);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	__flush_tlb_all();
+
+	return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD level
@@ -632,26 +684,27 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	for (; vaddr < vaddr_end; vaddr = vaddr_next) {
 		pgd_t *pgd = pgd_offset_k(vaddr);
 		p4d_t *p4d;
-		pud_t *pud;
 
 		vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-		BUILD_BUG_ON(pgd_none(*pgd));
-		p4d = p4d_offset(pgd, vaddr);
-		if (p4d_val(*p4d)) {
-			pud = (pud_t *)p4d_page_vaddr(*p4d);
-			paddr_last = phys_pud_init(pud, __pa(vaddr),
+		if (pgd_val(*pgd)) {
+			p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+			paddr_last = phys_p4d_init(p4d, __pa(vaddr),
 						   __pa(vaddr_end),
 						   page_size_mask);
 			continue;
 		}
 
-		pud = alloc_low_page();
-		paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+		p4d = alloc_low_page();
+		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
-		p4d_populate(&init_mm, p4d, pud);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			pgd_populate(&init_mm, pgd, p4d);
+		else
+			p4d_populate(&init_mm, p4d_offset(pgd, vaddr),
+					(pud_t *) p4d);
 		spin_unlock(&init_mm.page_table_lock);
 		pgd_changed = true;
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 27/29] x86/mm: add support for 5-level paging for KASLR
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 26/29] x86/mm: make kernel_physical_mapping_init() support " Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [PATCHv2 28/29] x86: enable 5-level paging support Kirill A. Shutemov
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

With 5-level paging, randomization happens at the P4D level instead of
the PUD level.

The maximum amount of physical memory is also bumped to 52 bits for
5-level paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
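A quick stand-alone check of the new direct-mapping region maximum
(TB_SHIFT = 40), which replaces the hardcoded 64 below:

#include <stdio.h>

int main(void)
{
	/* size_tb = 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) */
	printf("4-level: %d TiB\n", 1 << (46 - 40));	/* 64, the old constant */
	printf("5-level: %d TiB\n", 1 << (52 - 40));	/* 4096 TiB = 4 PiB */
	return 0;
}
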
 arch/x86/mm/kaslr.c | 82 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 887e57182716..662e5c4b21c8 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,8 @@ static __initdata struct kaslr_memory_region {
 	unsigned long *base;
 	unsigned long size_tb;
 } kaslr_regions[] = {
-	{ &page_offset_base, 64/* Maximum */ },
+	{ &page_offset_base,
+		1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
 	{ &vmalloc_base, VMALLOC_SIZE_TB },
 	{ &vmemmap_base, 1 },
 };
@@ -142,7 +143,10 @@ void __init kernel_randomize_memory(void)
 		 */
 		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
 		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-		entropy = (rand % (entropy + 1)) & PUD_MASK;
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			entropy = (rand % (entropy + 1)) & P4D_MASK;
+		else
+			entropy = (rand % (entropy + 1)) & PUD_MASK;
 		vaddr += entropy;
 		*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +155,21 @@ void __init kernel_randomize_memory(void)
 		 * randomization alignment.
 		 */
 		vaddr += get_padding(&kaslr_regions[i]);
-		vaddr = round_up(vaddr + 1, PUD_SIZE);
+		if (IS_ENABLED(CONFIG_X86_5LEVEL))
+			vaddr = round_up(vaddr + 1, P4D_SIZE);
+		else
+			vaddr = round_up(vaddr + 1, PUD_SIZE);
 		remain_entropy -= entropy;
 	}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
 	unsigned long paddr, paddr_next;
 	pgd_t *pgd;
 	pud_t *pud_page, *pud_page_tramp;
 	int i;
 
-	if (!kaslr_memory_enabled()) {
-		init_trampoline_default();
-		return;
-	}
-
 	pud_page_tramp = alloc_low_page();
 
 	paddr = 0;
@@ -192,3 +190,49 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+	unsigned long paddr, paddr_next;
+	pgd_t *pgd;
+	p4d_t *p4d_page, *p4d_page_tramp;
+	int i;
+
+	p4d_page_tramp = alloc_low_page();
+
+	paddr = 0;
+	pgd = pgd_offset_k((unsigned long)__va(paddr));
+	p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+	for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+		p4d_t *p4d, *p4d_tramp;
+		unsigned long vaddr = (unsigned long)__va(paddr);
+
+		p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+		p4d = p4d_page + p4d_index(vaddr);
+		paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+		*p4d_tramp = *p4d;
+	}
+
+	set_pgd(&trampoline_pgd_entry,
+		__pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+	if (!kaslr_memory_enabled()) {
+		init_trampoline_default();
+		return;
+	}
+
+	if (IS_ENABLED(CONFIG_X86_5LEVEL))
+		init_trampoline_p4d();
+	else
+		init_trampoline_pud();
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCHv2 28/29] x86: enable 5-level paging support
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (26 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 27/29] x86/mm: add support for 5-level paging for KASLR Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
  2017-01-05 16:57 ` [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

Most things are in place and we can enable support for 5-level paging.

MPX for 5-level paging will be enabled with a separate patchset.
It requires changes to GCC components which are not ready yet.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3d51256a9e61..07cc4f27ca41 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -315,6 +315,7 @@ config DEBUG_RODATA
 
 config PGTABLE_LEVELS
 	int
+	default 5 if X86_5LEVEL
 	default 4 if X86_64
 	default 3 if X86_PAE
 	default 2
@@ -1379,6 +1380,10 @@ config X86_PAE
 	  has the cost of more pagetable lookup overhead, and also
 	  consumes more pagetable space per process.
 
+config X86_5LEVEL
+	bool "Enable 5-level page tables support"
+	depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
 	def_bool y
 	depends on X86_64 || X86_PAE
@@ -1737,6 +1742,7 @@ config X86_SMAP
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
+	depends on !X86_5LEVEL
 	depends on CPU_SUP_INTEL
 	---help---
 	  MPX provides hardware features that can be used in
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (27 preceding siblings ...)
  2016-12-27  1:54 ` [PATCHv2 28/29] x86: enable 5-level paging support Kirill A. Shutemov
@ 2016-12-27  1:54 ` Kirill A. Shutemov
  2016-12-27  2:06   ` Andy Lutomirski
                     ` (2 more replies)
  2017-01-05 16:57 ` [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
  29 siblings, 3 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  1:54 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, linux-api

This patch introduces a new rlimit resource to manage the maximum
virtual address available for userspace to map.

On x86, 5-level paging enables a 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use high bits in pointers to encode their
own information. That collides with valid pointers under 5-level paging
and leads to crashes.

The patch aims to address this compatibility issue.

MM would use min(RLIMIT_VADDR, TASK_SIZE) as the upper limit of the
virtual address space available for userspace to map.

The default hard limit will be RLIM_INFINITY, which basically means that
TASK_SIZE limits the available address space.

The soft limit will also be RLIM_INFINITY everywhere, except on machines
with 5-level paging enabled. In that case, the soft limit would be
(1UL << 47) - PAGE_SIZE. That is the current x86-64 TASK_SIZE_MAX with
4-level paging, which is known to be safe.

The new rlimit resource would follow the usual semantics with regard to
inheritance: it is preserved across fork(2) and exec(2). This has the
potential to break an application if the limits are set too wide or too
narrow, but that is not uncommon for other resources (consider
RLIMIT_DATA or RLIMIT_AS).

As with other resources, you can set the limit lower than the current
usage; it would affect only future virtual address space allocations.

Use-cases for new rlimit:

  - Bumping the soft limit to RLIM_INFINITY allows the current process
    and all its children to use addresses above 47 bits.

  - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
    exec(2), allows only the child to use addresses above 47 bits (see
    the sketch after this list).

  - Lowering the hard limit to 47 bits would prevent the current process
    and all its children from using addresses above 47 bits, unless a
    process has CAP_SYS_RESOURCE.

  - It can also be handy to lower the hard or soft limit to an arbitrary
    address. User-mode emulation in QEMU may lower the limit to 32 bits
    to emulate a 32-bit machine on a 64-bit host.
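
A minimal userspace sketch of the second use-case (illustrative only;
it assumes the RLIMIT_VADDR constant proposed by this patch, value 16,
which is not yet in <sys/resource.h>):

	#include <sys/resource.h>
	#include <unistd.h>

	#ifndef RLIMIT_VADDR
	#define RLIMIT_VADDR	16	/* value proposed in this patch */
	#endif

	/* Let the child, and only the child, use the wide VA space. */
	static void exec_with_wide_va(char *const argv[])
	{
		if (fork() == 0) {
			/* The hard limit defaults to RLIM_INFINITY, so
			 * raising the soft limit needs no privilege. */
			struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

			setrlimit(RLIMIT_VADDR, &r);
			execvp(argv[0], argv);
			_exit(127);
		}
	}

The QEMU case works the same way in the other direction: set both
fields to, say, (1UL << 32) before exec(2) of the emulated binary.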

TODO:
  - port to non-x86;

Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/include/asm/elf.h          |  2 +-
 arch/x86/include/asm/processor.h    | 17 ++++++++++++-----
 arch/x86/kernel/sys_x86_64.c        |  6 +++---
 arch/x86/mm/hugetlbpage.c           |  8 ++++----
 arch/x86/mm/mmap.c                  |  4 ++--
 fs/binfmt_aout.c                    |  2 --
 fs/binfmt_elf.c                     | 10 +++++-----
 fs/hugetlbfs/inode.c                |  6 +++---
 fs/proc/base.c                      |  1 +
 include/asm-generic/resource.h      |  4 ++++
 include/linux/sched.h               |  5 +++++
 include/uapi/asm-generic/resource.h |  3 ++-
 kernel/events/uprobes.c             |  5 +++--
 kernel/sys.c                        |  6 +++---
 mm/mmap.c                           | 20 +++++++++++---------
 mm/mremap.c                         |  3 ++-
 mm/nommu.c                          |  2 +-
 mm/shmem.c                          |  8 ++++----
 18 files changed, 66 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e7f155c3045e..5ce6f2b2b105 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
    the loader.  We need to make sure that it is out of the way of the program
    that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE		(mmap_max_addr() / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eaf100508c36..e02917126859 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -770,8 +770,8 @@ static inline void spin_lock_prefetch(const void *x)
  */
 #define TASK_SIZE		PAGE_OFFSET
 #define TASK_SIZE_MAX		TASK_SIZE
-#define STACK_TOP		TASK_SIZE
-#define STACK_TOP_MAX		STACK_TOP
+#define STACK_TOP		mmap_max_addr()
+#define STACK_TOP_MAX		TASK_SIZE
 
 #define INIT_THREAD  {							  \
 	.sp0			= TOP_OF_INIT_STACK,			  \
@@ -809,7 +809,14 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX	((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+/*
+ * Default limit on maximum virtual address. This is required for
+ * compatibility with applications that assumes 47-bit VA.
+ * The limit be overrided with setrlimit(2).
+ */
+#define USER_VADDR_LIM	((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -822,7 +829,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
 					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP		TASK_SIZE
+#define STACK_TOP		mmap_max_addr()
 #define STACK_TOP_MAX		TASK_SIZE_MAX
 
 #define INIT_THREAD  {						\
@@ -844,7 +851,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(mmap_max_addr() / 3))
 
 #define KSTK_EIP(task)		(task_pt_regs(task)->ip)
 
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..e31f5b0c5468 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -115,7 +115,7 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 		}
 	} else {
 		*begin = current->mm->mmap_legacy_base;
-		*end = TASK_SIZE;
+		*end = mmap_max_addr();
 	}
 }
 
@@ -168,7 +168,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	struct vm_unmapped_area_info info;
 
 	/* requested length too big for entire address space */
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -182,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 2ae8584b44c7..b55b04b82097 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -82,7 +82,7 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = current->mm->mmap_legacy_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
@@ -114,7 +114,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = mmap_max_addr();
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -131,7 +131,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
@@ -143,7 +143,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	if (addr) {
 		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index d2dc0438d654..c22f0b802576 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -52,7 +52,7 @@ static unsigned long stack_maxrandom_size(void)
  * Leave an at least ~128 MB hole with possible stack randomization.
  */
 #define MIN_GAP (128*1024*1024UL + stack_maxrandom_size())
-#define MAX_GAP (TASK_SIZE/6*5)
+#define MAX_GAP (mmap_max_addr()/6*5)
 
 static int mmap_is_legacy(void)
 {
@@ -90,7 +90,7 @@ static unsigned long mmap_base(unsigned long rnd)
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(TASK_SIZE - gap - rnd);
+	return PAGE_ALIGN(mmap_max_addr() - gap - rnd);
 }
 
 /*
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 2a59139f520b..7a7f6dba6b00 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -121,8 +121,6 @@ static struct linux_binfmt aout_format = {
 	.min_coredump	= PAGE_SIZE
 };
 
-#define BAD_ADDR(x)	((unsigned long)(x) >= TASK_SIZE)
-
 static int set_brk(unsigned long start, unsigned long end)
 {
 	start = PAGE_ALIGN(start);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 29a02daf08a9..1f8034aed298 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -89,7 +89,7 @@ static struct linux_binfmt elf_format = {
 	.min_coredump	= ELF_EXEC_PAGESIZE,
 };
 
-#define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)
+#define BAD_ADDR(x) ((unsigned long)(x) >= mmap_max_addr())
 
 static int set_brk(unsigned long start, unsigned long end)
 {
@@ -587,8 +587,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 			k = load_addr + eppnt->p_vaddr;
 			if (BAD_ADDR(k) ||
 			    eppnt->p_filesz > eppnt->p_memsz ||
-			    eppnt->p_memsz > TASK_SIZE ||
-			    TASK_SIZE - eppnt->p_memsz < k) {
+			    eppnt->p_memsz > mmap_max_addr() ||
+			    mmap_max_addr() - eppnt->p_memsz < k) {
 				error = -ENOMEM;
 				goto out;
 			}
@@ -960,8 +960,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		 * <= p_memsz so it is only necessary to check p_memsz.
 		 */
 		if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
-		    elf_ppnt->p_memsz > TASK_SIZE ||
-		    TASK_SIZE - elf_ppnt->p_memsz < k) {
+		    elf_ppnt->p_memsz > mmap_max_addr() ||
+		    mmap_max_addr() - elf_ppnt->p_memsz < k) {
 			/* set_brk can never work. Avoid overflows. */
 			retval = -EINVAL;
 			goto out_free_dentry;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 54de77e78775..e132e93b85fb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -178,7 +178,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
@@ -190,7 +190,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	if (addr) {
 		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr &&
+		if (mmap_max_addr() - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
@@ -198,7 +198,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = TASK_UNMAPPED_BASE;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 	return vm_unmapped_area(&info);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8e7e61b28f31..b91247cd171d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -594,6 +594,7 @@ static const struct limit_names lnames[RLIM_NLIMITS] = {
 	[RLIMIT_NICE] = {"Max nice priority", NULL},
 	[RLIMIT_RTPRIO] = {"Max realtime priority", NULL},
 	[RLIMIT_RTTIME] = {"Max realtime timeout", "us"},
+	[RLIMIT_VADDR] = {"Max virtual address", NULL},
 };
 
 /* Display limits for a process */
diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 5e752b959054..d24c978103e5 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -3,6 +3,9 @@
 
 #include <uapi/asm-generic/resource.h>
 
+#ifndef USER_VADDR_LIM
+#define USER_VADDR_LIM RLIM_INFINITY
+#endif
 
 /*
  * boot-time rlimit defaults for the init task:
@@ -25,6 +28,7 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_VADDR]		= { USER_VADDR_LIM,  RLIM_INFINITY },   \
 }
 
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d1905245c7a..f0f23afe0838 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3661,4 +3661,9 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+static inline unsigned long mmap_max_addr(void)
+{
+	return min(TASK_SIZE, rlimit(RLIMIT_VADDR));
+}
+
 #endif
diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
index c6d10af50123..7843ed0ed7a7 100644
--- a/include/uapi/asm-generic/resource.h
+++ b/include/uapi/asm-generic/resource.h
@@ -45,7 +45,8 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+#define RLIMIT_VADDR		16	/* maximum virtual address */
+#define RLIM_NLIMITS		17
 
 /*
  * SuS says limits have to be unsigned.
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d416f3baf392..651f571a1a79 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1142,8 +1142,9 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 
 	if (!area->vaddr) {
 		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = get_unmapped_area(NULL,
+				mmap_max_addr() - PAGE_SIZE,
+				PAGE_SIZE, 0, 0);
 		if (area->vaddr & ~PAGE_MASK) {
 			ret = area->vaddr;
 			goto fail;
diff --git a/kernel/sys.c b/kernel/sys.c
index 842914ef7de4..a5ee7f23beda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1718,7 +1718,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
  */
 static int validate_prctl_map(struct prctl_mm_map *prctl_map)
 {
-	unsigned long mmap_max_addr = TASK_SIZE;
+	unsigned long max_addr = mmap_max_addr();
 	struct mm_struct *mm = current->mm;
 	int error = -EINVAL, i;
 
@@ -1743,7 +1743,7 @@ static int validate_prctl_map(struct prctl_mm_map *prctl_map)
 	for (i = 0; i < ARRAY_SIZE(offsets); i++) {
 		u64 val = *(u64 *)((char *)prctl_map + offsets[i]);
 
-		if ((unsigned long)val >= mmap_max_addr ||
+		if ((unsigned long)val >= max_addr ||
 		    (unsigned long)val < mmap_min_addr)
 			goto out;
 	}
@@ -1949,7 +1949,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (opt == PR_SET_MM_AUXV)
 		return prctl_set_auxv(mm, addr, arg4);
 
-	if (addr >= TASK_SIZE || addr < mmap_min_addr)
+	if (addr >= mmap_max_addr() || addr < mmap_min_addr)
 		return -EINVAL;
 
 	error = -EINVAL;
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..a3384f23359e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1966,7 +1966,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
-	if (len > TASK_SIZE - mmap_min_addr)
+	if (len > mmap_max_addr() - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -1975,15 +1975,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
-		    (!vma || addr + len <= vma->vm_start))
+		if (mmap_max_addr() - len >= addr &&
+				addr >= mmap_min_addr &&
+				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
 
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = mm->mmap_base;
-	info.high_limit = TASK_SIZE;
+	info.high_limit = mmap_max_addr();
 	info.align_mask = 0;
 	return vm_unmapped_area(&info);
 }
@@ -2005,7 +2006,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	struct vm_unmapped_area_info info;
 
 	/* requested length too big for entire address space */
-	if (len > TASK_SIZE - mmap_min_addr)
+	if (len > mmap_max_addr() - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -2015,7 +2016,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
+		if (mmap_max_addr() - len >= addr &&
+				addr >= mmap_min_addr &&
 				(!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
@@ -2037,7 +2039,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = TASK_SIZE;
+		info.high_limit = mmap_max_addr();
 		addr = vm_unmapped_area(&info);
 	}
 
@@ -2057,7 +2059,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		return error;
 
 	/* Careful about overflows.. */
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
@@ -2078,7 +2080,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
-	if (addr > TASK_SIZE - len)
+	if (addr > mmap_max_addr() - len)
 		return -ENOMEM;
 	if (offset_in_page(addr))
 		return -EINVAL;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2b3bfcd51c75..a8b4fba3dce6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -433,7 +433,8 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(new_addr))
 		goto out;
 
-	if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
+	if (new_len > mmap_max_addr() ||
+			new_addr > mmap_max_addr() - new_len)
 		goto out;
 
 	/* Ensure the old/new locations do not overlap */
diff --git a/mm/nommu.c b/mm/nommu.c
index 24f9f5f39145..6043b8b82083 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -905,7 +905,7 @@ static int validate_mmap_request(struct file *file,
 
 	/* Careful about overflows.. */
 	rlen = PAGE_ALIGN(len);
-	if (!rlen || rlen > TASK_SIZE)
+	if (!rlen || rlen > mmap_max_addr())
 		return -ENOMEM;
 
 	/* offset overflow? */
diff --git a/mm/shmem.c b/mm/shmem.c
index bb53285a1d99..3c9be716083f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1976,7 +1976,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	unsigned long inflated_addr;
 	unsigned long inflated_offset;
 
-	if (len > TASK_SIZE)
+	if (len > mmap_max_addr())
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
@@ -1988,7 +1988,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		return addr;
 	if (addr & ~PAGE_MASK)
 		return addr;
-	if (addr > TASK_SIZE - len)
+	if (addr > mmap_max_addr() - len)
 		return addr;
 
 	if (shmem_huge == SHMEM_HUGE_DENY)
@@ -2031,7 +2031,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		return addr;
 
 	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
-	if (inflated_len > TASK_SIZE)
+	if (inflated_len > mmap_max_addr())
 		return addr;
 	if (inflated_len < len)
 		return addr;
@@ -2047,7 +2047,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (inflated_offset > offset)
 		inflated_addr += HPAGE_PMD_SIZE;
 
-	if (inflated_addr > TASK_SIZE - len)
+	if (inflated_addr > mmap_max_addr() - len)
 		return addr;
 	return inflated_addr;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
@ 2016-12-27  2:06   ` Andy Lutomirski
  2016-12-27  2:24     ` Kirill A. Shutemov
  2017-01-02  8:44   ` Arnd Bergmann
  2017-01-05 19:13   ` Dave Hansen
  2 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2016-12-27  2:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API

On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> This patch introduces new rlimit resource to manage maximum virtual
> address available to userspace to map.
>
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use high bit in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> The patch aims to address this compatibility issue.
>
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.
>
> The default hard limit will be RLIM_INFINITY, which basically means that
> TASK_SIZE limits available address space.
>
> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> with 5-level paging enabled. In this case, soft limit would be
> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> paging which known to be safe
>
> New rlimit resource would follow usual semantics with regards to
> inheritance: preserved on fork(2) and exec(2). This has potential to
> break application if limits set too wide or too narrow, but this is not
> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>
> As with other resources you can set the limit lower than current usage.
> It would affect only future virtual address space allocations.
>
> Use-cases for new rlimit:
>
>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>     its children to use addresses above 47-bits.
>
>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>     exec(2) allows the child to use addresses above 47-bits.
>
>   - Lowering the hard limit to 47-bits would prevent current process all
>     its children to use addresses above 47-bits, unless a process has
>     CAP_SYS_RESOURCES.
>
>   - It’s also can be handy to lower hard or soft limit to arbitrary
>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>     to emulate 32-bit machine on 64-bit host.

I tend to think that this should be a personality or an ELF flag, not
an rlimit.  That way setuid works right.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  2:06   ` Andy Lutomirski
@ 2016-12-27  2:24     ` Kirill A. Shutemov
  2016-12-27  3:22       ` Andy Lutomirski
  2016-12-29  2:53       ` Carlos O'Donell
  0 siblings, 2 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  2:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > This patch introduces new rlimit resource to manage maximum virtual
> > address available to userspace to map.
> >
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use high bit in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> >
> > The patch aims to address this compatibility issue.
> >
> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> > address available to map by userspace.
> >
> > The default hard limit will be RLIM_INFINITY, which basically means that
> > TASK_SIZE limits available address space.
> >
> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
> > with 5-level paging enabled. In this case, soft limit would be
> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> > paging which known to be safe
> >
> > New rlimit resource would follow usual semantics with regards to
> > inheritance: preserved on fork(2) and exec(2). This has potential to
> > break application if limits set too wide or too narrow, but this is not
> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >
> > As with other resources you can set the limit lower than current usage.
> > It would affect only future virtual address space allocations.
> >
> > Use-cases for new rlimit:
> >
> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >     its children to use addresses above 47-bits.
> >
> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >     exec(2) allows the child to use addresses above 47-bits.
> >
> >   - Lowering the hard limit to 47-bits would prevent current process all
> >     its children to use addresses above 47-bits, unless a process has
> >     CAP_SYS_RESOURCES.
> >
> >   - It’s also can be handy to lower hard or soft limit to arbitrary
> >     address. User-mode emulation in QEMU may lower the limit to 32-bit
> >     to emulate 32-bit machine on 64-bit host.
> 
> I tend to think that this should be a personality or an ELF flag, not
> an rlimit.

My plan was to implement the ELF flag on top. Basically, the ELF flag
would mean that we bump the soft limit to the hard limit on exec.
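
Roughly along these lines in the ELF loader (untested sketch; the
ELF_WIDE_VADDR marker is made up, and how it gets parsed out of the
binary is the open question):

	/* In load_elf_binary(), once the headers have been parsed: */
	if (elf_flags & ELF_WIDE_VADDR) {
		struct rlimit *r = current->signal->rlim + RLIMIT_VADDR;

		task_lock(current->group_leader);
		r->rlim_cur = r->rlim_max;
		task_unlock(current->group_leader);
	}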

> That way setuid works right.

Um.. I'm probably missing the background here.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  2:24     ` Kirill A. Shutemov
@ 2016-12-27  3:22       ` Andy Lutomirski
  2017-01-02  9:09         ` Kirill A. Shutemov
  2016-12-29  2:53       ` Carlos O'Donell
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2016-12-27  3:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Mon, Dec 26, 2016 at 6:24 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > This patch introduces new rlimit resource to manage maximum virtual
>> > address available to userspace to map.
>> >
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use high bit in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> >
>> > The patch aims to address this compatibility issue.
>> >
>> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>> > address available to map by userspace.
>> >
>> > The default hard limit will be RLIM_INFINITY, which basically means that
>> > TASK_SIZE limits available address space.
>> >
>> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
>> > with 5-level paging enabled. In this case, soft limit would be
>> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>> > paging which known to be safe
>> >
>> > New rlimit resource would follow usual semantics with regards to
>> > inheritance: preserved on fork(2) and exec(2). This has potential to
>> > break application if limits set too wide or too narrow, but this is not
>> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>> >
>> > As with other resources you can set the limit lower than current usage.
>> > It would affect only future virtual address space allocations.
>> >
>> > Use-cases for new rlimit:
>> >
>> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>> >     its children to use addresses above 47-bits.
>> >
>> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>> >     exec(2) allows the child to use addresses above 47-bits.
>> >
>> >   - Lowering the hard limit to 47-bits would prevent current process all
>> >     its children to use addresses above 47-bits, unless a process has
>> >     CAP_SYS_RESOURCES.
>> >
>> >   - It’s also can be handy to lower hard or soft limit to arbitrary
>> >     address. User-mode emulation in QEMU may lower the limit to 32-bit
>> >     to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
>
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.
>
>> That way setuid works right.
>
> Um.. I probably miss background here.
>

If a setuid program depends on the lower limit, then a malicious
program shouldn't be able to cause it to run with the higher limit.
The personality code should already get this case right because
personalities are reset when setuid happens.
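
That is, the worry is a launcher like this (sketch, using the rlimit
semantics proposed in the patch; the helper path is made up):

	/* Malicious parent widens the VA space... */
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };
	setrlimit(RLIMIT_VADDR, &r);
	/* ...and the setuid target inherits it across exec, even if
	 * it only copes with 47-bit pointers. */
	execl("/usr/bin/setuid-helper", "setuid-helper", (char *)NULL);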

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  2:24     ` Kirill A. Shutemov
  2016-12-27  3:22       ` Andy Lutomirski
@ 2016-12-29  2:53       ` Carlos O'Donell
  2016-12-31  2:08         ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Carlos O'Donell @ 2016-12-29  2:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch, linux-mm, linux-kernel,
	Linux API

On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>>> This patch introduces new rlimit resource to manage maximum virtual
>>> address available to userspace to map.
>>>
>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>> Not all user space is ready to handle wide addresses. It's known that
>>> at least some JIT compilers use high bit in pointers to encode their
>>> information. It collides with valid pointers with 5-level paging and
>>> leads to crashes.
>>>
>>> The patch aims to address this compatibility issue.
>>>
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>>
>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>> TASK_SIZE limits available address space.
>>>
>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>> with 5-level paging enabled. In this case, soft limit would be
>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>> paging which known to be safe
>>>
>>> New rlimit resource would follow usual semantics with regards to
>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>> break application if limits set too wide or too narrow, but this is not
>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>
>>> As with other resources you can set the limit lower than current usage.
>>> It would affect only future virtual address space allocations.
>>>
>>> Use-cases for new rlimit:
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>>     its children to use addresses above 47-bits.
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>>     exec(2) allows the child to use addresses above 47-bits.
>>>
>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>>     its children to use addresses above 47-bits, unless a process has
>>>     CAP_SYS_RESOURCES.
>>>
>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>>>     to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
> 
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.

Could you clarify what you mean by an "ELF flag"?

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-29  2:53       ` Carlos O'Donell
@ 2016-12-31  2:08         ` Andy Lutomirski
  2017-01-02  8:35           ` Kirill A. Shutemov
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2016-12-31  2:08 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, Dave Hansen,
	linux-arch, linux-mm, linux-kernel, Linux API

On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
>> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>> <kirill.shutemov@linux.intel.com> wrote:
>>>> This patch introduces new rlimit resource to manage maximum virtual
>>>> address available to userspace to map.
>>>>
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use high bit in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> The patch aims to address this compatibility issue.
>>>>
>>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>>> address available to map by userspace.
>>>>
>>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>>> TASK_SIZE limits available address space.
>>>>
>>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>>> with 5-level paging enabled. In this case, soft limit would be
>>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>>> paging which known to be safe
>>>>
>>>> New rlimit resource would follow usual semantics with regards to
>>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>>> break application if limits set too wide or too narrow, but this is not
>>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>>
>>>> As with other resources you can set the limit lower than current usage.
>>>> It would affect only future virtual address space allocations.
>>>>
>>>> Use-cases for new rlimit:
>>>>
>>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>>>     its children to use addresses above 47-bits.
>>>>
>>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>>>     exec(2) allows the child to use addresses above 47-bits.
>>>>
>>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>>>     its children to use addresses above 47-bits, unless a process has
>>>>     CAP_SYS_RESOURCES.
>>>>
>>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>>>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>>>>     to emulate 32-bit machine on 64-bit host.
>>>
>>> I tend to think that this should be a personality or an ELF flag, not
>>> an rlimit.
>>
>> My plan was to implement ELF flag on top. Basically, ELF flag would mean
>> that we bump soft limit to hard limit on exec.
>
> Could you clarify what you mean by an "ELF flag?"

Some way to mark a binary as supporting a larger address space.  I
don't have a precise solution in mind, but an ELF note might be a good
way to go here.
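
For instance, a binary could opt in with a hand-rolled note along
these lines (the section name, owner string, note type and payload
below are all invented for illustration):

	__attribute__((section(".note.wide-vaddr"), aligned(4), used))
	static const struct {
		unsigned int namesz, descsz, type;
		char name[8];		/* padded to 4-byte multiple */
		unsigned int desc;
	} wide_vaddr_note = {
		.namesz	= 6,		/* strlen("Linux") + NUL */
		.descsz	= 4,
		.type	= 1,		/* invented note type */
		.name	= "Linux",
		.desc	= 1,		/* "I handle >47-bit VA" */
	};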

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-31  2:08         ` Andy Lutomirski
@ 2017-01-02  8:35           ` Kirill A. Shutemov
  2017-01-13 20:11             ` H.J. Lu
  0 siblings, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-02  8:35 UTC (permalink / raw)
  To: Andy Lutomirski, H.J. Lu
  Cc: Carlos O'Donell, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, Dave Hansen,
	linux-arch, linux-mm, linux-kernel, Linux API

On Fri, Dec 30, 2016 at 06:08:27PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> > On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> >> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> >>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
> >>> <kirill.shutemov@linux.intel.com> wrote:
> >>>> This patch introduces new rlimit resource to manage maximum virtual
> >>>> address available to userspace to map.
> >>>>
> >>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
> >>>> Not all user space is ready to handle wide addresses. It's known that
> >>>> at least some JIT compilers use high bit in pointers to encode their
> >>>> information. It collides with valid pointers with 5-level paging and
> >>>> leads to crashes.
> >>>>
> >>>> The patch aims to address this compatibility issue.
> >>>>
> >>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >>>> address available to map by userspace.
> >>>>
> >>>> The default hard limit will be RLIM_INFINITY, which basically means that
> >>>> TASK_SIZE limits available address space.
> >>>>
> >>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> >>>> with 5-level paging enabled. In this case, soft limit would be
> >>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> >>>> paging which known to be safe
> >>>>
> >>>> New rlimit resource would follow usual semantics with regards to
> >>>> inheritance: preserved on fork(2) and exec(2). This has potential to
> >>>> break application if limits set too wide or too narrow, but this is not
> >>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >>>>
> >>>> As with other resources you can set the limit lower than current usage.
> >>>> It would affect only future virtual address space allocations.
> >>>>
> >>>> Use-cases for new rlimit:
> >>>>
> >>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >>>>     its children to use addresses above 47-bits.
> >>>>
> >>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >>>>     exec(2) allows the child to use addresses above 47-bits.
> >>>>
> >>>>   - Lowering the hard limit to 47-bits would prevent current process all
> >>>>     its children to use addresses above 47-bits, unless a process has
> >>>>     CAP_SYS_RESOURCES.
> >>>>
> >>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
> >>>>     address. User-mode emulation in QEMU may lower the limit to 32-bit
> >>>>     to emulate 32-bit machine on 64-bit host.
> >>>
> >>> I tend to think that this should be a personality or an ELF flag, not
> >>> an rlimit.
> >>
> >> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> >> that we bump soft limit to hard limit on exec.
> >
> > Could you clarify what you mean by an "ELF flag?"
> 
> Some way to mark a binary as supporting a larger address space.  I
> don't have a precise solution in mind, but an ELF note might be a good
> way to go here.

+ H.J.

There's a discussion of the "Program Properties" proposal[1]. It seems
to fit the purpose.

[1] https://sourceware.org/ml/gnu-gabi/2016-q4/msg00000.html

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
  2016-12-27  2:06   ` Andy Lutomirski
@ 2017-01-02  8:44   ` Arnd Bergmann
  2017-01-03  6:08     ` Andy Lutomirski
  2017-01-05 19:13   ` Dave Hansen
  2 siblings, 1 reply; 72+ messages in thread
From: Arnd Bergmann @ 2017-01-02  8:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Dave Hansen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel, linux-api, linux-arm-kernel,
	Catalin Marinas, Will Deacon

On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
> This patch introduces new rlimit resource to manage maximum virtual
> address available to userspace to map.
> 
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use high bit in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
> 
> The patch aims to address this compatibility issue.
> 
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.
> 
> The default hard limit will be RLIM_INFINITY, which basically means that
> TASK_SIZE limits available address space.
> 
> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> with 5-level paging enabled. In this case, soft limit would be
> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> paging which known to be safe
> 
> New rlimit resource would follow usual semantics with regards to
> inheritance: preserved on fork(2) and exec(2). This has potential to
> break application if limits set too wide or too narrow, but this is not
> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> 
> As with other resources you can set the limit lower than current usage.
> It would affect only future virtual address space allocations.
> 
> Use-cases for new rlimit:
> 
>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>     its children to use addresses above 47-bits.
> 
>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>     exec(2) allows the child to use addresses above 47-bits.
> 
>   - Lowering the hard limit to 47-bits would prevent current process all
>     its children to use addresses above 47-bits, unless a process has
>     CAP_SYS_RESOURCES.
> 
>   - It’s also can be handy to lower hard or soft limit to arbitrary
>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>     to emulate 32-bit machine on 64-bit host.
> 
> TODO:
>   - port to non-x86;
> 
> Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: linux-api@vger.kernel.org

This seems to nicely address the same problem on arm64, which has
run into the same issue due to the various page table formats
that can currently be chosen at compile time.

I don't see how this interacts with the existing
PER_LINUX32/PER_LINUX32_3GB personality flags, but I assume you have
either already thought of that, or we can come up with a good way
to define what happens when conflicting settings are applied.

The two reasonable ways I can think of are to either use the
minimum of the two limits, or to make the personality syscall
set the soft rlimit and use whatever limit was last set.
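
The first option could be a second cap inside the mmap_max_addr()
helper from this patch, something like (sketch; only the 3GB flag is
shown):

	static inline unsigned long mmap_max_addr(void)
	{
		unsigned long lim = min(TASK_SIZE, rlimit(RLIMIT_VADDR));

		/* Honour the legacy personality flags as well. */
		if (current->personality & ADDR_LIMIT_3GB)
			lim = min(lim, 0xc0000000UL);
		return lim;
	}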

	Arnd

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  3:22       ` Andy Lutomirski
@ 2017-01-02  9:09         ` Kirill A. Shutemov
  0 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-02  9:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Mon, Dec 26, 2016 at 07:22:03PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 26, 2016 at 6:24 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > This patch introduces new rlimit resource to manage maximum virtual
> >> > address available to userspace to map.
> >> >
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use high bit in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> >
> >> > The patch aims to address this compatibility issue.
> >> >
> >> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >> > address available to map by userspace.
> >> >
> >> > The default hard limit will be RLIM_INFINITY, which basically means that
> >> > TASK_SIZE limits available address space.
> >> >
> >> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
> >> > with 5-level paging enabled. In this case, soft limit would be
> >> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> >> > paging which known to be safe
> >> >
> >> > New rlimit resource would follow usual semantics with regards to
> >> > inheritance: preserved on fork(2) and exec(2). This has potential to
> >> > break application if limits set too wide or too narrow, but this is not
> >> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >> >
> >> > As with other resources you can set the limit lower than current usage.
> >> > It would affect only future virtual address space allocations.
> >> >
> >> > Use-cases for new rlimit:
> >> >
> >> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >> >     its children to use addresses above 47-bits.
> >> >
> >> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >> >     exec(2) allows the child to use addresses above 47-bits.
> >> >
> >> >   - Lowering the hard limit to 47-bits would prevent current process all
> >> >     its children to use addresses above 47-bits, unless a process has
> >> >     CAP_SYS_RESOURCES.
> >> >
> >> >   - It’s also can be handy to lower hard or soft limit to arbitrary
> >> >     address. User-mode emulation in QEMU may lower the limit to 32-bit
> >> >     to emulate 32-bit machine on 64-bit host.
> >>
> >> I tend to think that this should be a personality or an ELF flag, not
> >> an rlimit.
> >
> > My plan was to implement ELF flag on top. Basically, ELF flag would mean
> > that we bump soft limit to hard limit on exec.
> >
> >> That way setuid works right.
> >
> > Um.. I probably miss background here.
> >
> 
> If a setuid program depends on the lower limit, then a malicious
> program shouldn't be able to cause it to run with the higher limit.
> The personality code should already get this case right because
> personalities are reset when setuid happens.

It would be nice to have more fine-grained control than a binary
personality flag gives. It would cover more use-cases.

Well, we could reset the limit on exec of a setuid binary too. That's
not ideal, but...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-02  8:44   ` Arnd Bergmann
@ 2017-01-03  6:08     ` Andy Lutomirski
  2017-01-03 13:18       ` Arnd Bergmann
  2017-01-03 16:04       ` Kirill A. Shutemov
  0 siblings, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-03  6:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
>> As with other resources you can set the limit lower than current usage.
>> It would affect only future virtual address space allocations.

I still don't buy all these use cases:

>>
>> Use-cases for new rlimit:
>>
>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>     its children to use addresses above 47-bits.

OK, I get this, but only as a workaround for programs that make
assumptions about the address space and don't use some mechanism (to
be designed?) to work correctly in spite of a larger address space.

>>
>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>     exec(2) allows the child to use addresses above 47-bits.

Ditto.

>>
>>   - Lowering the hard limit to 47-bits would prevent current process all
>>     its children to use addresses above 47-bits, unless a process has
>>     CAP_SYS_RESOURCES.

I've tried and I can't imagine any reason to do this.

>>
>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>>     to emulate 32-bit machine on 64-bit host.

I don't understand.  QEMU user-mode emulation intercepts all syscalls.
What QEMU would *actually* want is a way to say "allocate me some
memory with the high N bits clear".  mmap-via-int80 on x86 should be
fixed to do this, but a new syscall with an explicit parameter would
work, as would a prctl changing the current limit.

>>
>> TODO:
>>   - port to non-x86;
>>
>> Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: linux-api@vger.kernel.org
>
> This seems to nicely address the same problem on arm64, which has
> run into the same issue due to the various page table formats
> that can currently be chosen at compile time.

On further reflection, I think this has very little to do with paging
formats except insofar as paging formats make us notice the problem.
The issue is that user code wants to be able to assume an upper limit
on an address, and it gets an upper limit right now that depends on
architecture due to paging formats.  But someone really might want to
write a *portable* 64-bit program that allocates memory with the high
16 bits clear.  So let's add such a mechanism directly.

As a thought experiment, what if x86_64 simply never allocated "high"
(above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
were used?  Old glibc would continue working.  Old VMs would work.
New programs that want to use ginormous mappings would have to use the
new syscall.  This would be totally stateless and would have no issues
with CRIU.
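
Concretely, the thought experiment amounts to something like this
(entirely hypothetical interface, named here only for discussion):

	/* mmap() variant with an explicit upper bound on the address
	 * range the kernel may pick; addr_limit is a cap, not a hint. */
	void *mmap_limited(void *addr, size_t len, int prot, int flags,
			   int fd, off_t off, unsigned long addr_limit);

	/* A portable 64-bit program wanting the high 16 bits clear: */
	void *p = mmap_limited(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0,
			       1UL << 48);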

If necessary, we could also have a prctl that changes a
"personality-like" limit that is in effect when the old mmap was used.
I say "personality-like" because it would reset under exactly the same
conditions that personality resets itself.

Thoughts?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03  6:08     ` Andy Lutomirski
@ 2017-01-03 13:18       ` Arnd Bergmann
  2017-01-03 18:29         ` Andy Lutomirski
  2017-01-03 16:04       ` Kirill A. Shutemov
  1 sibling, 1 reply; 72+ messages in thread
From: Arnd Bergmann @ 2017-01-03 13:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Monday, January 2, 2017 10:08:28 PM CET Andy Lutomirski wrote:
> 
> > This seems to nicely address the same problem on arm64, which has
> > run into the same issue due to the various page table formats
> > that can currently be chosen at compile time.
> 
> On further reflection, I think this has very little to do with paging
> formats except insofar as paging formats make us notice the problem.
> The issue is that user code wants to be able to assume an upper limit
> on an address, and it gets an upper limit right now that depends on
> architecture due to paging formats.  But someone really might want to
> write a *portable* 64-bit program that allocates memory with the high
> 16 bits clear.  So let's add such a mechanism directly.
> 
> As a thought experiment, what if x86_64 simply never allocated "high"
> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
> were used?  Old glibc would continue working.  Old VMs would work.
> New programs that want to use ginormous mappings would have to use the
> new syscall.  This would be totally stateless and would have no issues
> with CRIU.

I can see this working well for the 47-bit addressing default, but
what about applications that actually rely on 39-bit addressing
(I'd have to double-check, but I think this was the limit that
people were most interested in for arm64)?

39 bits seems a little small to make that the default for everyone
who doesn't pass the extra flag. Having to pass another flag to
limit the addresses introduces other problems (e.g. an mmap from a
library call that doesn't pass that flag).

> If necessary, we could also have a prctl that changes a
> "personality-like" limit that is in effect when the old mmap was used.
> I say "personality-like" because it would reset under exactly the same
> conditions that personality resets itself.

For "personality-like", it would still have to interact
with the existing PER_LINUX32 and PER_LINUX32_3GB flags that
do the exact same thing, so actually using personality might
be better.

We still have a few bits in the personality arguments, and
we could combine them with the existing ADDR_LIMIT_3GB
and ADDR_LIMIT_32BIT flags that are mutually exclusive by
definition, such as

        ADDR_LIMIT_32BIT =      0x0800000, /* existing */
        ADDR_LIMIT_3GB   =      0x8000000, /* existing */
        ADDR_LIMIT_39BIT =      0x0010000, /* next free bit */
        ADDR_LIMIT_42BIT =      0x8010000,
        ADDR_LIMIT_47BIT =      0x0810000,
        ADDR_LIMIT_48BIT =      0x8810000,

This would probably take only one or two personality bits for the
limits that are interesting in practice.
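
If that route were taken, opting a process (and its children) down to
47 bits would be a one-liner, e.g. (ADDR_LIMIT_47BIT as proposed
above, not an existing flag):

	#include <sys/personality.h>

	/* personality(0xffffffff) reads the current persona. */
	personality(personality(0xffffffff) | ADDR_LIMIT_47BIT);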

	Arnd

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03  6:08     ` Andy Lutomirski
  2017-01-03 13:18       ` Arnd Bergmann
@ 2017-01-03 16:04       ` Kirill A. Shutemov
  2017-01-03 18:27         ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-03 16:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Mon, Jan 02, 2017 at 10:08:28PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
> >> As with other resources you can set the limit lower than current usage.
> >> It would affect only future virtual address space allocations.
> 
> I still don't buy all these use cases:
> 
> >>
> >> Use-cases for new rlimit:
> >>
> >>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >>     its children to use addresses above 47-bits.
> 
> OK, I get this, but only as a workaround for programs that make
> assumptions about the address space and don't use some mechanism (to
> be designed?) to work correctly in spite of a larger address space.

I guess you've misread the case. It's opt-in for the large address
space, not the other way around.

I believe 47-bit VA by default is the right way to go to make the
transition without breaking userspace.

> >>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >>     exec(2) allows the child to use addresses above 47-bits.
> 
> Ditto.
> 
> >>
> >>   - Lowering the hard limit to 47-bits would prevent current process all
> >>     its children to use addresses above 47-bits, unless a process has
> >>     CAP_SYS_RESOURCES.
> 
> I've tried and I can't imagine any reason to do this.

That's just for the case where something went wrong and we want to stop
an application from using addresses above 47 bits.

> >>   - It’s also can be handy to lower hard or soft limit to arbitrary
> >>     address. User-mode emulation in QEMU may lower the limit to 32-bit
> >>     to emulate 32-bit machine on 64-bit host.
> 
> I don't understand.  QEMU user-mode emulation intercepts all syscalls.
> What QEMU would *actually* want is a way to say "allocate me some
> memory with the high N bits clear".  mmap-via-int80 on x86 should be
> fixed to do this, but a new syscall with an explicit parameter would
> work, as would a prctl changing the current limit.

Look at the mess in mmap_find_vma(): QEMU has to guess where free
virtual memory is. That's unnecessarily complex.

prctl would work for this too. A new mmap would *not*: there are more
ways to allocate virtual address space, such as shmat() and mremap().
Changing all of them just for this is stupid.
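
i.e. a single knob that caps every allocation path at once
(PR_SET_VADDR_LIMIT is a hypothetical name):

	/* Caps future mmap(), shmat(), mremap() and stack growth: */
	prctl(PR_SET_VADDR_LIMIT, 1UL << 32, 0, 0, 0);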

> >>
> >> TODO:
> >>   - port to non-x86;
> >>
> >> Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> Cc: linux-api@vger.kernel.org
> >
> > This seems to nicely address the same problem on arm64, which has
> > run into the same issue due to the various page table formats
> > that can currently be chosen at compile time.
> 
> On further reflection, I think this has very little to do with paging
> formats except insofar as paging formats make us notice the problem.
> The issue is that user code wants to be able to assume an upper limit
> on an address, and it gets an upper limit right now that depends on
> architecture due to paging formats.  But someone really might want to
> write a *portable* 64-bit program that allocates memory with the high
> 16 bits clear.  So let's add such a mechanism directly.
> 
> As a thought experiment, what if x86_64 simply never allocated "high"
> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
> were used?  Old glibc would continue working.  Old VMs would work.
> New programs that want to use ginormous mappings would have to use the
> new syscall.  This would be totally stateless and would have no issues
> with CRIU.

Except that we need more than mmap, as I mentioned.

And what about the stack? I'm not sure that everybody would be happy with
the stack in the middle of the address space.

> If necessary, we could also have a prctl that changes a
> "personality-like" limit that is in effect when the old mmap was used.
> I say "personality-like" because it would reset under exactly the same
> conditions that personality resets itself.
> 
> Thoughts?
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 16:04       ` Kirill A. Shutemov
@ 2017-01-03 18:27         ` Andy Lutomirski
  2017-01-04 14:19           ` Kirill A. Shutemov
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-03 18:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Arnd Bergmann, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Mon, Jan 02, 2017 at 10:08:28PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> > On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
>> >> As with other resources you can set the limit lower than current usage.
>> >> It would affect only future virtual address space allocations.
>>
>> I still don't buy all these use cases:
>>
>> >>
>> >> Use-cases for new rlimit:
>> >>
>> >>   - Bumping the soft limit to RLIM_INFINITY allows the current process and
>> >>     all its children to use addresses above 47-bits.
>>
>> OK, I get this, but only as a workaround for programs that make
>> assumptions about the address space and don't use some mechanism (to
>> be designed?) to work correctly in spite of a larger address space.
>
> I guess you've misread the case. It's opt-in for large address space, not
> the other way around.
>
> I believe 47-bit VA by default is the right way to go to make the transition
> without breaking userspace.

What I meant was: setting the rlimit to anything other than -1ULL is a
workaround, but otherwise I agree.  This still makes little sense if
set by PAM or other conventional rlimit tools.

>> >>
>> >>   - Lowering the hard limit to 47-bits would prevent the current process and
>> >>     all its children from using addresses above 47-bits, unless a process has
>> >>     CAP_SYS_RESOURCE.
>>
>> I've tried and I can't imagine any reason to do this.
>
> That's just if something went wrong and we want to stop an application
> from using addresses above 47-bit.

But CAP_SYS_RESOURCE still makes no sense in this context.

>
>> >>   - It can also be handy to lower the hard or soft limit to an arbitrary
>> >>     address. User-mode emulation in QEMU may lower the limit to 32-bit
>> >>     to emulate a 32-bit machine on a 64-bit host.
>>
>> I don't understand.  QEMU user-mode emulation intercepts all syscalls.
>> What QEMU would *actually* want is a way to say "allocate me some
>> memory with the high N bits clear".  mmap-via-int80 on x86 should be
>> fixed to do this, but a new syscall with an explicit parameter would
>> work, as would a prctl changing the current limit.
>
> Look at the mess in mmap_find_vma(). QEMU has to guess where free virtual
> memory is. That's unnecessarily complex.
>
> prctl would work for this too. new-mmap would *not*: there are more ways
> to allocate virtual address space: shmat(), mremap(). Changing all of them
> just for this is stupid.

Fair enough.

Except that mmap-via-int80, shmat-via-int80, etc should still work (if
I understand what qemu needs correctly), as would the prctl.

>
>> >>
>> >> TODO:
>> >>   - port to non-x86;
>> >>
>> >> Not-yet-signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> >> Cc: linux-api@vger.kernel.org
>> >
>> > This seems to nicely address the same problem on arm64, which has
>> > run into the same issue due to the various page table formats
>> > that can currently be chosen at compile time.
>>
>> On further reflection, I think this has very little to do with paging
>> formats except insofar as paging formats make us notice the problem.
>> The issue is that user code wants to be able to assume an upper limit
>> on an address, and it gets an upper limit right now that depends on
>> architecture due to paging formats.  But someone really might want to
>> write a *portable* 64-bit program that allocates memory with the high
>> 16 bits clear.  So let's add such a mechanism directly.
>>
>> As a thought experiment, what if x86_64 simply never allocated "high"
>> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
>> were used?  Old glibc would continue working.  Old VMs would work.
>> New programs that want to use ginormous mappings would have to use the
>> new syscall.  This would be totally stateless and would have no issues
>> with CRIU.
>
> Except that we need more than mmap, as I mentioned.
>
> And what about the stack? I'm not sure that everybody would be happy with
> the stack in the middle of the address space.

I would, personally.  I think that, for very large address spaces, we
should allocate a large block of stack and get rid of the "stack grows
down forever" legacy idea.  Then we would never need to worry about
the stack eventually hitting some other allocation.  And 2^57 bytes is
hilariously large for a default stack.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 13:18       ` Arnd Bergmann
@ 2017-01-03 18:29         ` Andy Lutomirski
  2017-01-03 22:07           ` Arnd Bergmann
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-03 18:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Tue, Jan 3, 2017 at 5:18 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday, January 2, 2017 10:08:28 PM CET Andy Lutomirski wrote:
>>
>> > This seems to nicely address the same problem on arm64, which has
>> > run into the same issue due to the various page table formats
>> > that can currently be chosen at compile time.
>>
>> On further reflection, I think this has very little to do with paging
>> formats except insofar as paging formats make us notice the problem.
>> The issue is that user code wants to be able to assume an upper limit
>> on an address, and it gets an upper limit right now that depends on
>> architecture due to paging formats.  But someone really might want to
>> write a *portable* 64-bit program that allocates memory with the high
>> 16 bits clear.  So let's add such a mechanism directly.
>>
>> As a thought experiment, what if x86_64 simply never allocated "high"
>> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
>> were used?  Old glibc would continue working.  Old VMs would work.
>> New programs that want to use ginormous mappings would have to use the
>> new syscall.  This would be totally stateless and would have no issues
>> with CRIU.
>
> I can see this working well for the 47-bit addressing default, but
> what about applications that actually rely on 39-bit addressing
> (I'd have to double-check, but I think this was the limit that
> people were most interested in for arm64)?
>
> 39 bits seems a little small to make that the default for everyone
> who doesn't pass the extra flag. Having to pass another flag to
> limit the addresses introduces other problems (e.g. mmap from
> library call that doesn't pass that flag).

That's a fair point.  Maybe my straw man isn't so good.

>
>> If necessary, we could also have a prctl that changes a
>> "personality-like" limit that is in effect when the old mmap was used.
>> I say "personality-like" because it would reset under exactly the same
>> conditions that personality resets itself.
>
> For "personality-like", it would still have to interact
> with the existing PER_LINUX32 and PER_LINUX32_3GB flags that
> do the exact same thing, so actually using personality might
> be better.
>
> We still have a few bits in the personality arguments, and
> we could combine them with the existing ADDR_LIMIT_3GB
> and ADDR_LIMIT_32BIT flags that are mutually exclusive by
> definition, such as
>
>         ADDR_LIMIT_32BIT =      0x0800000, /* existing */
>         ADDR_LIMIT_3GB   =      0x8000000, /* existing */
>         ADDR_LIMIT_39BIT =      0x0010000, /* next free bit */
>         ADDR_LIMIT_42BIT =      0x8010000,
>         ADDR_LIMIT_47BIT =      0x0810000,
>         ADDR_LIMIT_48BIT =      0x8810000,
>
> This would probably take only one or two personality bits for the
> limits that are interesting in practice.

Hmm.  What if we approached this a bit differently?  We could add a
single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
causes PER_LINUX32_3GB etc. to be automatically cleared.  When
ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
settable and reading it via prctl returns whatever is implied by the
other personality bits.
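
Usage could look roughly like this (a sketch; both the flag value and the
prctl command are made up for illustration):

	#include <sys/personality.h>
	#include <sys/prctl.h>

	#define ADDR_LIMIT_EXPLICIT	0x0020000	/* hypothetical flag */
	#define PR_SET_ADDR_LIMIT	48		/* hypothetical command */

	static int opt_in_to_56bit_va(void)
	{
		/* Setting the flag implicitly clears ADDR_LIMIT_32BIT
		 * and ADDR_LIMIT_3GB... */
		if (personality(personality(0xffffffff) | ADDR_LIMIT_EXPLICIT) < 0)
			return -1;

		/* ...after which an exact numeric limit becomes settable. */
		return prctl(PR_SET_ADDR_LIMIT, (1UL << 56) - 1, 0, 0, 0);
	}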

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 18:29         ` Andy Lutomirski
@ 2017-01-03 22:07           ` Arnd Bergmann
  2017-01-03 22:09             ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Arnd Bergmann @ 2017-01-03 22:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Tuesday, January 3, 2017 10:29:33 AM CET Andy Lutomirski wrote:
> 
> Hmm.  What if we approached this a bit differently?  We could add a
> single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
>> causes PER_LINUX32_3GB etc. to be automatically cleared.

Both the ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB flags I guess?

> When
> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
> settable and reading it via prctl returns whatever is implied by the
> other personality bits.

I don't see anything wrong with it, but I'm a bit confused now
what this would be good for, compared to using just prctl.

Is this about setuid clearing the personality but not the prctl,
or something else?

	Arnd

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 22:07           ` Arnd Bergmann
@ 2017-01-03 22:09             ` Andy Lutomirski
  2017-01-04 13:55               ` Arnd Bergmann
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-03 22:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Tue, Jan 3, 2017 at 2:07 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday, January 3, 2017 10:29:33 AM CET Andy Lutomirski wrote:
>>
>> Hmm.  What if we approached this a bit differently?  We could add a
>> single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
>> causes PER_LINUX32_3GB etc. to be automatically cleared.
>
> Both the ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB flags I guess?

Yes.

>
>> When
>> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
>> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
>> settable and reading it via prctl returns whatever is implied by the
>> other personality bits.
>
> I don't see anything wrong with it, but I'm a bit confused now
> what this would be good for, compared to using just prctl.
>
> Is this about setuid clearing the personality but not the prctl,
> or something else?

It's to avoid ambiguity as to what happens if you set ADDR_LIMIT_32BIT
and use the prctl.  ISTM it would be nice for the semantics to be
fully defined in all cases.

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 22:09             ` Andy Lutomirski
@ 2017-01-04 13:55               ` Arnd Bergmann
  0 siblings, 0 replies; 72+ messages in thread
From: Arnd Bergmann @ 2017-01-04 13:55 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Andy Lutomirski, linux-arch, Andi Kleen, Catalin Marinas,
	linux-mm, Linux API, X86 ML, Will Deacon, linux-kernel,
	Dave Hansen, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Linus Torvalds, Thomas Gleixner, Kirill A. Shutemov

On Tuesday, January 3, 2017 2:09:16 PM CET Andy Lutomirski wrote:
> >
> >> When
> >> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
> >> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
> >> settable and reading it via prctl returns whatever is implied by the
> >> other personality bits.
> >
> > I don't see anything wrong with it, but I'm a bit confused now
> > what this would be good for, compared to using just prctl.
> >
> > Is this about setuid clearing the personality but not the prctl,
> > or something else?
> 
> It's to avoid ambiguity as to what happens if you set ADDR_LIMIT_32BIT
> and use the prctl.  ISTM it would be nice for the semantics to be
> fully defined in all cases.
> 

Ok, got it.

	Arnd

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-03 18:27         ` Andy Lutomirski
@ 2017-01-04 14:19           ` Kirill A. Shutemov
  2017-01-05 17:53             ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-04 14:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Tue, Jan 03, 2017 at 10:27:22AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > And what about the stack? I'm not sure that everybody would be happy with
> > the stack in the middle of the address space.
> 
> I would, personally.  I think that, for very large address spaces, we
> should allocate a large block of stack and get rid of the "stack grows
> down forever" legacy idea.  Then we would never need to worry about
> the stack eventually hitting some other allocation.  And 2^57 bytes is
> hilariously large for a default stack.

The stack in the middle of the address space can prevent creating other huuuge
contiguous mappings. Databases may want this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv2 00/29] 5-level paging
  2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
                   ` (28 preceding siblings ...)
  2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
@ 2017-01-05 16:57 ` Kirill A. Shutemov
  29 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-05 16:57 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: Kirill A. Shutemov, Andi Kleen, Arnd Bergmann, Dave Hansen,
	Andy Lutomirski, linux-arch, linux-mm, linux-kernel

On Tue, Dec 27, 2016 at 04:53:44AM +0300, Kirill A. Shutemov wrote:
> Here is v2 of 5-level paging patchset.
> 
> Please consider applying first 7 patches.

It's probably useful to describe all the pieces and the order in which they
can be merged:

  - The first seven patches of this patchset I would like to get applied now:

    + Detect la57 feature for /proc/cpuinfo.

    + Bring 5-level paging to generic code and convert all architectures
      to it using <asm-generic/5level-fixup.h>

    This is preparation for the next batch of patches.

  - Basic LA57 enabling

    The rest of the patches of the patchset, except rlimit proposal.

    This would enable 5-level paging for kernel.

    Userspace upper address would be limited to the current TASK_SIZE_MAX
    (2^47 - PAGE_SIZE), until we figure out the right interface to opt in
    to the full 56-bit VA.

    We are still working on getting XEN into shape. We need to get it up and
    running at least for 4-level paging so as not to regress any configuration.

The rest can be merged independently after basic LA57 enabling:

  - Large VA opt-in mechanism

    I've proposed an rlimit to enable large VA for userspace.

    Andy is not a fan of it. We need to decide what the right way to go is.

    Any help with that is welcome.

  - Boottime switch for 5-level paging.

    I haven't started looking into this yet.

  - MPX - MAWA enabling required.

    It requires changes into GCC (libmpx and libmpxwrappers) which are
    not ready yet.

  - Virtualization - EPT5

    There's RFC patchset by Liang Li. Work in progress.

Does it sound reasonable from the maintainers' point of view?
Or should I shift priorities somewhere?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-04 14:19           ` Kirill A. Shutemov
@ 2017-01-05 17:53             ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-05 17:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Arnd Bergmann, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API,
	linux-arm-kernel, Catalin Marinas, Will Deacon

On Wed, Jan 4, 2017 at 6:19 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Tue, Jan 03, 2017 at 10:27:22AM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> > And what about the stack? I'm not sure that everybody would be happy with
>> > the stack in the middle of the address space.
>>
>> I would, personally.  I think that, for very large address spaces, we
>> should allocate a large block of stack and get rid of the "stack grows
>> down forever" legacy idea.  Then we would never need to worry about
>> the stack eventually hitting some other allocation.  And 2^57 bytes is
>> hilariously large for a default stack.
>
> The stack in the middle of the address space can prevent creating other huuuge
> contiguous mappings. Databases may want this.

Fair enough.  OTOH, 2^47 is nowhere near the middle if we were to put
it near the top of the legacy address space.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
  2016-12-27  2:06   ` Andy Lutomirski
  2017-01-02  8:44   ` Arnd Bergmann
@ 2017-01-05 19:13   ` Dave Hansen
  2017-01-05 19:29     ` Kirill A. Shutemov
  2 siblings, 1 reply; 72+ messages in thread
From: Dave Hansen @ 2017-01-05 19:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	linux-api

On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.

What happens to existing mappings above the limit when this upper limit
is dropped?

Similarly, what do we do with an application running with something
incompatible with the larger address space that tries to raise the
limit?  Say, legacy MPX.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 19:13   ` Dave Hansen
@ 2017-01-05 19:29     ` Kirill A. Shutemov
  2017-01-05 19:39       ` Dave Hansen
  0 siblings, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-05 19:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel, linux-api

On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> > address available to map by userspace.
> 
> What happens to existing mappings above the limit when this upper limit
> is dropped?

Nothing: we only prevent creating new mappings. Existing ones are not
affected.

The semantics here are the same as with other resource limits.

> Similarly, what do we do with an application running with something
> incompatible with the larger address space that tries to raise the
> limit?  Say, legacy MPX.

It has to know what it is doing. Yes, it can change the limit to the point
where the application is unusable. But you can do the same with other limits.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 19:29     ` Kirill A. Shutemov
@ 2017-01-05 19:39       ` Dave Hansen
  2017-01-05 20:11         ` Kirill A. Shutemov
  2017-01-05 20:14         ` Andy Lutomirski
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Hansen @ 2017-01-05 19:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, x86, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, Andy Lutomirski,
	linux-arch, linux-mm, linux-kernel, linux-api

On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
> On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
>> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>
>> What happens to existing mappings above the limit when this upper limit
>> is dropped?
> 
> Nothing: we only prevent creating new mappings. Existing ones are not
> affected.
> 
> The semantics here are the same as with other resource limits.
> 
>> Similarly, what do we do with an application running with something
>> incompatible with the larger address space that tries to raise the
>> limit?  Say, legacy MPX.
> 
> It has to know what it is doing. Yes, it can change the limit to the point
> where the application is unusable. But you can do the same with other limits.

I'm not sure I'm comfortable with this.  Do other rlimit changes cause
silent data corruption?  I'm pretty sure doing this to MPX would.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 19:39       ` Dave Hansen
@ 2017-01-05 20:11         ` Kirill A. Shutemov
  2017-01-05 20:14         ` Andy Lutomirski
  1 sibling, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-05 20:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Andy Lutomirski, linux-arch, linux-mm, linux-kernel,
	linux-api

On Thu, Jan 05, 2017 at 11:39:16AM -0800, Dave Hansen wrote:
> On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
> > On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
> >> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> >>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >>> address available to map by userspace.
> >>
> >> What happens to existing mappings above the limit when this upper limit
> >> is dropped?
> > 
> > Nothing: we only prevent creating new mappings. Existing ones are not
> > affected.
> > 
> > The semantics here are the same as with other resource limits.
> > 
> >> Similarly, what do we do with an application running with something
> >> incompatible with the larger address space that tries to raise the
> >> limit?  Say, legacy MPX.
> > 
> > It has to know what it is doing. Yes, it can change the limit to the point
> > where the application is unusable. But you can do the same with other limits.
> 
> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> silent data corruption?  I'm pretty sure doing this to MPX would.

Maybe it's too ugly, but MPX can set rlim_max to rlim_cur on enabling.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 19:39       ` Dave Hansen
  2017-01-05 20:11         ` Kirill A. Shutemov
@ 2017-01-05 20:14         ` Andy Lutomirski
  2017-01-05 20:49           ` Dave Hansen
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-05 20:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm, linux-kernel, Linux API

On Thu, Jan 5, 2017 at 11:39 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
>> On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
>>> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
>>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>>> address available to map by userspace.
>>>
>>> What happens to existing mappings above the limit when this upper limit
>>> is dropped?
>>
>> Nothing: we only prevent creating new mappings. Existing ones are not
>> affected.
>>
>> The semantics here are the same as with other resource limits.
>>
>>> Similarly, what do we do with an application running with something
>>> incompatible with the larger address space that tries to raise the
>>> limit?  Say, legacy MPX.
>>
>> It has to know what it is doing. Yes, it can change the limit to the point
>> where the application is unusable. But you can do the same with other limits.
>
> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> silent data corruption?  I'm pretty sure doing this to MPX would.
>

What actually goes wrong in this case?  That is, what combination of
MPX setup and subsequent allocations will cause a problem, and is the
problem worse than just a segfault?  IMO it would be really nice to
keep the messy case confined to MPX.

FWIW, this problem is kind of generic.  If you run code in a process,
MPX or otherwise, that assumes something about pointer values and then
create a pointer that violates its assumptions, you will cause
problems.  For example, some VMs use high bits to store metadata.  If
you feed a pointer that's too big to such code, boom.  This is exactly
why high addresses need to be opt-in.
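
For instance (a made-up runtime's pointer boxing, for illustration only):

	#include <stdint.h>

	#define TAG_SHIFT 48

	/* Stash a 16-bit type tag in the "unused" top bits... */
	static uint64_t box(void *p, uint16_t tag)
	{
		return (uint64_t)(uintptr_t)p | ((uint64_t)tag << TAG_SHIFT);
	}

	/* ...so a 57-bit address from mmap() gets its upper bits
	 * silently chopped off here: instant memory corruption. */
	static void *unbox(uint64_t v)
	{
		return (void *)(uintptr_t)(v & ((1ULL << TAG_SHIFT) - 1));
	}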

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 20:14         ` Andy Lutomirski
@ 2017-01-05 20:49           ` Dave Hansen
  2017-01-05 21:27             ` Andy Lutomirski
  2017-01-11 14:29             ` Kirill A. Shutemov
  0 siblings, 2 replies; 72+ messages in thread
From: Dave Hansen @ 2017-01-05 20:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm, linux-kernel, Linux API

On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>> silent data corruption?  I'm pretty sure doing this to MPX would.
>>
> What actually goes wrong in this case?  That is, what combination of
> MPX setup and subsequent allocations will cause a problem, and is the
> problem worse than just a segfault?  IMO it would be really nice to
> keep the messy case confined to MPX.

The MPX bounds tables are indexed by virtual address.  They need to grow
if the virtual address space grows.   There's an MSR that controls
whether we use the 48-bit or 57-bit layout.  It basically decides
whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
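
As a back-of-the-envelope check of those sizes (the 8-byte entry size and
1MB-of-VA-per-entry coverage are inferred from the 2GB/1TB figures above,
not quoted from the SDM):

	#define BD_ENTRY_BYTES	8UL
	#define BD_ENTRY_SPAN	(1UL << 20)	/* 1MB of VA per entry */

	/* directory size = (VA span / span per entry) * entry size */
	#define BD_BYTES(va_bits) \
		(((1UL << (va_bits)) / BD_ENTRY_SPAN) * BD_ENTRY_BYTES)

	/* BD_BYTES(48) == 1UL << 31 == 2GB
	 * BD_BYTES(57) == 1UL << 40 == 1TB */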

The question is what we do with legacy MPX applications.  We obviously
can't let them just allocate a 2GB table and then go let the hardware
pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
table an address >48-bits.

Ideally, I'd like to make sure that legacy MPX can't be enabled if this
RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
if legacy MPX is active, the RLIMIT can't be raised because all hell
will break loose when the new addresses show up.

Remember, we already have (legacy MPX) binaries in the wild that have no
knowledge of this stuff.  So, we can implicitly have the kernel bump
this rlimit around, but we can't expect userspace to do it, ever.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 20:49           ` Dave Hansen
@ 2017-01-05 21:27             ` Andy Lutomirski
  2017-01-05 23:17               ` Dave Hansen
  2017-01-11 14:29             ` Kirill A. Shutemov
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-05 21:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm, linux-kernel, Linux API

On Thu, Jan 5, 2017 at 12:49 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>>> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>>> silent data corruption?  I'm pretty sure doing this to MPX would.
>>>
>> What actually goes wrong in this case?  That is, what combination of
>> MPX setup of subsequent allocations will cause a problem, and is the
>> problem worse than just a segfault?  IMO it would be really nice to
>> keep the messy case confined to MPX.
>
> The MPX bounds tables are indexed by virtual address.  They need to grow
> if the virtual address space grows.   There's an MSR that controls
> whether we use the 48-bit or 57-bit layout.  It basically decides
> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
>
> The question is what we do with legacy MPX applications.  We obviously
> can't let them just allocate a 2GB table and then go let the hardware
> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> table an address >48-bits.
>
> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
> if legacy MPX is active, the RLIMIT can't be raised because all hell
> will break loose when the new addresses show up.
>
> Remember, we already have (legacy MPX) binaries in the wild that have no
> knowledge of this stuff.  So, we can implicitly have the kernel bump
> this rlimit around, but we can't expect userspace to do it, ever.

If you s/rlimit/prctl, then I think this all makes sense with one
exception.  It would be a bit sad if the personality-setting tool
didn't work if compiled with MPX.

So what if we had a second prctl field that is the value that kicks in
after execve()?

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 21:27             ` Andy Lutomirski
@ 2017-01-05 23:17               ` Dave Hansen
  0 siblings, 0 replies; 72+ messages in thread
From: Dave Hansen @ 2017-01-05 23:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm, linux-kernel, Linux API

On 01/05/2017 01:27 PM, Andy Lutomirski wrote:
> On Thu, Jan 5, 2017 at 12:49 PM, Dave Hansen <dave.hansen@intel.com> wrote:
...
>> Remember, we already have (legacy MPX) binaries in the wild that have no
>> knowledge of this stuff.  So, we can implicitly have the kernel bump
>> this rlimit around, but we can't expect userspace to do it, ever.
> 
> If you s/rlimit/prctl, then I think this all makes sense with one
> exception.  It would be a bit sad if the personality-setting tool
> didn't work if compiled with MPX.

Ahh, because if you have MPX enabled you *can't* sanely switch between
the two modes because you suddenly go from having small bounds tables to
having big ones?

It's not the simplest thing in the world to do, but there's nothing
keeping the personality-setting tool from doing all the work.  It can do:

	new_bd = malloc(1TB);			/* room for the 57-bit directory */
	prctl(MPX_DISABLE_MANAGEMENT);		/* stop kernel-managed tables */
	memcpy(new_bd, old_bd, LEGACY_MPX_BD_SIZE); /* carry over old entries */
	set_bounds_config(new_bd | ENABLE_BIT);	/* pseudocode: rewrite BNDCFGU */
	prctl(WIDER_VADDR_WIDTH);		/* pseudocode: opt in to wide VA */
	prctl(MPX_ENABLE_MANAGEMENT);		/* resume kernel management */

> So what if we had a second prctl field that is the value that kicks in
> after execve()?

Yeah, that's a pretty sane way to do it too.  execve() is a nice chokepoint.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-05 20:49           ` Dave Hansen
  2017-01-05 21:27             ` Andy Lutomirski
@ 2017-01-11 14:29             ` Kirill A. Shutemov
  2017-01-11 18:09               ` Andy Lutomirski
  2017-01-11 18:26               ` Dave Hansen
  1 sibling, 2 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-11 14:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, linux-arch, linux-mm,
	linux-kernel, Linux API

On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> >> silent data corruption?  I'm pretty sure doing this to MPX would.
> >>
> > What actually goes wrong in this case?  That is, what combination of
> > MPX setup and subsequent allocations will cause a problem, and is the
> > problem worse than just a segfault?  IMO it would be really nice to
> > keep the messy case confined to MPX.
> 
> The MPX bounds tables are indexed by virtual address.  They need to grow
> if the virtual address space grows.   There's an MSR that controls
> whether we use the 48-bit or 57-bit layout.  It basically decides
> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
> 
> The question is what we do with legacy MPX applications.  We obviously
> can't let them just allocate a 2GB table and then go let the hardware
> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> table an address >48-bits.
> 
> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
> if legacy MPX is active, the RLIMIT can't be raised because all hell
> will break loose when the new addresses show up.

I think we can do this. See the patch below.

Basically, we refuse to enable MPX and issue a warning in dmesg if there's
anything mapped above 47-bits. Once MPX is enabled, mmap_max_addr() cannot
be higher than 47-bits either.

The function call from mmap_max_addr() is unfortunate, but I don't see a
way around it.

As we add support for MAWA it will get somewhat more complex, but the
general idea should be the same.

Build-tested only.

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 07cc4f27ca41..f97b149145f8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1742,7 +1742,6 @@ config X86_SMAP
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
-	depends on !X86_5LEVEL
 	depends on CPU_SUP_INTEL
 	---help---
 	  MPX provides hardware features that can be used in
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 0b416d4cf73b..ba9005f9bf87 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -56,11 +56,8 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
+int kernel_managing_mpx_tables(struct mm_struct *mm);
 int mpx_handle_bd_fault(void);
-static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
-{
-	return (mm->context.bd_addr != MPX_INVALID_BOUNDS_DIR);
-}
 static inline void mpx_mm_init(struct mm_struct *mm)
 {
 	/*
@@ -80,10 +77,6 @@ static inline int mpx_handle_bd_fault(void)
 {
 	return -EINVAL;
 }
-static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
-{
-	return 0;
-}
 static inline void mpx_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e02917126859..589610a4f099 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -869,6 +869,7 @@ extern int set_tsc_mode(unsigned int val);
 #ifdef CONFIG_X86_INTEL_MPX
 extern int mpx_enable_management(void);
 extern int mpx_disable_management(void);
+extern int kernel_managing_mpx_tables(struct mm_struct *mm);
 #else
 static inline int mpx_enable_management(void)
 {
@@ -878,8 +879,22 @@ static inline int mpx_disable_management(void)
 {
 	return -EINVAL;
 }
+static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+	return 0;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
+#define mmap_max_addr() \
+({									\
+	unsigned long max_addr = min(TASK_SIZE, rlimit(RLIMIT_VADDR));	\
+	/* At the moment, MPX cannot handle addresses above 47-bits */	\
+	if (max_addr > USER_VADDR_LIM &&				\
+			kernel_managing_mpx_tables(current->mm))	\
+ 		max_addr = USER_VADDR_LIM;				\
+ 	max_addr;							\
+})
+
 extern u16 amd_get_nb_id(int cpu);
 extern u32 amd_get_nodes_per_socket(void);
 
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 324e5713d386..04fa386a165a 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -354,10 +354,22 @@ int mpx_enable_management(void)
 	 */
 	bd_base = mpx_get_bounds_dir();
 	down_write(&mm->mmap_sem);
+
+	/*
+	 * MPX doesn't support addresses above 47-bits yet.
+	 * Make sure nothing is mapped there before enabling.
+	 */
+	if (find_vma(mm, 1UL << 47)) {
+		pr_warn("%s (%d): MPX cannot handle addresses above 47-bits. "
+				"Disabling.", current->comm, current->pid);
+		ret = -ENXIO;
+		goto out;
+	}
+
 	mm->context.bd_addr = bd_base;
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
-
+out:
 	up_write(&mm->mmap_sem);
 	return ret;
 }
@@ -516,6 +528,11 @@ static int do_mpx_bt_fault(void)
 	return allocate_bt(mm, (long __user *)bd_entry);
 }
 
+int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+	return (mm->context.bd_addr != MPX_INVALID_BOUNDS_DIR);
+}
+
 int mpx_handle_bd_fault(void)
 {
 	/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f0f23afe0838..d463b800d8ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3661,9 +3661,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
+#ifndef mmap_max_addr
+#define mmap_max_addr mmap_max_addr
 static inline unsigned long mmap_max_addr(void)
 {
 	return min(TASK_SIZE, rlimit(RLIMIT_VADDR));
 }
+#endif
 
 #endif
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 14:29             ` Kirill A. Shutemov
@ 2017-01-11 18:09               ` Andy Lutomirski
  2017-01-11 18:37                 ` Kirill A. Shutemov
  2017-01-11 18:26               ` Dave Hansen
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-11 18:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	H. Peter Anvin, Andi Kleen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Wed, Jan 11, 2017 at 6:29 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
>> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>> >> silent data corruption?  I'm pretty sure doing this to MPX would.
>> >>
>> > What actually goes wrong in this case?  That is, what combination of
>> > MPX setup and subsequent allocations will cause a problem, and is the
>> > problem worse than just a segfault?  IMO it would be really nice to
>> > keep the messy case confined to MPX.
>>
>> The MPX bounds tables are indexed by virtual address.  They need to grow
>> if the virtual address space grows.   There's an MSR that controls
>> whether we use the 48-bit or 57-bit layout.  It basically decides
>> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
>>
>> The question is what we do with legacy MPX applications.  We obviously
>> can't let them just allocate a 2GB table and then go let the hardware
>> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
>> table an address >48-bits.
>>
>> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
>> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
>> if legacy MPX is active, the RLIMIT can't be raised because all hell
>> will break loose when the new addresses show up.
>
> I think we can do this. See the patch below.
>
> Basically, we refuse to enable MPX and issue a warning in dmesg if there's
> anything mapped above 47-bits. Once MPX is enabled, mmap_max_addr() cannot
> be higher than 47-bits either.
>
> The function call from mmap_max_addr() is unfortunate, but I don't see a
> way around it.

How about preventing the max addr from being changed to too high a
value while MPX is on instead of overriding the set value?  This would
have the added benefit that it would prevent silent failures where you
think you've enabled large addresses but MPX is also on and mmap
refuses to return large addresses.
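
Roughly like this, as a sketch (kernel_managing_mpx_tables() is from your
patch; the exact hook point in the rlimit-setting path is assumed):

	static int vaddr_rlimit_check(struct mm_struct *mm,
				      const struct rlimit *new)
	{
		/* Refuse to raise the limit past 47 bits while legacy
		 * MPX manages bounds tables: fail loudly instead of
		 * clamping silently in mmap_max_addr(). */
		if (kernel_managing_mpx_tables(mm) &&
		    new->rlim_cur > (1UL << 47))
			return -EINVAL;
		return 0;
	}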

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 14:29             ` Kirill A. Shutemov
  2017-01-11 18:09               ` Andy Lutomirski
@ 2017-01-11 18:26               ` Dave Hansen
  1 sibling, 0 replies; 72+ messages in thread
From: Dave Hansen @ 2017-01-11 18:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, linux-arch, linux-mm,
	linux-kernel, Linux API

On 01/11/2017 06:29 AM, Kirill A. Shutemov wrote:
> +#define mmap_max_addr() \
> +({									\
> +	unsigned long max_addr = min(TASK_SIZE, rlimit(RLIMIT_VADDR));	\
> +	/* At the moment, MPX cannot handle addresses above 47-bits */	\
> +	if (max_addr > USER_VADDR_LIM &&				\
> +			kernel_managing_mpx_tables(current->mm))	\
> + 		max_addr = USER_VADDR_LIM;				\
> + 	max_addr;							\
> +})

The bad part about this is that it adds code to a relatively fast path,
and the check that it's doing will not change its result for basically
the entire life of the process.

I'd much rather see this checking done at the point that MPX is enabled
and at the point the limit is changed.  Those are both super-rare paths.

>  extern u16 amd_get_nb_id(int cpu);
>  extern u32 amd_get_nodes_per_socket(void);
>  
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index 324e5713d386..04fa386a165a 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -354,10 +354,22 @@ int mpx_enable_management(void)
>  	 */
>  	bd_base = mpx_get_bounds_dir();
>  	down_write(&mm->mmap_sem);
> +
> +	/*
> +	 * MPX doesn't support addresses above 47-bits yet.
> +	 * Make sure nothing is mapped there before enabling.
> +	 */
> +	if (find_vma(mm, 1UL << 47)) {
> +		pr_warn("%s (%d): MPX cannot handle addresses above 47-bits. "
> +				"Disabling.", current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}

I don't think allowing userspace to spam unlimited amounts of messages
into the kernel log is a good idea. :)  But a WARN_ONCE() might not kill
any puppies.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 18:09               ` Andy Lutomirski
@ 2017-01-11 18:37                 ` Kirill A. Shutemov
  2017-01-11 18:49                   ` Dave Hansen
  0 siblings, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-11 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	H. Peter Anvin, Andi Kleen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Wed, Jan 11, 2017 at 10:09:17AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 6:29 AM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> > On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
> >> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
> >> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> >> >> silent data corruption?  I'm pretty sure doing this to MPX would.
> >> >>
> >> > What actually goes wrong in this case?  That is, what combination of
> >> > MPX setup and subsequent allocations will cause a problem, and is the
> >> > problem worse than just a segfault?  IMO it would be really nice to
> >> > keep the messy case confined to MPX.
> >>
> >> The MPX bounds tables are indexed by virtual address.  They need to grow
> >> if the virtual address space grows.   There's an MSR that controls
> >> whether we use the 48-bit or 57-bit layout.  It basically decides
> >> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
> >>
> >> The question is what we do with legacy MPX applications.  We obviously
> >> can't let them just allocate a 2GB table and then go let the hardware
> >> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> >> table an address >48-bits.
> >>
> >> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> >> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
> >> if legacy MPX is active, the RLIMIT can't be raised because all hell
> >> will break loose when the new addresses show up.
> >
> > I think we can do this. See the patch below.
> >
> > Basically, we refuse to enable MPX and issue a warning in dmesg if there's
> > anything mapped above 47-bits. Once MPX is enabled, mmap_max_addr() cannot
> > be higher than 47-bits either.
> >
> > The function call from mmap_max_addr() is unfortunate, but I don't see a
> > way around it.
> 
> How about preventing the max addr from being changed to too high a
> value while MPX is on instead of overriding the set value?  This would
> have the added benefit that it would prevent silent failures where you
> think you've enabled large addresses but MPX is also on and mmap
> refuses to return large addresses.

Setting the rlimit high doesn't mean that you will necessarily get access to
the full address space, even without MPX in the picture. TASK_SIZE limits the
available address space too.

I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
to unlimited doesn't really mean you are not subject to other resource
management.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 18:37                 ` Kirill A. Shutemov
@ 2017-01-11 18:49                   ` Dave Hansen
  2017-01-11 19:20                     ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Dave Hansen @ 2017-01-11 18:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andy Lutomirski
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, X86 ML,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, linux-arch, linux-mm, linux-kernel, Linux API

On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
>> How about preventing the max addr from being changed to too high a
>> value while MPX is on instead of overriding the set value?  This would
>> have the added benefit that it would prevent silent failures where you
>> think you've enabled large addresses but MPX is also on and mmap
>> refuses to return large addresses.
> Setting the rlimit high doesn't mean that you will necessarily get access to
> the full address space, even without MPX in the picture. TASK_SIZE limits the
> available address space too.

OK, sure...  If you want to take another mechanism into account with
respect to MPX, we can do that.  We'd just need to change every
mechanism we want to support to ensure that it can't transition in ways
that break MPX.

What are you arguing here, though?  That since we *might* be limited by
something else, we should not care about controlling the rlimit?

> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
> to unlimited doesn't really mean you are not subject to other resource
> management.

The farther we get into this, the more and more I think using an rlimit
is a horrible idea.  Its semantics aren't a great match, and you seem to
be resistant to making *this* rlimit differ from the others when there's
a real need to do so.  We're already being bitten by "legacy"
rlimit.  IOW, being consistent with *other* rlimit behavior buys us
nothing, only complexity.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 18:49                   ` Dave Hansen
@ 2017-01-11 19:20                     ` Andy Lutomirski
  2017-01-11 19:31                       ` Linus Torvalds
  2017-01-11 19:32                       ` Kirill A. Shutemov
  0 siblings, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2017-01-11 19:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, linux-arch, linux-mm,
	linux-kernel, Linux API

On Wed, Jan 11, 2017 at 10:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
>>> How about preventing the max addr from being changed to too high a
>>> value while MPX is on instead of overriding the set value?  This would
>>> have the added benefit that it would prevent silent failures where you
>>> think you've enabled large addresses but MPX is also on and mmap
>>> refuses to return large addresses.
>> Setting the rlimit high doesn't mean that you will necessarily get access to
>> the full address space, even without MPX in the picture. TASK_SIZE limits the
>> available address space too.
>
> OK, sure...  If you want to take another mechanism into account with
> respect to MPX, we can do that.  We'd just need to change every
> mechanism we want to support to ensure that it can't transition in ways
> that break MPX.
>
> What are you arguing here, though?  That since we *might* be limited by
> something else, we should not care about controlling the rlimit?
>
>> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
>> to unlimited doesn't really mean you are not subject to other resource
>> management.
>
> The farther we get into this, the more and more I think using an rlimit
> is a horrible idea.  Its semantics aren't a great match, and you seem to
> be resistant to making *this* rlimit differ from the others when there's
> a real need to do so.  We're already being bitten by "legacy"
> rlimit.  IOW, being consistent with *other* rlimit behavior buys us
> nothing, only complexity.

Taking a step back, I think it would be fantastic if we could find a
way to make this work without any inheritable settings at all.
Perhaps we could have a per-mm value that is initialized to 2^47-1 on
execve() and can be raised by ELF note or by prctl()?  Getting it
right for 32-bit would require a bit of thought.  The ELF note would
make a high stack possible and, without the ELF note, we'd get a low
stack but high mmap().  Then the messy bits can be glibc's problem and
a toolchain problem as it should be, given that the only reason we
need a limit at all is because of messy userspace code.

Sure, the low stack prevents the *whole* address space from being used
in one big block for databases, but 2^57 - 2^47 ought to be good
enough.

I'm not 100% sure this is workable but, if it is, it makes everyone's
life easier.  There's no need to muck around with setarch(1) or
similar hacks.
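
For illustration, such an ELF note could look something like this (entirely
hypothetical; the section name, note type and field layout are invented):

	#include <stdint.h>

	struct va_width_note {
		uint32_t namesz, descsz, type;
		char	 name[8];	/* "Linux\0" plus padding */
		uint32_t va_bits;	/* requested VA width, e.g. 56 */
	};

	__attribute__((used, section(".note.linux.va-width"), aligned(4)))
	static const struct va_width_note opt_in = {
		.namesz	 = 6,
		.descsz	 = sizeof(uint32_t),
		.type	 = 0x100,	/* made-up NT_* value */
		.name	 = "Linux",
		.va_bits = 56,
	};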

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 19:20                     ` Andy Lutomirski
@ 2017-01-11 19:31                       ` Linus Torvalds
  2017-01-11 21:46                         ` Andi Kleen
  2017-01-11 19:32                       ` Kirill A. Shutemov
  1 sibling, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2017-01-11 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kirill A. Shutemov, Kirill A. Shutemov,
	Andrew Morton, X86 ML, Thomas Gleixner, Ingo Molnar,
	Arnd Bergmann, H. Peter Anvin, Andi Kleen, linux-arch, linux-mm,
	linux-kernel, Linux API

On Wed, Jan 11, 2017 at 11:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Taking a step back, I think it would be fantastic if we could find a
> way to make this work without any inheritable settings at all.
> Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> execve() and can be raised by ELF note or by prctl()?

I definitely think this is the right model. No inheritable settings,
no suid issues, no worries. Make people who want the large address
space (and there aren't going to be a lot of them) just mark their
binaries at compile time.

And as to the stack location: I think it should just be the same
regardless - up in "high" virtual memory in the 47-bit model. Because
as you say, if you actually end up having 57 bits of address space,
that still gives you basically the whole VM for data mappings -
they'll just be up above the stack.

                Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 19:20                     ` Andy Lutomirski
  2017-01-11 19:31                       ` Linus Torvalds
@ 2017-01-11 19:32                       ` Kirill A. Shutemov
  2017-01-11 19:39                         ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-11 19:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	H. Peter Anvin, Andi Kleen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Wed, Jan 11, 2017 at 11:20:38AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 10:49 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
> >>> How about preventing the max addr from being changed to too high a
> >>> value while MPX is on instead of overriding the set value?  This would
> >>> have the added benefit that it would prevent silent failures where you
> >>> think you've enabled large addresses but MPX is also on and mmap
> >>> refuses to return large addresses.
> >> Setting the rlimit high doesn't mean that you will necessarily get
> >> access to the full address space, even without MPX in the picture.
> >> TASK_SIZE limits the available address space too.
> >
> > OK, sure...  If you want to take another mechanism into account with
> > respect to MPX, we can do that.  We'd just need to change every
> > mechanism we want to support to ensure that it can't transition in ways
> > that break MPX.
> >
> > What are you arguing here, though?  That since we *might* be limited by
> > something else, we should not care about controlling the rlimit?
> >
> >> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
> >> to unlimited doesn't really mean you are not subject to other resource
> >> management.
> >
> > The farther we get into this, the more and more I think using an rlimit
> > is a horrible idea.  Its semantics aren't a great match, and you seem to
> > be resistant to making *this* rlimit differ from the others when there's
> > an entirely need to do so.  We're already being bitten by "legacy"
> > rlimit.  IOW, being consistent with *other* rlimit behavior buys us
> > nothing, only complexity.
> 
> Taking a step back, I think it would be fantastic if we could find a
> way to make this work without any inheritable settings at all.
> Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> execve() and can be raised by ELF note or by prctl()?

One thing that inheritance gives us is the ability to change the
available address space from outside the binary. Neither an ELF note nor
prctl() really works here.

Running a legacy binary with the full address space is a valuable option.
So is limiting the address space for a binary with an ELF note or prctl()
in case of breakage in the field.

Sure, we can use personality(2) or invent another interface for this. But
to me rlimit covers both the normal and the emergency use-cases relatively
well.
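
(As a concrete sketch of that: a tiny launcher that bumps the proposed
limit before exec'ing a legacy binary.  RLIMIT_VADDR is from this
patchset, not from any released header, and the fallback number below is
only a guess at what the patch assigns.)

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

#ifndef RLIMIT_VADDR
#define RLIMIT_VADDR	16	/* assumed value from the proposed patch */
#endif

/* Usage: wide-exec /path/to/legacy-binary [args...]
 * The raised soft limit is inherited across execve(). */
int main(int argc, char **argv)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	if (argc < 2)
		return 1;
	if (setrlimit(RLIMIT_VADDR, &r))
		perror("setrlimit");
	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}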

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 19:32                       ` Kirill A. Shutemov
@ 2017-01-11 19:39                         ` Linus Torvalds
  0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2017-01-11 19:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Dave Hansen, Kirill A. Shutemov, Andrew Morton,
	X86 ML, Thomas Gleixner, Ingo Molnar, Arnd Bergmann,
	H. Peter Anvin, Andi Kleen, linux-arch, linux-mm, linux-kernel,
	Linux API

On Wed, Jan 11, 2017 at 11:32 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> Running a legacy binary with the full address space is a valuable option.

I disagree.

It's simply not valuable enough to worry about. Especially when there
is a fairly trivial wrapper approach: just make a full-address-space
wrapper that acts as a binary loader (think "specialized ld.so").

Sure, the wrapper may be "fairly trivial" but not necessarily
pleasant: you have to parse ELF sections etc. and basically load the
binary by hand. But there are libraries for that, and loading an ELF
executable isn't rocket surgery, it's just possibly tedious.
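
(Here is just the easy first step, for flavor -- reading the headers
with <elf.h>.  The tedious part a real wrapper would still owe you is
mmap()ing each PT_LOAD segment and handing control to e_entry, or to
the PT_INTERP dynamic loader.)

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	Elf64_Ehdr eh;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || read(fd, &eh, sizeof(eh)) != sizeof(eh))
		return 1;
	if (memcmp(eh.e_ident, ELFMAG, SELFMAG))
		return 1;	/* not an ELF file */
	printf("entry %#llx, %u program headers at offset %llu\n",
	       (unsigned long long)eh.e_entry, eh.e_phnum,
	       (unsigned long long)eh.e_phoff);
	close(fd);
	return 0;
}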

            Linus

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-11 19:31                       ` Linus Torvalds
@ 2017-01-11 21:46                         ` Andi Kleen
  0 siblings, 0 replies; 72+ messages in thread
From: Andi Kleen @ 2017-01-11 21:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dave Hansen, Kirill A. Shutemov,
	Kirill A. Shutemov, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, linux-arch, linux-mm,
	linux-kernel, Linux API

On Wed, Jan 11, 2017 at 11:31:25AM -0800, Linus Torvalds wrote:
> On Wed, Jan 11, 2017 at 11:20 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Taking a step back, I think it would be fantastic if we could find a
> > way to make this work without any inheritable settings at all.
> > Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> > execve() and can be raised by ELF note or by prctl()?
> 
> I definitely think this is the right model. No inheritable settings,
> no suid issues, no worries. Make people who want the large address
> space (and there aren't going to be a lot of them) just mark their
> binaries at compile time.

Compile time is inconvenient if you want to test whether some existing
random binary works.

Some time ago, for another project, I tried to write a tool that
patched ELF notes into binaries, but it ran into difficulties and
didn't work everywhere.

An inheritance scheme is much nicer for such use cases.

-Andi

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR
  2017-01-02  8:35           ` Kirill A. Shutemov
@ 2017-01-13 20:11             ` H.J. Lu
  0 siblings, 0 replies; 72+ messages in thread
From: H.J. Lu @ 2017-01-13 20:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Carlos O'Donell, Kirill A. Shutemov,
	Linus Torvalds, Andrew Morton, X86 ML, Thomas Gleixner,
	Ingo Molnar, Arnd Bergmann, H. Peter Anvin, Andi Kleen,
	Dave Hansen, linux-arch, linux-mm, linux-kernel, Linux API

On Mon, Jan 2, 2017 at 12:35 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Fri, Dec 30, 2016 at 06:08:27PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell <carlos@redhat.com> wrote:
>> > On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
>> >> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> >>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>> >>> <kirill.shutemov@linux.intel.com> wrote:
>> >>>> This patch introduces a new rlimit resource to manage the maximum
>> >>>> virtual address available for userspace to map.
>> >>>>
>> >>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>> >>>> Not all user space is ready to handle wide addresses. It's known that
>> >>>> at least some JIT compilers use high bits in pointers to encode their
>> >>>> own information. That collides with valid pointers under 5-level
>> >>>> paging and leads to crashes.
>> >>>>
>> >>>> The patch aims to address this compatibility issue.
>> >>>>
>> >>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as the upper limit of the
>> >>>> virtual address space available for userspace to map.
>> >>>>
>> >>>> The default hard limit will be RLIM_INFINITY, which basically means
>> >>>> that TASK_SIZE limits the available address space.
>> >>>>
>> >>>> The soft limit will also be RLIM_INFINITY everywhere but on machines
>> >>>> with 5-level paging enabled. There, the soft limit would be
>> >>>> (1UL << 47) - PAGE_SIZE. That's the current x86-64 TASK_SIZE_MAX with
>> >>>> 4-level paging, which is known to be safe.
>> >>>>
>> >>>> The new rlimit resource would follow the usual semantics with regard
>> >>>> to inheritance: preserved across fork(2) and exec(2). This has the
>> >>>> potential to break applications if limits are set too wide or too
>> >>>> narrow, but that is not uncommon for other resources (consider
>> >>>> RLIMIT_DATA or RLIMIT_AS).
>> >>>>
>> >>>> As with other resources, you can set the limit lower than the current
>> >>>> usage. It would only affect future virtual address space allocations.
>> >>>>
>> >>>> Use-cases for the new rlimit:
>> >>>>
>> >>>>   - Bumping the soft limit to RLIM_INFINITY allows the current
>> >>>>     process and all its children to use addresses above 47 bits.
>> >>>>
>> >>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>> >>>>     exec(2), allows the child to use addresses above 47 bits.
>> >>>>
>> >>>>   - Lowering the hard limit to 47 bits would prevent the current
>> >>>>     process and all its children from using addresses above 47 bits,
>> >>>>     unless a process has CAP_SYS_RESOURCE.
>> >>>>
>> >>>>   - It can also be handy to lower the hard or soft limit to an
>> >>>>     arbitrary address. User-mode emulation in QEMU may lower the
>> >>>>     limit to 32 bits to emulate a 32-bit machine on a 64-bit host.
>> >>>
>> >>> I tend to think that this should be a personality or an ELF flag, not
>> >>> an rlimit.
>> >>
>> >> My plan was to implement an ELF flag on top. Basically, the ELF flag
>> >> would mean that we bump the soft limit to the hard limit on exec.
>> >
>> > Could you clarify what you mean by an "ELF flag?"
>>
>> Some way to mark a binary as supporting a larger address space.  I
>> don't have a precise solution in mind, but an ELF note might be a good
>> way to go here.
>
> + H.J.
>
> There's a discussion of the "Program Properties" proposal[1]. It seems
> to fit the purpose.
>
> [1] https://sourceware.org/ml/gnu-gabi/2016-q4/msg00000.html
>
> --
>  Kirill A. Shutemov

There is another proposal:

https://fedoraproject.org/wiki/Toolchain/Watermark#Markup_for_ELF_objects

which covers much more than mine.

-- 
H.J.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv2 02/29] asm-generic: introduce 5level-fixup.h
  2016-12-27  1:53 ` [PATCHv2 02/29] asm-generic: introduce 5level-fixup.h Kirill A. Shutemov
@ 2017-01-27 11:06   ` Vlastimil Babka
  2017-01-27 11:30     ` Kirill A. Shutemov
  0 siblings, 1 reply; 72+ messages in thread
From: Vlastimil Babka @ 2017-01-27 11:06 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 12/27/2016 02:53 AM, Kirill A. Shutemov wrote:
> We are going to switch core MM to a 5-level paging abstraction.
>
> This is a preparation step which adds <asm-generic/5level-fixup.h>.
> As with 4level-fixup.h, the new header allows us to quickly make all
> architectures compatible with 5-level paging in core MM.
>
> In the long run we would like to switch architectures to a properly
> folded p4d level by using <asm-generic/pgtable-nop4d.h>, but that
> requires more changes to arch-specific code.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/asm-generic/4level-fixup.h |  3 ++-
>  include/asm-generic/5level-fixup.h | 41 ++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h                 |  3 +++
>  3 files changed, 46 insertions(+), 1 deletion(-)
>  create mode 100644 include/asm-generic/5level-fixup.h
>
> diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
> index 5bdab6bffd23..928fd66b1271 100644
> --- a/include/asm-generic/4level-fixup.h
> +++ b/include/asm-generic/4level-fixup.h
> @@ -15,7 +15,6 @@
>  	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
>   		NULL: pmd_offset(pud, address))
>
> -#define pud_alloc(mm, pgd, address)	(pgd)

This...

>  #define pud_offset(pgd, start)		(pgd)
>  #define pud_none(pud)			0
>  #define pud_bad(pud)			0
> @@ -35,4 +34,6 @@
>  #undef  pud_addr_end
>  #define pud_addr_end(addr, end)		(end)
>
> +#include <asm-generic/5level-fixup.h>

... plus this...

> +
>  #endif
> diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
> new file mode 100644
> index 000000000000..b5ca82dc4175
> --- /dev/null
> +++ b/include/asm-generic/5level-fixup.h
> @@ -0,0 +1,41 @@
> +#ifndef _5LEVEL_FIXUP_H
> +#define _5LEVEL_FIXUP_H
> +
> +#define __ARCH_HAS_5LEVEL_HACK
> +#define __PAGETABLE_P4D_FOLDED
> +
> +#define P4D_SHIFT			PGDIR_SHIFT
> +#define P4D_SIZE			PGDIR_SIZE
> +#define P4D_MASK			PGDIR_MASK
> +#define PTRS_PER_P4D			1
> +
> +#define p4d_t				pgd_t
> +
> +#define pud_alloc(mm, p4d, address) \
> +	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
> +		NULL : pud_offset(p4d, address))

... and this, makes me wonder if that broke pud_alloc() for architectures
that use 4level-fixup.h. Don't those need to keep pud_alloc() defined as
(pgd)?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv2 02/29] asm-generic: introduce 5level-fixup.h
  2017-01-27 11:06   ` Vlastimil Babka
@ 2017-01-27 11:30     ` Kirill A. Shutemov
  0 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-27 11:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On Fri, Jan 27, 2017 at 12:06:02PM +0100, Vlastimil Babka wrote:
> On 12/27/2016 02:53 AM, Kirill A. Shutemov wrote:
> >We are going to switch core MM to a 5-level paging abstraction.
> >
> >This is a preparation step which adds <asm-generic/5level-fixup.h>.
> >As with 4level-fixup.h, the new header allows us to quickly make all
> >architectures compatible with 5-level paging in core MM.
> >
> >In the long run we would like to switch architectures to a properly
> >folded p4d level by using <asm-generic/pgtable-nop4d.h>, but that
> >requires more changes to arch-specific code.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> > include/asm-generic/4level-fixup.h |  3 ++-
> > include/asm-generic/5level-fixup.h | 41 ++++++++++++++++++++++++++++++++++++++
> > include/linux/mm.h                 |  3 +++
> > 3 files changed, 46 insertions(+), 1 deletion(-)
> > create mode 100644 include/asm-generic/5level-fixup.h
> >
> >diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
> >index 5bdab6bffd23..928fd66b1271 100644
> >--- a/include/asm-generic/4level-fixup.h
> >+++ b/include/asm-generic/4level-fixup.h
> >@@ -15,7 +15,6 @@
> > 	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
> >  		NULL: pmd_offset(pud, address))
> >
> >-#define pud_alloc(mm, pgd, address)	(pgd)
> 
> This...
> 
> > #define pud_offset(pgd, start)		(pgd)
> > #define pud_none(pud)			0
> > #define pud_bad(pud)			0
> >@@ -35,4 +34,6 @@
> > #undef  pud_addr_end
> > #define pud_addr_end(addr, end)		(end)
> >
> >+#include <asm-generic/5level-fixup.h>
> 
> ... plus this...
> 
> >+
> > #endif
> >diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
> >new file mode 100644
> >index 000000000000..b5ca82dc4175
> >--- /dev/null
> >+++ b/include/asm-generic/5level-fixup.h
> >@@ -0,0 +1,41 @@
> >+#ifndef _5LEVEL_FIXUP_H
> >+#define _5LEVEL_FIXUP_H
> >+
> >+#define __ARCH_HAS_5LEVEL_HACK
> >+#define __PAGETABLE_P4D_FOLDED
> >+
> >+#define P4D_SHIFT			PGDIR_SHIFT
> >+#define P4D_SIZE			PGDIR_SIZE
> >+#define P4D_MASK			PGDIR_MASK
> >+#define PTRS_PER_P4D			1
> >+
> >+#define p4d_t				pgd_t
> >+
> >+#define pud_alloc(mm, p4d, address) \
> >+	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
> >+		NULL : pud_offset(p4d, address))
> 
> ... and this, makes me wonder if that broke pud_alloc() for architectures
> that use 4level-fixup.h. Don't those need to keep pud_alloc() defined as
> (pgd)?

Okay, it's very hacky, but it works:

In the 4level-fixup.h case we have __PAGETABLE_PUD_FOLDED set, so
__pud_alloc() will always succeed (see <linux/mm.h>), and pud_offset()
from 4level-fixup.h always returns the pgd.

I've tested this on alpha with qemu.
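
(Spelling the chain out: with __PAGETABLE_PUD_FOLDED defined, v4.10's
<linux/mm.h> stubs the allocation out, so the ternary in the new macro
can never take the NULL branch.)

static inline int __pud_alloc(struct mm_struct *mm, pgd_t *pgd,
			      unsigned long address)
{
	return 0;	/* folded pud: nothing to allocate, never fails */
}

/*
 * So on a 4level-fixup.h architecture,
 *
 *   pud_alloc(mm, p4d, address)
 *     -> (pgd_none(*p4d) && __pud_alloc(...)) ? NULL
 *                                             : pud_offset(p4d, address)
 *
 * always falls through to pud_offset(), which 4level-fixup.h still
 * #defines as (pgd) -- handing back the same pointer, exactly like the
 * removed  #define pud_alloc(mm, pgd, address) (pgd)  did.
 */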

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv2 03/29] asm-generic: introduce __ARCH_USE_5LEVEL_HACK
  2016-12-27  1:53 ` [PATCHv2 03/29] asm-generic: introduce __ARCH_USE_5LEVEL_HACK Kirill A. Shutemov
@ 2017-01-27 13:24   ` Vlastimil Babka
  2017-01-27 13:55     ` Kirill A. Shutemov
  0 siblings, 1 reply; 72+ messages in thread
From: Vlastimil Babka @ 2017-01-27 13:24 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin
  Cc: Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On 12/27/2016 02:53 AM, Kirill A. Shutemov wrote:
> We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
> abstraction for a properly folded p4d level (as opposed to the
> 5level-fixup.h hack). The new header will be included from
> pgtable-nopud.h.
>
> If an architecture uses <asm-generic/nop*d.h>, we cannot use
> 5level-fixup.h directly to quickly convert the architecture to 5-level
> paging, as it would conflict with pgtable-nop4d.h.
>
> With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
> including <asm-generic/nop*d.h> to get 5level-fixup.h.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/asm-generic/pgtable-nop4d-hack.h | 62 ++++++++++++++++++++++++++++++++

At the risk of bikeshedding, and coming from somebody not familiar with this
code... IMHO it would be somewhat more intuitive and consistent to name the
file "pgtable-nopud-hack.h", as it's about the pud stuff, not the p4d stuff,
and it acts as an alternative implementation of pgtable-nopud.h, not of
pgtable-nop4d.h.

Thanks,
Vlastimil

>  include/asm-generic/pgtable-nopud.h      |  5 +++
>  2 files changed, 67 insertions(+)
>  create mode 100644 include/asm-generic/pgtable-nop4d-hack.h
>
> diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
> new file mode 100644
> index 000000000000..752fb7511750
> --- /dev/null
> +++ b/include/asm-generic/pgtable-nop4d-hack.h
> @@ -0,0 +1,62 @@
> +#ifndef _PGTABLE_NOP4D_HACK_H
> +#define _PGTABLE_NOP4D_HACK_H
> +
> +#ifndef __ASSEMBLY__
> +#include <asm-generic/5level-fixup.h>
> +
> +#define __PAGETABLE_PUD_FOLDED
> +
> +/*
> + * Having the pud type consist of a pgd gets the size right, and allows
> + * us to conceptually access the pgd entry that this pud is folded into
> + * without casting.
> + */
> +typedef struct { pgd_t pgd; } pud_t;
> +
> +#define PUD_SHIFT	PGDIR_SHIFT
> +#define PTRS_PER_PUD	1
> +#define PUD_SIZE	(1UL << PUD_SHIFT)
> +#define PUD_MASK	(~(PUD_SIZE-1))
> +
> +/*
> + * The "pgd_xxx()" functions here are trivial for a folded two-level
> + * setup: the pud is never bad, and a pud always exists (as it's folded
> + * into the pgd entry)
> + */
> +static inline int pgd_none(pgd_t pgd)		{ return 0; }
> +static inline int pgd_bad(pgd_t pgd)		{ return 0; }
> +static inline int pgd_present(pgd_t pgd)	{ return 1; }
> +static inline void pgd_clear(pgd_t *pgd)	{ }
> +#define pud_ERROR(pud)				(pgd_ERROR((pud).pgd))
> +
> +#define pgd_populate(mm, pgd, pud)		do { } while (0)
> +/*
> + * (puds are folded into pgds so this doesn't get actually called,
> + * but the define is needed for a generic inline function.)
> + */
> +#define set_pgd(pgdptr, pgdval)	set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
> +
> +static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
> +{
> +	return (pud_t *)pgd;
> +}
> +
> +#define pud_val(x)				(pgd_val((x).pgd))
> +#define __pud(x)				((pud_t) { __pgd(x) })
> +
> +#define pgd_page(pgd)				(pud_page((pud_t){ pgd }))
> +#define pgd_page_vaddr(pgd)			(pud_page_vaddr((pud_t){ pgd }))
> +
> +/*
> + * allocating and freeing a pud is trivial: the 1-entry pud is
> + * inside the pgd, so has no extra memory associated with it.
> + */
> +#define pud_alloc_one(mm, address)		NULL
> +#define pud_free(mm, x)				do { } while (0)
> +#define __pud_free_tlb(tlb, x, a)		do { } while (0)
> +
> +#undef  pud_addr_end
> +#define pud_addr_end(addr, end)			(end)
> +
> +#endif /* __ASSEMBLY__ */
> +#endif /* _PGTABLE_NOP4D_HACK_H */
> diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
> index 810431d8351b..5e49430a30a4 100644
> --- a/include/asm-generic/pgtable-nopud.h
> +++ b/include/asm-generic/pgtable-nopud.h
> @@ -3,6 +3,10 @@
>
>  #ifndef __ASSEMBLY__
>
> +#ifdef __ARCH_USE_5LEVEL_HACK
> +#include <asm-generic/pgtable-nop4d-hack.h>
> +#else
> +
>  #define __PAGETABLE_PUD_FOLDED
>
>  /*
> @@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
>  #define pud_addr_end(addr, end)			(end)
>
>  #endif /* __ASSEMBLY__ */
> +#endif /* !__ARCH_USE_5LEVEL_HACK */
>  #endif /* _PGTABLE_NOPUD_H */
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCHv2 03/29] asm-generic: introduce __ARCH_USE_5LEVEL_HACK
  2017-01-27 13:24   ` Vlastimil Babka
@ 2017-01-27 13:55     ` Kirill A. Shutemov
  0 siblings, 0 replies; 72+ messages in thread
From: Kirill A. Shutemov @ 2017-01-27 13:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, x86,
	Thomas Gleixner, Ingo Molnar, Arnd Bergmann, H. Peter Anvin,
	Andi Kleen, Dave Hansen, Andy Lutomirski, linux-arch, linux-mm,
	linux-kernel

On Fri, Jan 27, 2017 at 02:24:58PM +0100, Vlastimil Babka wrote:
> On 12/27/2016 02:53 AM, Kirill A. Shutemov wrote:
> >We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
> >abstraction for a properly folded p4d level (as opposed to the
> >5level-fixup.h hack). The new header will be included from
> >pgtable-nopud.h.
> >
> >If an architecture uses <asm-generic/nop*d.h>, we cannot use
> >5level-fixup.h directly to quickly convert the architecture to 5-level
> >paging, as it would conflict with pgtable-nop4d.h.
> >
> >With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
> >including <asm-generic/nop*d.h> to get 5level-fixup.h.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> > include/asm-generic/pgtable-nop4d-hack.h | 62 ++++++++++++++++++++++++++++++++
> 
> At the risk of bikeshedding, and coming from somebody not familiar with
> this code... IMHO it would be somewhat more intuitive and consistent to
> name the file "pgtable-nopud-hack.h", as it's about the pud stuff, not the
> p4d stuff, and it acts as an alternative implementation of
> pgtable-nopud.h, not of pgtable-nop4d.h.

Well, on the other hand, we hack in a p4d level here...

I don't really care. Either way works for me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2017-01-27 14:02 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-27  1:53 [PATCHv2 00/29] 5-level paging Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 01/29] x86/cpufeature: Add 5-level paging detecton Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 02/29] asm-generic: introduce 5level-fixup.h Kirill A. Shutemov
2017-01-27 11:06   ` Vlastimil Babka
2017-01-27 11:30     ` Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 03/29] asm-generic: introduce __ARCH_USE_5LEVEL_HACK Kirill A. Shutemov
2017-01-27 13:24   ` Vlastimil Babka
2017-01-27 13:55     ` Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 04/29] arch, mm: convert all architectures to use 5level-fixup.h Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 05/29] asm-generic: introduce <asm-generic/pgtable-nop4d.h> Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 06/29] mm: convert generic code to 5-level paging Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 07/29] mm: introduce __p4d_alloc() Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 08/29] x86: basic changes into headers for 5-level paging Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 09/29] x86: trivial portion of 5-level paging conversion Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 10/29] x86/gup: add 5-level paging support Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 11/29] x86/ident_map: " Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 12/29] x86/mm: add support of p4d_t in vmalloc_fault() Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 13/29] x86/power: support p4d_t in hibernate code Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 14/29] x86/kexec: support p4d_t Kirill A. Shutemov
2016-12-27  1:53 ` [PATCHv2 15/29] x86: convert the rest of the code to " Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 16/29] x86: detect 5-level paging support Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 17/29] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 18/29] x86/mm: define virtual memory map for 5-level paging Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 19/29] x86/paravirt: make paravirt code support " Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 20/29] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 21/29] x86/dump_pagetables: support 5-level paging Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 22/29] x86/mm: extend kasan to " Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 23/29] x86/espfix: " Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 24/29] x86/mm: add support of additional page table level during early boot Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 25/29] x86/mm: add sync_global_pgds() for configuration with 5-level paging Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 26/29] x86/mm: make kernel_physical_mapping_init() support " Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 27/29] x86/mm: add support for 5-level paging for KASLR Kirill A. Shutemov
2016-12-27  1:54 ` [PATCHv2 28/29] x86: enable 5-level paging support Kirill A. Shutemov
2016-12-27  1:54 ` [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR Kirill A. Shutemov
2016-12-27  2:06   ` Andy Lutomirski
2016-12-27  2:24     ` Kirill A. Shutemov
2016-12-27  3:22       ` Andy Lutomirski
2017-01-02  9:09         ` Kirill A. Shutemov
2016-12-29  2:53       ` Carlos O'Donell
2016-12-31  2:08         ` Andy Lutomirski
2017-01-02  8:35           ` Kirill A. Shutemov
2017-01-13 20:11             ` H.J. Lu
2017-01-02  8:44   ` Arnd Bergmann
2017-01-03  6:08     ` Andy Lutomirski
2017-01-03 13:18       ` Arnd Bergmann
2017-01-03 18:29         ` Andy Lutomirski
2017-01-03 22:07           ` Arnd Bergmann
2017-01-03 22:09             ` Andy Lutomirski
2017-01-04 13:55               ` Arnd Bergmann
2017-01-03 16:04       ` Kirill A. Shutemov
2017-01-03 18:27         ` Andy Lutomirski
2017-01-04 14:19           ` Kirill A. Shutemov
2017-01-05 17:53             ` Andy Lutomirski
2017-01-05 19:13   ` Dave Hansen
2017-01-05 19:29     ` Kirill A. Shutemov
2017-01-05 19:39       ` Dave Hansen
2017-01-05 20:11         ` Kirill A. Shutemov
2017-01-05 20:14         ` Andy Lutomirski
2017-01-05 20:49           ` Dave Hansen
2017-01-05 21:27             ` Andy Lutomirski
2017-01-05 23:17               ` Dave Hansen
2017-01-11 14:29             ` Kirill A. Shutemov
2017-01-11 18:09               ` Andy Lutomirski
2017-01-11 18:37                 ` Kirill A. Shutemov
2017-01-11 18:49                   ` Dave Hansen
2017-01-11 19:20                     ` Andy Lutomirski
2017-01-11 19:31                       ` Linus Torvalds
2017-01-11 21:46                         ` Andi Kleen
2017-01-11 19:32                       ` Kirill A. Shutemov
2017-01-11 19:39                         ` Linus Torvalds
2017-01-11 18:26               ` Dave Hansen
2017-01-05 16:57 ` [PATCHv2 00/29] 5-level paging Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).